Merge branch '0.3.72'
1
.gitignore
vendored
@@ -206,5 +206,4 @@ git_issues.py
|
||||
git_issues.md
|
||||
|
||||
.tests/
|
||||
|
||||
.issues/
|
||||
207
CHANGELOG.md
@@ -1,5 +1,212 @@
|
||||
# Changelog
|
||||
|
||||
## [v0.3.73] - 2024-10-24
|
||||
|
||||
### Added
|
||||
- Smart overlay removal system in AsyncPlaywrightCrawlerStrategy:
|
||||
- Automatic removal of popups, modals, and cookie notices
|
||||
- Detection and removal of fixed/sticky position elements
|
||||
- Cleaning of empty block elements
|
||||
- Configurable via `remove_overlay_elements` parameter
|
||||
- Enhanced screenshot capabilities:
|
||||
- Added `screenshot_wait_for` parameter to control timing
|
||||
- Improved screenshot handling with existing page context
|
||||
- Better error handling with fallback error images
|
||||
- New URL normalization utilities:
|
||||
- `normalize_url` function for consistent URL formatting
|
||||
- `is_external_url` function for better link classification
|
||||
- Custom base directory support for cache storage:
|
||||
- New `base_directory` parameter in AsyncWebCrawler
|
||||
- Allows specifying alternative locations for `.crawl4ai` folder
|
||||
|
||||
### Enhanced
|
||||
- Link handling improvements:
|
||||
- Better duplicate link detection
|
||||
- Enhanced internal/external link classification
|
||||
- Improved handling of special URL protocols
|
||||
- Support for anchor links and protocol-relative URLs
|
||||
- Configuration refinements:
|
||||
- Streamlined social media domain list
|
||||
- More focused external content filtering
|
||||
- LLM extraction strategy:
|
||||
- Added support for separate API base URL via `api_base` parameter
|
||||
- Better handling of base URLs in configuration
|
||||
|
||||
### Fixed
|
||||
- Screenshot functionality:
|
||||
- Resolved issues with screenshot timing and context
|
||||
- Improved error handling and recovery
|
||||
- Link processing:
|
||||
- Fixed URL normalization edge cases
|
||||
- Better handling of invalid URLs
|
||||
- Improved error messages for link processing failures
|
||||
|
||||
### Developer Notes
|
||||
- The overlay removal system uses advanced JavaScript injection for better compatibility
|
||||
- URL normalization handles special cases like mailto:, tel:, and protocol-relative URLs
|
||||
- Screenshot system now reuses existing page context for better performance
|
||||
- Link processing maintains separate dictionaries for internal and external links to ensure uniqueness
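As a rough illustration of the URL handling described in the notes above, the sketch below resolves relative, anchor, and protocol-relative links against a base URL, passes special schemes such as `mailto:` and `tel:` through unchanged, and classifies links by comparing hosts. The names `normalize_url_sketch` and `is_external_url_sketch` are hypothetical and use only the standard library; the real `normalize_url` and `is_external_url` utilities in crawl4ai may behave differently.

```python
from urllib.parse import urljoin, urlparse

# Schemes passed through unchanged (an assumption made for this sketch).
SPECIAL_SCHEMES = ("mailto:", "tel:", "javascript:", "data:")


def normalize_url_sketch(href: str, base_url: str) -> str:
    """Resolve href against base_url. Hypothetical stand-in, not the library function."""
    href = href.strip()
    if href.lower().startswith(SPECIAL_SCHEMES):
        return href
    # urljoin handles relative paths, anchors, and protocol-relative URLs (//host/path).
    return urljoin(base_url, href)


def is_external_url_sketch(url: str, base_url: str) -> bool:
    """Treat a link as external when it names a host that differs from the base host."""
    host = urlparse(url).netloc.lower()
    base_host = urlparse(base_url).netloc.lower()
    return bool(host) and host != base_host
```

Keying a dictionary on the normalized URL is one simple way to get the per-category uniqueness that the last note mentions.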
|
||||
|
||||
## [v0.3.72] - 2024-10-22
|
||||
|
||||
### Added
|
||||
- New `ContentCleaningStrategy` class:
|
||||
- Smart content extraction based on text density and element scoring
|
||||
- Automatic removal of boilerplate content
|
||||
- DOM tree analysis for better content identification
|
||||
- Configurable thresholds for content detection
|
||||
- Advanced proxy support:
|
||||
- Added `proxy_config` option for authenticated proxy connections
|
||||
- Support for username/password in proxy configuration
|
||||
- New content output formats:
|
||||
- `fit_markdown`: Optimized markdown output with main content focus
|
||||
- `fit_html`: Clean HTML with only essential content
|
||||
|
||||
### Enhanced
|
||||
- Image source detection:
|
||||
- Support for multiple image source attributes (`src`, `data-src`, `srcset`, etc.)
|
||||
- Automatic fallback through potential source attributes
|
||||
- Smart handling of srcset attribute
|
||||
- External content handling:
|
||||
- Made external link exclusion optional (disabled by default)
|
||||
- Improved detection and handling of social media links
|
||||
- Better control over external image filtering
|
||||
|
||||
### Fixed
|
||||
- Image extraction reliability with multiple source attribute checks
|
||||
- External link and image handling logic for better accuracy
|
||||
|
||||
### Developer Notes
|
||||
- The new `ContentCleaningStrategy` uses configurable thresholds for customization
|
||||
- Proxy configuration now supports more complex authentication scenarios
|
||||
- Content extraction process now provides both regular and optimized outputs
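The image source fallback and `srcset` handling listed under "Enhanced" can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the helper names, the attribute order in `FALLBACK_ATTRS`, and the "largest descriptor wins" rule are all illustrative choices, not the logic that crawl4ai actually ships.

```python
from typing import Dict, Optional

# Attribute order is an assumption for this sketch, not the library's list.
FALLBACK_ATTRS = ("src", "data-src", "data-lazy-src", "data-original")


def best_from_srcset(srcset: str) -> Optional[str]:
    """Return the candidate URL with the largest width (w) or density (x) descriptor."""
    best_url, best_size = None, -1.0
    # Naive comma split: URLs that themselves contain commas are not handled here.
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        size = 0.0
        if len(parts) > 1 and parts[1][-1:] in ("w", "x"):
            try:
                size = float(parts[1][:-1])
            except ValueError:
                size = 0.0
        if size > best_size:
            best_url, best_size = parts[0], size
    return best_url


def pick_image_source(attrs: Dict[str, str]) -> Optional[str]:
    """Try each plain source attribute in turn, then fall back to srcset."""
    for name in FALLBACK_ATTRS:
        value = attrs.get(name, "").strip()
        if value:
            return value
    srcset = attrs.get("srcset", "")
    return best_from_srcset(srcset) if srcset else None
```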
|
||||
|
||||
## [v0.3.72] - 2024-10-20
|
||||
|
||||
### Fixed
|
||||
- Added support for parsing Base64 encoded images in WebScrappingStrategy
|
||||
|
||||
### Added
|
||||
- Forked and integrated a customized version of the html2text library for more control over Markdown generation
|
||||
- New configuration options for controlling external content:
|
||||
- Ability to exclude all external links
|
||||
- Option to specify domains to exclude (default includes major social media platforms)
|
||||
- Control over excluding external images
|
||||
|
||||
### Changed
|
||||
- Improved Markdown generation process:
|
||||
- Added fine-grained control over character escaping in Markdown output
|
||||
- Enhanced handling of code blocks and pre-formatted text
|
||||
- Updated `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500)
|
||||
- Enhanced flexibility in `CosineStrategy` with a more generic `load_HF_embedding_model` function
|
||||
|
||||
### Improved
|
||||
- Optimized content scraping and processing for better efficiency
|
||||
- Enhanced error handling and logging in various components
|
||||
|
||||
### Developer Notes
|
||||
- The customized html2text library is now located within the crawl4ai package
|
||||
- New configuration options are available in the `config.py` file for external content handling
|
||||
- The `WebScrappingStrategy` class has been updated to accommodate new external content exclusion options
|
||||
|
||||
## [v0.3.71] - 2024-10-19
|
||||
|
||||
### Added
|
||||
- New chunking strategies:
|
||||
- `OverlappingWindowChunking`: Allows for overlapping chunks of text, useful for maintaining context between chunks.
|
||||
- Enhanced `SlidingWindowChunking`: Improved to handle edge cases and last chunks more effectively.
|
||||
|
||||
### Changed
|
||||
- Updated `CHUNK_TOKEN_THRESHOLD` in config to 2048 tokens (2^11) for better compatibility with most LLM models.
|
||||
- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
|
||||
- Enhanced flexibility in `CosineStrategy`:
|
||||
- Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
|
||||
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
|
||||
|
||||
### Fixed
|
||||
- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
|
||||
|
||||
### Developer Notes
|
||||
- Added more comprehensive docstrings to chunking strategies for better code documentation.
|
||||
- Removed hardcoded device setting in `CosineStrategy`, now using the automatically detected device.
|
||||
- Added a new example in `quickstart_async.py` for generating a knowledge graph from crawled content.
|
||||
|
||||
These updates aim to provide more flexibility in text processing, improve performance, and enhance the overall capabilities of the crawl4ai library. The new chunking strategies, in particular, offer more options for handling large texts in various scenarios.
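To make the idea of overlapping chunks concrete, here is a small word-based sketch. The function name `overlapping_window_chunks` and its parameters are hypothetical; the `OverlappingWindowChunking` class in crawl4ai may use a different interface and different tokenization.

```python
from typing import List


def overlapping_window_chunks(text: str, window_size: int = 5, overlap: int = 2) -> List[str]:
    """Split text into word windows where consecutive chunks share `overlap` words."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("require window_size > 0 and 0 <= overlap < window_size")
    words = text.split()
    if not words:
        return []
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # the last window reached the end, so every word is covered
        start += step
    return chunks
```

Stopping only once a window reaches the end of the text is what keeps the final, possibly shorter, chunk from being dropped.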
|
||||
|
||||
## [v0.3.71] - 2024-10-18
|
||||
|
||||
### Changes
|
||||
1. **Version Update**:
|
||||
- Updated version number from 0.3.7 to 0.3.71.
|
||||
|
||||
2. **Crawler Enhancements**:
|
||||
- Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
|
||||
- Improved context creation with additional options:
|
||||
- Enabled `accept_downloads` and `java_script_enabled`.
|
||||
- Added a cookie to enable cookies by default.
|
||||
|
||||
3. **Error Handling Improvements**:
|
||||
- Enhanced error messages in AsyncWebCrawler's `arun` method.
|
||||
- Updated error reporting format for better visibility and consistency.
|
||||
|
||||
4. **Performance Optimization**:
|
||||
- Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
|
||||
|
||||
### Documentation
|
||||
- Updated quickstart notebook:
|
||||
- Changed installation command to use the released package instead of GitHub repository.
|
||||
- Updated kernel display name.
|
||||
|
||||
### Developer Notes
|
||||
- Minor code refactoring and cleanup.
|
||||
|
||||
## [v0.3.7] - 2024-10-17
|
||||
|
||||
### New Features
|
||||
1. **Enhanced Browser Stealth**:
|
||||
- Implemented `playwright_stealth` for improved bot detection avoidance.
|
||||
- Added `StealthConfig` for fine-tuned control over stealth parameters.
|
||||
|
||||
2. **User Simulation**:
|
||||
- New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
|
||||
|
||||
3. **Navigator Override**:
|
||||
- Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
|
||||
|
||||
4. **Improved iframe Handling**:
|
||||
- New `process_iframes` parameter to extract and integrate iframe content into the main page.
|
||||
|
||||
5. **Flexible Browser Selection**:
|
||||
- Support for choosing between Chromium, Firefox, and WebKit browsers.
|
||||
|
||||
6. **Include Links in Markdown**:
|
||||
- Added support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method.
|
||||
|
||||
### Improvements
|
||||
1. **Better Error Handling**:
|
||||
- Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
|
||||
- Added console message and error logging for better debugging.
|
||||
|
||||
2. **Image Processing Enhancements**:
|
||||
- Improved image dimension updating and filtering logic.
|
||||
|
||||
3. **Crawling Flexibility**:
|
||||
- Added support for custom viewport sizes.
|
||||
- Implemented delayed content retrieval with `delay_before_return_html` parameter.
|
||||
|
||||
4. **Performance Optimization**:
|
||||
- Adjusted default semaphore count for parallel crawling.
|
||||
|
||||
### Bug Fixes
|
||||
- Fixed an issue where the HTML content could be empty after processing.
|
||||
|
||||
### Examples
|
||||
- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
|
||||
|
||||
### Developer Notes
|
||||
- Refactored code for better maintainability and readability.
|
||||
- Updated browser launch arguments for improved compatibility and performance.
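The semaphore count mentioned under "Performance Optimization" caps how many crawls run at once. The sketch below shows the general pattern with `asyncio.Semaphore`; `bounded_gather` and its `worker` argument are hypothetical names for illustration, and the default limit of 5 is an assumption rather than a documented setting.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def bounded_gather(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 5,
) -> List[Any]:
    """Run worker(item) for every item, with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: Any) -> Any:
        async with semaphore:  # blocks while `limit` workers are already active
            return await worker(item)

    # return_exceptions=True keeps one failure from cancelling the whole batch;
    # failed items come back as exception objects in the result list.
    return await asyncio.gather(*(run_one(i) for i in items), return_exceptions=True)
```

Results come back in input order, so callers can pair each URL with its outcome and check for exception instances before using a result.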
|
||||
|
||||
## [v0.3.6] - 2024-10-12
|
||||
|
||||
### 1. Improved Crawling Control
|
||||
|
||||
32
README.md
@@ -8,16 +8,14 @@
|
||||
|
||||
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
|
||||
|
||||
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
|
||||
## New in 0.3.72 ✨
|
||||
|
||||
## New update 0.3.6
|
||||
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- 🖼️ Improved image processing with lazy-loading detection
|
||||
- 🔧 Custom page timeout parameter for better control over crawling behavior
|
||||
- 🕰️ Enhanced handling of delayed content loading
|
||||
- 🔑 Custom headers support for LLM interactions
|
||||
- 🖼️ iframe content extraction for comprehensive page analysis
|
||||
- ⏱️ Flexible timeout and delayed content retrieval options
|
||||
- 📄 Fit markdown generation for extracting main article content.
|
||||
- 🪄 Magic mode for comprehensive anti-bot detection bypass.
|
||||
- 🌐 Enhanced multi-browser support with seamless switching (Chromium, Firefox, WebKit)
|
||||
- 📚 New chunking strategies (Sliding window, Overlapping window, Flexible size control)
|
||||
- 💾 Improved caching system for better performance
|
||||
- ⚡ Optimized batch processing with automatic rate limiting
|
||||
|
||||
## Try it Now!
|
||||
|
||||
@@ -30,22 +28,28 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
|
||||
- 🆓 Completely free and open-source
|
||||
- 🚀 Blazing fast performance, outperforming many paid services
|
||||
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
|
||||
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- 🌍 Supports crawling multiple URLs simultaneously
|
||||
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
|
||||
- 🔗 Extracts all external and internal links
|
||||
- 📚 Extracts metadata from the page
|
||||
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
|
||||
- 🔄 Custom hooks for authentication, headers, and page modifications
|
||||
- 🕵️ User-agent customization
|
||||
- 🖼️ Takes screenshots of the page
|
||||
- 🖼️ Takes screenshots of pages with enhanced error handling
|
||||
- 📜 Executes multiple custom JavaScripts before crawling
|
||||
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
|
||||
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
|
||||
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
|
||||
- 🎯 CSS selector support for precise data extraction
|
||||
- 📝 Passes instructions/keywords to refine extraction
|
||||
- 🔒 Proxy support for enhanced privacy and access
|
||||
- 🔄 Session management for complex multi-page crawling scenarios
|
||||
- 🌐 Asynchronous architecture for improved performance and scalability
|
||||
- 🔒 Proxy support with authentication for enhanced access
|
||||
- 🔄 Session management for complex multi-page crawling
|
||||
- 🌐 Asynchronous architecture for improved performance
|
||||
- 🖼️ Improved image processing with lazy-loading detection
|
||||
- 🕰️ Enhanced handling of delayed content loading
|
||||
- 🔑 Custom headers support for LLM interactions
|
||||
- 🖼️ iframe content extraction for comprehensive analysis
|
||||
- ⏱️ Flexible timeout and delayed content retrieval options
|
||||
|
||||
## Installation 🛠️
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
from .async_webcrawler import AsyncWebCrawler
|
||||
from .models import CrawlResult
|
||||
|
||||
__version__ = "0.3.6"
|
||||
__version__ = "0.3.72"
|
||||
|
||||
__all__ = [
|
||||
"AsyncWebCrawler",
|
||||
|
||||
558
crawl4ai/async_crawler_strategy copy.py
Normal file
@@ -0,0 +1,558 @@
|
||||
import asyncio
|
||||
import base64
|
||||
import time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from pathlib import Path
|
||||
from playwright.async_api import ProxySettings
|
||||
from pydantic import BaseModel
|
||||
import hashlib
|
||||
import json
|
||||
import uuid
|
||||
from playwright_stealth import stealth_async
|
||||
|
||||
class AsyncCrawlResponse(BaseModel):
|
||||
html: str
|
||||
response_headers: Dict[str, str]
|
||||
status_code: int
|
||||
screenshot: Optional[str] = None
|
||||
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
class AsyncCrawlerStrategy(ABC):
|
||||
@abstractmethod
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def take_screenshot(self, url: str) -> str:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def update_user_agent(self, user_agent: str):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def set_hook(self, hook_type: str, hook: Callable):
|
||||
pass
|
||||
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
|
||||
self.use_cached_html = use_cached_html
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
self.js_code = js_code
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
self.playwright = None
|
||||
self.browser = None
|
||||
self.hooks = {
|
||||
'on_browser_created': None,
|
||||
'on_user_agent_updated': None,
|
||||
'on_execution_started': None,
|
||||
'before_goto': None,
|
||||
'after_goto': None,
|
||||
'before_return_html': None,
|
||||
'before_retrieve_html': None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
await self.start()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
await self.close()
|
||||
|
||||
async def start(self):
|
||||
if self.playwright is None:
|
||||
self.playwright = await async_playwright().start()
|
||||
if self.browser is None:
|
||||
browser_args = {
|
||||
"headless": self.headless,
|
||||
"args": [
|
||||
"--disable-gpu",
|
||||
"--no-sandbox",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
# "--headless=new", # Use the new headless mode
|
||||
]
|
||||
}
|
||||
|
||||
# Add proxy settings if a proxy is specified
|
||||
if self.proxy:
|
||||
proxy_settings = ProxySettings(server=self.proxy)
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
# Select the appropriate browser based on the browser_type
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
|
||||
await self.execute_hook('on_browser_created', self.browser)
|
||||
|
||||
async def close(self):
|
||||
if self.browser:
|
||||
await self.browser.close()
|
||||
self.browser = None
|
||||
if self.playwright:
|
||||
await self.playwright.stop()
|
||||
self.playwright = None
|
||||
|
||||
def __del__(self):
|
||||
if self.browser or self.playwright:
|
||||
asyncio.get_event_loop().run_until_complete(self.close())
|
||||
|
||||
def set_hook(self, hook_type: str, hook: Callable):
|
||||
if hook_type in self.hooks:
|
||||
self.hooks[hook_type] = hook
|
||||
else:
|
||||
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||
|
||||
async def execute_hook(self, hook_type: str, *args):
|
||||
hook = self.hooks.get(hook_type)
|
||||
if hook:
|
||||
if asyncio.iscoroutinefunction(hook):
|
||||
return await hook(*args)
|
||||
else:
|
||||
return hook(*args)
|
||||
return args[0] if args else None
|
||||
|
||||
def update_user_agent(self, user_agent: str):
|
||||
self.user_agent = user_agent
|
||||
|
||||
def set_custom_headers(self, headers: Dict[str, str]):
|
||||
self.headers = headers
|
||||
|
||||
async def kill_session(self, session_id: str):
|
||||
if session_id in self.sessions:
|
||||
context, page, _ = self.sessions[session_id]
|
||||
await page.close()
|
||||
await context.close()
|
||||
del self.sessions[session_id]
|
||||
|
||||
def _cleanup_expired_sessions(self):
|
||||
current_time = time.time()
|
||||
expired_sessions = [
|
||||
sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl
|
||||
]
|
||||
for sid in expired_sessions:
|
||||
asyncio.create_task(self.kill_session(sid))
|
||||
|
||||
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||
wait_for = wait_for.strip()
|
||||
|
||||
if wait_for.startswith('js:'):
|
||||
# Explicitly specified JavaScript
|
||||
js_code = wait_for[3:].strip()
|
||||
return await self.csp_compliant_wait(page, js_code, timeout)
|
||||
elif wait_for.startswith('css:'):
|
||||
# Explicitly specified CSS selector
|
||||
css_selector = wait_for[4:].strip()
|
||||
try:
|
||||
await page.wait_for_selector(css_selector, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
|
||||
else:
|
||||
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
|
||||
else:
|
||||
# Auto-detect based on content
|
||||
if wait_for.startswith('()') or wait_for.startswith('function'):
|
||||
# It's likely a JavaScript function
|
||||
return await self.csp_compliant_wait(page, wait_for, timeout)
|
||||
else:
|
||||
# Assume it's a CSS selector first
|
||||
try:
|
||||
await page.wait_for_selector(wait_for, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
|
||||
else:
|
||||
# If it's not a timeout error, it might be an invalid selector
|
||||
# Let's try to evaluate it as a JavaScript function as a fallback
|
||||
try:
|
||||
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||
except Error:
|
||||
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||
"It should be either a valid CSS selector, a JavaScript function, "
|
||||
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||
|
||||
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||
wrapper_js = f"""
|
||||
async () => {{
|
||||
const userFunction = {user_wait_function};
|
||||
const startTime = Date.now();
|
||||
while (true) {{
|
||||
if (await userFunction()) {{
|
||||
return true;
|
||||
}}
|
||||
if (Date.now() - startTime > {timeout}) {{
|
||||
throw new Error('Timeout waiting for condition');
|
||||
}}
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
}}
|
||||
}}
|
||||
"""
|
||||
|
||||
try:
|
||||
await page.evaluate(wrapper_js)
|
||||
except TimeoutError:
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||
|
||||
async def process_iframes(self, page):
|
||||
# Find all iframes
|
||||
iframes = await page.query_selector_all('iframe')
|
||||
|
||||
for i, iframe in enumerate(iframes):
|
||||
try:
|
||||
# Add a unique identifier to the iframe
|
||||
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||
|
||||
# Get the frame associated with this iframe
|
||||
frame = await iframe.content_frame()
|
||||
|
||||
if frame:
|
||||
# Wait for the frame to load
|
||||
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||
|
||||
# Extract the content of the iframe's body
|
||||
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||
|
||||
# Generate a unique class name for this iframe
|
||||
class_name = f'extracted-iframe-content-{i}'
|
||||
|
||||
# Replace the iframe with a div containing the extracted content
|
||||
_iframe = iframe_content.replace('`', '\\`')
|
||||
await page.evaluate(f"""
|
||||
() => {{
|
||||
const iframe = document.getElementById('iframe-{i}');
|
||||
const div = document.createElement('div');
|
||||
div.innerHTML = `{_iframe}`;
|
||||
div.className = '{class_name}';
|
||||
iframe.replaceWith(div);
|
||||
}}
|
||||
""")
|
||||
else:
|
||||
print(f"Warning: Could not access content frame for iframe {i}")
|
||||
except Exception as e:
|
||||
print(f"Error processing iframe {i}: {str(e)}")
|
||||
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
response_headers = {}
|
||||
status_code = None
|
||||
|
||||
self._cleanup_expired_sessions()
|
||||
session_id = kwargs.get("session_id")
|
||||
if session_id:
|
||||
context, page, _ = self.sessions.get(session_id, (None, None, None))
|
||||
if not context:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
|
||||
if kwargs.get("override_navigator", False):
|
||||
# Inject scripts to override navigator properties
|
||||
await context.add_init_script("""
|
||||
// Pass the Permissions Test.
|
||||
const originalQuery = window.navigator.permissions.query;
|
||||
window.navigator.permissions.query = (parameters) => (
|
||||
parameters.name === 'notifications' ?
|
||||
Promise.resolve({ state: Notification.permission }) :
|
||||
originalQuery(parameters)
|
||||
);
|
||||
Object.defineProperty(navigator, 'webdriver', {
|
||||
get: () => undefined
|
||||
});
|
||||
window.navigator.chrome = {
|
||||
runtime: {},
|
||||
// Add other properties if necessary
|
||||
};
|
||||
Object.defineProperty(navigator, 'plugins', {
|
||||
get: () => [1, 2, 3, 4, 5],
|
||||
});
|
||||
Object.defineProperty(navigator, 'languages', {
|
||||
get: () => ['en-US', 'en'],
|
||||
});
|
||||
Object.defineProperty(document, 'hidden', {
|
||||
get: () => false
|
||||
});
|
||||
Object.defineProperty(document, 'visibilityState', {
|
||||
get: () => 'visible'
|
||||
});
|
||||
""")
|
||||
|
||||
page = await context.new_page()
|
||||
|
||||
try:
|
||||
if self.verbose:
|
||||
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
if os.path.exists(cache_file_path):
|
||||
html = ""
|
||||
with open(cache_file_path, "r", encoding="utf-8") as f:
|
||||
html = f.read()
|
||||
# retrieve response headers and status code from cache
|
||||
with open(cache_file_path + ".meta", "r", encoding="utf-8") as f:
|
||||
meta = json.load(f)
|
||||
response_headers = meta.get("response_headers", {})
|
||||
status_code = meta.get("status_code")
|
||||
response = AsyncCrawlResponse(
|
||||
html=html, response_headers=response_headers, status_code=status_code
|
||||
)
|
||||
return response
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page)
|
||||
|
||||
response = await page.goto("about:blank")
|
||||
await stealth_async(page)
|
||||
response = await page.goto(
|
||||
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
|
||||
)
|
||||
|
||||
# await stealth_async(page)
|
||||
# response = await page.goto("about:blank")
|
||||
# await stealth_async(page)
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page)
|
||||
|
||||
# Get status code and headers
|
||||
status_code = response.status
|
||||
response_headers = response.headers
|
||||
else:
|
||||
status_code = 200
|
||||
response_headers = {}
|
||||
|
||||
await page.wait_for_selector('body')
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
if isinstance(js_code, str):
|
||||
await page.evaluate(js_code)
|
||||
elif isinstance(js_code, list):
|
||||
for js in js_code:
|
||||
await page.evaluate(js)
|
||||
|
||||
await page.wait_for_load_state('networkidle')
|
||||
# Check for on execution event
|
||||
await self.execute_hook('on_execution_started', page)
|
||||
|
||||
if kwargs.get("simulate_user", False):
|
||||
# Simulate user interactions
|
||||
await page.mouse.move(100, 100)
|
||||
await page.mouse.down()
|
||||
await page.mouse.up()
|
||||
await page.keyboard.press('ArrowDown')
|
||||
|
||||
# Handle the wait_for parameter
|
||||
wait_for = kwargs.get("wait_for")
|
||||
if wait_for:
|
||||
try:
|
||||
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||
|
||||
|
||||
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
// Filter out images that are too small
|
||||
if (img.width < 100 && img.height < 100) return false;
|
||||
|
||||
// Filter out images that are not visible
|
||||
const rect = img.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return false;
|
||||
|
||||
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||
|
||||
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||
|
||||
return true;
|
||||
};
|
||||
|
||||
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||
let imagesLeft = images.length;
|
||||
|
||||
if (imagesLeft === 0) {
|
||||
resolve();
|
||||
return;
|
||||
}
|
||||
|
||||
const checkImage = (img) => {
|
||||
if (img.complete && img.naturalWidth !== 0) {
|
||||
img.setAttribute('width', img.naturalWidth);
|
||||
img.setAttribute('height', img.naturalHeight);
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
}
|
||||
};
|
||||
|
||||
images.forEach(img => {
|
||||
checkImage(img);
|
||||
if (!img.complete) {
|
||||
img.onload = () => {
|
||||
checkImage(img);
|
||||
};
|
||||
img.onerror = () => {
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
};
|
||||
}
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
setTimeout(() => resolve(), 5000);
|
||||
});
|
||||
}
|
||||
"""
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
page = await self.process_iframes(page)
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
html = await page.content()
|
||||
await self.execute_hook('before_return_html', page, html)
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
screenshot_data = await self.take_screenshot(url)
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
with open(cache_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
# store response headers and status code in cache
|
||||
with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
|
||||
json.dump({
|
||||
"response_headers": response_headers,
|
||||
"status_code": status_code
|
||||
}, f)
|
||||
|
||||
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||
if self.verbose:
|
||||
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||
await asyncio.sleep(delay)
|
||||
return await page.content()
|
||||
|
||||
response = AsyncCrawlResponse(
|
||||
html=html,
|
||||
response_headers=response_headers,
|
||||
status_code=status_code,
|
||||
screenshot=screenshot_data,
|
||||
get_delayed_content=get_delayed_content
|
||||
)
|
||||
return response
|
||||
except Error as e:
|
||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
finally:
|
||||
if not session_id:
|
||||
await page.close()
|
||||
await context.close()
|
||||
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
|
||||
async def crawl_with_semaphore(url):
|
||||
async with semaphore:
|
||||
return await self.crawl(url, **kwargs)
|
||||
|
||||
tasks = [crawl_with_semaphore(url) for url in urls]
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def take_screenshot(self, url: str, wait_time=1000) -> str:
|
||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||
page = await context.new_page()
|
||||
try:
|
||||
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
# Wait for a specified time (default is 1 second)
|
||||
await page.wait_for_timeout(wait_time)
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
error_message = f"Failed to take screenshot: {str(e)}"
|
||||
print(error_message)
|
||||
|
||||
# Generate an error image
|
||||
img = Image.new('RGB', (800, 600), color='black')
|
||||
draw = ImageDraw.Draw(img)
|
||||
font = ImageFont.load_default()
|
||||
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
|
||||
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode('utf-8')
|
||||
finally:
|
||||
await page.close()
|
||||
|
||||
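The `crawl_many` method above bounds concurrency with a semaphore and collects failures as strings instead of raising. A minimal sketch of that pattern follows, assuming a hypothetical `fake_crawl` coroutine in place of `self.crawl` (it is not part of the library) and using only the standard library:

```python
import asyncio
from typing import List


async def fake_crawl(url: str) -> str:
    """Hypothetical stand-in for self.crawl(); fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return f"<html>{url}</html>"


async def crawl_many_sketch(urls: List[str], semaphore_count: int = 5) -> List[str]:
    # At most `semaphore_count` crawls run at the same time.
    semaphore = asyncio.Semaphore(semaphore_count)

    async def crawl_with_semaphore(url: str) -> str:
        async with semaphore:
            return await fake_crawl(url)

    # return_exceptions=True keeps one failure from cancelling the whole batch.
    results = await asyncio.gather(
        *(crawl_with_semaphore(u) for u in urls), return_exceptions=True
    )
    # As in the diff, exceptions are flattened to their string form.
    return [r if not isinstance(r, Exception) else str(r) for r in results]
```

One consequence of this design is that callers receive a mixed list: successful results and error strings share the same positions as the input URLs, so each entry has to be checked before use.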
@@ -1,17 +1,35 @@
|
||||
import asyncio
|
||||
import base64, time
|
||||
import base64
|
||||
import time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from .utils import sanitize_input_encode, calculate_semaphore_count
|
||||
import json, uuid
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
from playwright.async_api import ProxySettings
|
||||
from pydantic import BaseModel
|
||||
import hashlib
|
||||
import json
|
||||
import uuid
|
||||
from playwright_stealth import StealthConfig, stealth_async
|
||||
|
||||
stealth_config = StealthConfig(
|
||||
webdriver=True,
|
||||
chrome_app=True,
|
||||
chrome_csi=True,
|
||||
chrome_load_times=True,
|
||||
chrome_runtime=True,
|
||||
navigator_languages=True,
|
||||
navigator_plugins=True,
|
||||
navigator_permissions=True,
|
||||
webgl_vendor=True,
|
||||
outerdimensions=True,
|
||||
navigator_hardware_concurrency=True,
|
||||
media_codecs=True,
|
||||
)
|
||||
|
||||
|
||||
class AsyncCrawlResponse(BaseModel):
|
||||
html: str
|
||||
@@ -33,7 +51,7 @@ class AsyncCrawlerStrategy(ABC):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def take_screenshot(self, url: str) -> str:
|
||||
async def take_screenshot(self, **kwargs) -> str:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
@@ -47,10 +65,15 @@ class AsyncCrawlerStrategy(ABC):
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
|
||||
self.use_cached_html = use_cached_html
|
||||
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.proxy_config = kwargs.get("proxy_config")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.browser_type = kwargs.get("browser_type", "chromium") # New parameter
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
@@ -58,6 +81,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
self.playwright = None
|
||||
self.browser = None
|
||||
self.sleep_on_close = kwargs.get("sleep_on_close", False)
|
||||
self.hooks = {
|
||||
'on_browser_created': None,
|
||||
'on_user_agent_updated': None,
|
||||
@@ -83,9 +107,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"headless": self.headless,
|
||||
"args": [
|
||||
"--disable-gpu",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-setuid-sandbox",
|
||||
"--no-sandbox",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
# "--headless=new", # Use the new headless mode
|
||||
]
|
||||
}
|
||||
|
||||
@@ -93,7 +122,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.proxy:
|
||||
proxy_settings = ProxySettings(server=self.proxy)
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
elif self.proxy_config:
|
||||
proxy_settings = ProxySettings(server=self.proxy_config.get("server"), username=self.proxy_config.get("username"), password=self.proxy_config.get("password"))
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
# Select the appropriate browser based on the browser_type
|
||||
if self.browser_type == "firefox":
|
||||
@@ -106,6 +137,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
await self.execute_hook('on_browser_created', self.browser)
|
||||
|
||||
async def close(self):
|
||||
if self.sleep_on_close:
|
||||
await asyncio.sleep(0.5)
|
||||
if self.browser:
|
||||
await self.browser.close()
|
||||
self.browser = None
|
||||
@@ -147,8 +180,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
def _cleanup_expired_sessions(self):
|
||||
current_time = time.time()
|
||||
expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl]
|
||||
expired_sessions = [
|
||||
sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl
|
||||
]
|
||||
for sid in expired_sessions:
|
||||
asyncio.create_task(self.kill_session(sid))
|
||||
|
||||
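The expiry check in `_cleanup_expired_sessions` can be illustrated in isolation. The sketch below assumes sessions are stored as `(context, page, last_used)` tuples, as in the diff. The function name `find_expired_sessions` and the `now` parameter are hypothetical, added only to make the logic testable without a browser:

```python
import time
from typing import Dict, List, Optional, Tuple

# (context, page, last_used); the first two entries are placeholders in this sketch.
Session = Tuple[object, object, float]


def find_expired_sessions(
    sessions: Dict[str, Session],
    session_ttl: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the IDs of sessions idle for longer than session_ttl seconds."""
    current_time = time.time() if now is None else now
    return [
        sid for sid, (_, _, last_used) in sessions.items()
        if current_time - last_used > session_ttl
    ]
```

In the strategy itself, each expired ID is then passed to `kill_session` through `asyncio.create_task`, so cleanup does not block the caller.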
@@ -188,8 +223,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||
except Error:
|
||||
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||
"It should be either a valid CSS selector, a JavaScript function, "
|
||||
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||
"It should be either a valid CSS selector, a JavaScript function, "
|
||||
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||
|
||||
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||
wrapper_js = f"""
|
||||
@@ -254,8 +289,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
print(f"Error processing iframe {i}: {str(e)}")
|
||||
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
return page
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
response_headers = {}
|
||||
@@ -268,25 +302,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if not context:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=True,
|
||||
java_script_enabled=True
|
||||
)
|
||||
await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
|
||||
if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
|
||||
# Inject scripts to override navigator properties
|
||||
await context.add_init_script("""
|
||||
// Pass the Permissions Test.
|
||||
const originalQuery = window.navigator.permissions.query;
|
||||
window.navigator.permissions.query = (parameters) => (
|
||||
parameters.name === 'notifications' ?
|
||||
Promise.resolve({ state: Notification.permission }) :
|
||||
originalQuery(parameters)
|
||||
);
|
||||
Object.defineProperty(navigator, 'webdriver', {
|
||||
get: () => undefined
|
||||
});
|
||||
window.navigator.chrome = {
|
||||
runtime: {},
|
||||
// Add other properties if necessary
|
||||
};
|
||||
Object.defineProperty(navigator, 'plugins', {
|
||||
get: () => [1, 2, 3, 4, 5],
|
||||
});
|
||||
Object.defineProperty(navigator, 'languages', {
|
||||
get: () => ['en-US', 'en'],
|
||||
});
|
||||
Object.defineProperty(document, 'hidden', {
|
||||
get: () => false
|
||||
});
|
||||
Object.defineProperty(document, 'visibilityState', {
|
||||
get: () => 'visible'
|
||||
});
|
||||
""")
|
||||
|
||||
page = await context.new_page()
|
||||
# await stealth_async(page) #, stealth_config)
|
||||
|
||||
# Add console message and error logging
|
||||
if kwargs.get("log_console", False):
|
||||
page.on("console", lambda msg: print(f"Console: {msg.text}"))
|
||||
page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
|
||||
|
||||
try:
|
||||
if self.verbose:
|
||||
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
if os.path.exists(cache_file_path):
|
||||
html = ""
|
||||
with open(cache_file_path, "r") as f:
|
||||
@@ -296,12 +375,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
meta = json.load(f)
|
||||
response_headers = meta.get("response_headers", {})
|
||||
status_code = meta.get("status_code")
|
||||
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
response = AsyncCrawlResponse(
|
||||
html=html, response_headers=response_headers, status_code=status_code
|
||||
)
|
||||
return response
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page)
|
||||
response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
|
||||
|
||||
response = await page.goto(
|
||||
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
|
||||
)
|
||||
|
||||
# response = await page.goto("about:blank")
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page)
|
||||
|
||||
# Get status code and headers
|
||||
@@ -311,37 +399,30 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
status_code = 200
|
||||
response_headers = {}
|
||||
|
||||
|
||||
await page.wait_for_selector('body')
|
||||
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
if isinstance(js_code, str):
|
||||
r = await page.evaluate(js_code)
|
||||
await page.evaluate(js_code)
|
||||
elif isinstance(js_code, list):
|
||||
for js in js_code:
|
||||
await page.evaluate(js)
|
||||
|
||||
# await page.wait_for_timeout(100)
|
||||
await page.wait_for_load_state('networkidle')
|
||||
# Check for on execution even
|
||||
# Check for on execution event
|
||||
await self.execute_hook('on_execution_started', page)
|
||||
|
||||
# New code to handle the wait_for parameter
|
||||
# Example usage:
|
||||
# await crawler.crawl(
|
||||
# url,
|
||||
# js_code="// some JavaScript code",
|
||||
# wait_for="""() => {
|
||||
# return document.querySelector('#my-element') !== null;
|
||||
# }"""
|
||||
# )
|
||||
# Example of using a CSS selector:
|
||||
# await crawler.crawl(
|
||||
# url,
|
||||
# wait_for="#my-element"
|
||||
# )
|
||||
if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
|
||||
# Simulate user interactions
|
||||
await page.mouse.move(100, 100)
|
||||
await page.mouse.down()
|
||||
await page.mouse.up()
|
||||
await page.keyboard.press('ArrowDown')
|
||||
|
||||
# Handle the wait_for parameter
|
||||
wait_for = kwargs.get("wait_for")
|
||||
if wait_for:
|
||||
try:
|
||||
@@ -349,13 +430,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
screenshot_data = await self.take_screenshot(url)
|
||||
|
||||
|
||||
# New code to update image dimensions
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
@@ -407,7 +482,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
setTimeout(() => resolve(), 5000);
|
||||
// setTimeout(() => resolve(), 5000);
|
||||
resolve();
|
||||
});
|
||||
}
|
||||
"""
|
||||
@@ -426,14 +502,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
# Check for remove_overlay_elements parameter
|
||||
if kwargs.get("remove_overlay_elements", False):
|
||||
await self.remove_overlay_elements(page)
|
||||
|
||||
html = await page.content()
|
||||
await self.execute_hook('before_return_html', page, html)
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
# If screenshot_wait_for is set, wait that many seconds before taking the screenshot
|
||||
screenshot_wait_for = kwargs.get("screenshot_wait_for")
|
||||
if screenshot_wait_for:
|
||||
await asyncio.sleep(screenshot_wait_for)
|
||||
screenshot_data = await self.take_screenshot(page)
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
with open(cache_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
# store response headers and status code in cache
|
||||
@@ -443,7 +534,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"status_code": status_code
|
||||
}, f)
|
||||
|
||||
|
||||
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||
if self.verbose:
|
||||
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||
@@ -459,63 +549,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
)
|
||||
return response
|
||||
except Error as e:
|
||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
finally:
|
||||
if not session_id:
|
||||
await page.close()
|
||||
raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
|
||||
# finally:
|
||||
# if not session_id:
|
||||
# await page.close()
|
||||
# await context.close()
|
||||
|
||||
# try:
|
||||
# html = await _crawl()
|
||||
# return sanitize_input_encode(html)
|
||||
# except Error as e:
|
||||
# raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
# except Exception as e:
|
||||
# raise Exception(f"Failed to crawl {url}: {str(e)}")
|
||||
|
||||
async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
|
||||
"""
|
||||
Execute JavaScript code in a specific session and optionally wait for a condition.
|
||||
|
||||
:param session_id: The ID of the session to execute the JS code in.
|
||||
:param js_code: The JavaScript code to execute.
|
||||
:param wait_for_js: JavaScript condition to wait for after execution.
|
||||
:param wait_for_css: CSS selector to wait for after execution.
|
||||
:return: AsyncCrawlResponse containing the page's HTML and other information.
|
||||
:raises ValueError: If the session does not exist.
|
||||
"""
|
||||
if not session_id:
|
||||
raise ValueError("Session ID must be provided")
|
||||
|
||||
if session_id not in self.sessions:
|
||||
raise ValueError(f"No active session found for session ID: {session_id}")
|
||||
|
||||
context, page, last_used = self.sessions[session_id]
|
||||
|
||||
try:
|
||||
await page.evaluate(js_code)
|
||||
|
||||
if wait_for_js:
|
||||
await page.wait_for_function(wait_for_js)
|
||||
|
||||
if wait_for_css:
|
||||
await page.wait_for_selector(wait_for_css)
|
||||
|
||||
# Get the updated HTML content
|
||||
html = await page.content()
|
||||
|
||||
# Get response headers and status code (assuming these are available)
|
||||
response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
|
||||
status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
|
||||
|
||||
# Update the last used time for this session
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
|
||||
return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
except Error as e:
|
||||
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
|
||||
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
|
||||
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
|
||||
async def crawl_with_semaphore(url):
|
||||
@@ -526,27 +567,156 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def take_screenshot(self, url: str, wait_time = 1000) -> str:
|
||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||
page = await context.new_page()
|
||||
try:
|
||||
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
# Wait for a specified time (default is 1 second)
|
||||
await page.wait_for_timeout(wait_time)
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
error_message = f"Failed to take screenshot: {str(e)}"
|
||||
print(error_message)
|
||||
async def remove_overlay_elements(self, page: Page) -> None:
|
||||
"""
|
||||
Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.
|
||||
|
||||
Args:
|
||||
page (Page): The Playwright page instance
|
||||
"""
|
||||
remove_overlays_js = """
|
||||
async () => {
|
||||
// Function to check if element is visible
|
||||
const isVisible = (elem) => {
|
||||
const style = window.getComputedStyle(elem);
|
||||
return style.display !== 'none' &&
|
||||
style.visibility !== 'hidden' &&
|
||||
style.opacity !== '0';
|
||||
};
|
||||
|
||||
# Generate an error image
|
||||
img = Image.new('RGB', (800, 600), color='black')
|
||||
draw = ImageDraw.Draw(img)
|
||||
font = ImageFont.load_default()
|
||||
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
|
||||
// Common selectors for popups and overlays
|
||||
const commonSelectors = [
|
||||
// Close buttons first
|
||||
'button[class*="close" i]', 'button[class*="dismiss" i]',
|
||||
'button[aria-label*="close" i]', 'button[title*="close" i]',
|
||||
'a[class*="close" i]', 'span[class*="close" i]',
|
||||
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode('utf-8')
|
||||
finally:
|
||||
await page.close()
|
||||
// Cookie notices
|
||||
'[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
|
||||
'[class*="cookie-consent" i]', '[id*="cookie-consent" i]',
|
||||
|
||||
// Newsletter/subscription dialogs
|
||||
'[class*="newsletter" i]', '[class*="subscribe" i]',
|
||||
|
||||
// Generic popups/modals
|
||||
'[class*="popup" i]', '[class*="modal" i]',
|
||||
'[class*="overlay" i]', '[class*="dialog" i]',
|
||||
'[role="dialog"]', '[role="alertdialog"]'
|
||||
];
|
||||
|
||||
// Try to click close buttons first
|
||||
for (const selector of commonSelectors.slice(0, 6)) {
|
||||
const closeButtons = document.querySelectorAll(selector);
|
||||
for (const button of closeButtons) {
|
||||
if (isVisible(button)) {
|
||||
try {
|
||||
button.click();
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
} catch (e) {
|
||||
console.log('Error clicking button:', e);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Remove remaining overlay elements
|
||||
const removeOverlays = () => {
|
||||
// Find elements with high z-index
|
||||
const allElements = document.querySelectorAll('*');
|
||||
for (const elem of allElements) {
|
||||
const style = window.getComputedStyle(elem);
|
||||
const zIndex = parseInt(style.zIndex);
|
||||
const position = style.position;
|
||||
|
||||
if (
|
||||
isVisible(elem) &&
|
||||
(zIndex > 999 || position === 'fixed' || position === 'absolute') &&
|
||||
(
|
||||
elem.offsetWidth > window.innerWidth * 0.5 ||
|
||||
elem.offsetHeight > window.innerHeight * 0.5 ||
|
||||
style.backgroundColor.includes('rgba') ||
|
||||
parseFloat(style.opacity) < 1
|
||||
)
|
||||
) {
|
||||
elem.remove();
|
||||
}
|
||||
}
|
||||
|
||||
// Remove elements matching common selectors
|
||||
for (const selector of commonSelectors) {
|
||||
const elements = document.querySelectorAll(selector);
|
||||
elements.forEach(elem => {
|
||||
if (isVisible(elem)) {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
}
|
||||
};
|
||||
|
||||
// Remove overlay elements
|
||||
removeOverlays();
|
||||
|
||||
// Remove any fixed/sticky position elements at the top/bottom
|
||||
const removeFixedElements = () => {
|
||||
const elements = document.querySelectorAll('*');
|
||||
elements.forEach(elem => {
|
||||
const style = window.getComputedStyle(elem);
|
||||
if (
|
||||
(style.position === 'fixed' || style.position === 'sticky') &&
|
||||
isVisible(elem)
|
||||
) {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
};
|
||||
|
||||
removeFixedElements();
|
||||
|
||||
// Remove empty block elements such as div, p, span, etc.
|
||||
const removeEmptyBlockElements = () => {
|
||||
const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
|
||||
blockElements.forEach(elem => {
|
||||
if (elem.innerText.trim() === '') {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
};
|
||||
|
||||
// Remove margin-right and padding-right from body (often added by modal scripts)
|
||||
document.body.style.marginRight = '0px';
|
||||
document.body.style.paddingRight = '0px';
|
||||
document.body.style.overflow = 'auto';
|
||||
|
||||
// Wait a bit for any animations to complete
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
}
|
||||
"""
|
||||
|
||||
try:
|
||||
await page.evaluate(remove_overlays_js)
|
||||
await page.wait_for_timeout(500) # Wait for any animations to complete
|
||||
except Exception as e:
|
||||
if self.verbose:
|
||||
print(f"Warning: Failed to remove overlay elements: {str(e)}")
|
||||
|
||||
async def take_screenshot(self, page: Page) -> str:
|
||||
try:
|
||||
# The page is already loaded, just take the screenshot
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
error_message = f"Failed to take screenshot: {str(e)}"
|
||||
print(error_message)
|
||||
|
||||
# Generate an error image
|
||||
img = Image.new('RGB', (800, 600), color='black')
|
||||
draw = ImageDraw.Draw(img)
|
||||
font = ImageFont.load_default()
|
||||
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
|
||||
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode('utf-8')
|
||||
finally:
|
||||
await page.close()
|
||||
|
||||
|
||||
@@ -30,6 +30,7 @@ class AsyncWebCrawler:
|
||||
**kwargs
|
||||
)
|
||||
self.always_by_pass_cache = always_by_pass_cache
|
||||
# self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
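The hunk above moves the `.crawl4ai` folder under a configurable `base_directory`, while the strategy code keys cache files by the MD5 hash of the URL and stores headers in a sibling `.meta` file. A rough sketch of how those two pieces could fit together is shown below. The helper names `cache_paths` and `write_cache` are hypothetical, and combining `base_directory` with the strategy's file naming is an assumption rather than something the diff shows:

```python
import hashlib
import json
import os
import tempfile


def cache_paths(base_directory: str, url: str):
    """Build the HTML and metadata cache paths for a URL (hypothetical helper)."""
    cache_dir = os.path.join(base_directory, ".crawl4ai", "cache")
    os.makedirs(cache_dir, exist_ok=True)
    html_path = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())
    return html_path, html_path + ".meta"


def write_cache(base_directory, url, html, response_headers, status_code):
    """Write the page HTML plus a JSON .meta file, mirroring the diff's layout."""
    html_path, meta_path = cache_paths(base_directory, url)
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(
            {"response_headers": response_headers, "status_code": status_code}, f
        )
    return html_path


if __name__ == "__main__":
    # Use a throwaway directory instead of the home folder.
    with tempfile.TemporaryDirectory() as base:
        print(write_cache(base, "https://example.com", "<html></html>", {}, 200))
```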
@@ -134,8 +135,8 @@ class AsyncWebCrawler:
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
|
||||
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
|
||||
print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
|
||||
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)
|
||||
|
||||
async def arun_many(
|
||||
self,
|
||||
@@ -187,7 +188,8 @@ class AsyncWebCrawler:
|
||||
try:
|
||||
t1 = time.time()
|
||||
scrapping_strategy = WebScrappingStrategy()
|
||||
result = await scrapping_strategy.ascrap(
|
||||
# result = await scrapping_strategy.ascrap(
|
||||
result = scrapping_strategy.scrap(
|
||||
url,
|
||||
html,
|
||||
word_count_threshold=word_count_threshold,
|
||||
@@ -196,6 +198,7 @@ class AsyncWebCrawler:
|
||||
image_description_min_word_threshold=kwargs.get(
|
||||
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
if verbose:
|
||||
print(
|
||||
@@ -211,6 +214,8 @@ class AsyncWebCrawler:
|
||||
|
||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
|
||||
fit_html = sanitize_input_encode(result.get("fit_html", ""))
|
||||
media = result.get("media", [])
|
||||
links = result.get("links", [])
|
||||
metadata = result.get("metadata", {})
|
||||
@@ -257,6 +262,8 @@ class AsyncWebCrawler:
|
||||
html=html,
|
||||
cleaned_html=format_html(cleaned_html),
|
||||
markdown=markdown,
|
||||
fit_markdown=fit_markdown,
|
||||
fit_html= fit_html,
|
||||
media=media,
|
||||
links=links,
|
||||
metadata=metadata,
|
||||
|
||||
@@ -84,6 +84,12 @@ class TopicSegmentationChunking(ChunkingStrategy):
|
||||
# Fixed-length word chunks
|
||||
class FixedLengthWordChunking(ChunkingStrategy):
|
||||
def __init__(self, chunk_size=100, **kwargs):
|
||||
"""
|
||||
Initialize the fixed-length word chunking strategy with the given chunk size.
|
||||
|
||||
Args:
|
||||
chunk_size (int): The size of each chunk in words.
|
||||
"""
|
||||
self.chunk_size = chunk_size
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
@@ -93,14 +99,64 @@ class FixedLengthWordChunking(ChunkingStrategy):
|
||||
# Sliding window chunking
|
||||
class SlidingWindowChunking(ChunkingStrategy):
|
||||
def __init__(self, window_size=100, step=50, **kwargs):
|
||||
"""
|
||||
Initialize the sliding window chunking strategy with the given window size and
|
||||
step size.
|
||||
|
||||
Args:
|
||||
window_size (int): The size of the sliding window in words.
|
||||
step (int): The step size for sliding the window in words.
|
||||
"""
|
||||
self.window_size = window_size
|
||||
self.step = step
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
words = text.split()
|
||||
chunks = []
|
||||
for i in range(0, len(words), self.step):
|
||||
chunks.append(' '.join(words[i:i + self.window_size]))
|
||||
|
||||
if len(words) <= self.window_size:
|
||||
return [text]
|
||||
|
||||
for i in range(0, len(words) - self.window_size + 1, self.step):
|
||||
chunk = ' '.join(words[i:i + self.window_size])
|
||||
chunks.append(chunk)
|
||||
|
||||
# Handle the last chunk if it doesn't align perfectly
|
||||
if i + self.window_size < len(words):
|
||||
chunks.append(' '.join(words[-self.window_size:]))
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
class OverlappingWindowChunking(ChunkingStrategy):
|
||||
def __init__(self, window_size=1000, overlap=100, **kwargs):
|
||||
"""
|
||||
Initialize the overlapping window chunking strategy with the given window size and
|
||||
overlap size.
|
||||
|
||||
Args:
|
||||
window_size (int): The size of the window in words.
|
||||
overlap (int): The size of the overlap between consecutive chunks in words.
|
||||
"""
|
||||
self.window_size = window_size
|
||||
self.overlap = overlap
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
words = text.split()
|
||||
chunks = []
|
||||
|
||||
if len(words) <= self.window_size:
|
||||
return [text]
|
||||
|
||||
start = 0
|
||||
while start < len(words):
|
||||
end = start + self.window_size
|
||||
chunk = ' '.join(words[start:end])
|
||||
chunks.append(chunk)
|
||||
|
||||
if end >= len(words):
|
||||
break
|
||||
|
||||
start = end - self.overlap
|
||||
|
||||
return chunks
|
||||
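The two windowing strategies in this hunk differ in how the next window start is chosen. The sketch below restates the new logic as free functions (the function names are hypothetical, not the library's API) so the behavior can be checked on a small input. The guard against `overlap >= window_size` is an addition of this sketch and is not in the diff. Without it the loop would never advance.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int, step: int) -> List[str]:
    """Fixed-size windows advanced by `step` words, plus a final tail window."""
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    for i in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[i:i + window_size]))
    # If the last full window stops short of the end, add one window
    # aligned to the end of the text.
    if i + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


def overlapping_window_chunks(text: str, window_size: int, overlap: int) -> List[str]:
    """Windows of `window_size` words that share `overlap` words with the previous one."""
    if overlap >= window_size:
        # Added in this sketch only: the start index would not move forward.
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap
    return chunks
```

Note the difference in the final chunk: the sliding variant always ends with a full-size window, while the overlapping variant may end with a shorter one.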
@@ -4,24 +4,23 @@ from dotenv import load_dotenv
|
||||
load_dotenv() # Load environment variables from .env file
|
||||
|
||||
# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
|
||||
DEFAULT_PROVIDER = "openai/gpt-4-turbo"
|
||||
DEFAULT_PROVIDER = "openai/gpt-4o-mini"
|
||||
MODEL_REPO_BRANCH = "new-release-0.0.2"
|
||||
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
|
||||
PROVIDER_MODELS = {
|
||||
"ollama/llama3": "no-token-needed", # Ollama models do not need an API token
|
||||
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
|
||||
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
|
||||
"openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
|
||||
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
|
||||
}
|
||||
|
||||
|
||||
# Chunk token threshold
|
||||
CHUNK_TOKEN_THRESHOLD = 500
|
||||
CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
|
||||
OVERLAP_RATE = 0.1
|
||||
WORD_TOKEN_RATE = 1.3
|
||||
|
||||
@@ -29,6 +28,20 @@ WORD_TOKEN_RATE = 1.3
|
||||
MIN_WORD_THRESHOLD = 1
|
||||
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
|
||||
|
||||
IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']
|
||||
ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
|
||||
SOCIAL_MEDIA_DOMAINS = [
|
||||
'facebook.com',
|
||||
'twitter.com',
|
||||
'x.com',
|
||||
'linkedin.com',
|
||||
'instagram.com',
|
||||
'pinterest.com',
|
||||
'tiktok.com',
|
||||
'snapchat.com',
|
||||
'reddit.com',
|
||||
]
|
||||
|
||||
# Threshold for the Image extraction - Range is 1 to 6
|
||||
# Images are scored with a point-based system so they can be filtered by usefulness. Points are assigned
|
||||
# to each image based on the following aspects.
|
||||
|
||||
196
crawl4ai/content_cleaning_strategy.py
Normal file
@@ -0,0 +1,196 @@
|
||||
from bs4 import BeautifulSoup, Tag
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
class ContentCleaningStrategy:
|
||||
def __init__(self):
|
||||
# Precompile regex patterns for performance
|
||||
self.negative_patterns = re.compile(r'nav|footer|header|sidebar|ads|comment', re.I)
|
||||
self.positive_patterns = re.compile(r'content|article|main|post', re.I)
|
||||
self.priority_tags = {'article', 'main', 'section', 'div'}
|
||||
self.non_content_tags = {'nav', 'footer', 'header', 'aside'}
|
||||
# Thresholds
|
||||
self.text_density_threshold = 9.0
|
||||
self.min_word_count = 50
|
||||
self.link_density_threshold = 0.2
|
||||
self.max_dom_depth = 10 # To prevent excessive DOM traversal
|
||||
|
||||
def clean(self, clean_html: str) -> str:
|
||||
"""
|
||||
Main function that takes cleaned HTML and returns super cleaned HTML.
|
||||
|
||||
Args:
|
||||
clean_html (str): The cleaned HTML content.
|
||||
|
||||
Returns:
|
||||
str: The super cleaned HTML containing only the main content.
|
||||
"""
|
||||
try:
|
||||
if not clean_html or not isinstance(clean_html, str):
|
||||
return ''
|
||||
soup = BeautifulSoup(clean_html, 'html.parser')
|
||||
main_content = self.extract_main_content(soup)
|
||||
if main_content:
|
||||
super_clean_element = self.clean_element(main_content)
|
||||
return str(super_clean_element)
|
||||
else:
|
||||
return ''
|
||||
except Exception:
|
||||
# Handle exceptions silently or log them as needed
|
||||
return ''
|
||||
|
||||
def extract_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
|
||||
"""
|
||||
Identifies and extracts the main content element from the HTML.
|
||||
|
||||
Args:
|
||||
soup (BeautifulSoup): The parsed HTML soup.
|
||||
|
||||
Returns:
|
||||
Optional[Tag]: The Tag object containing the main content, or None if not found.
|
||||
"""
|
||||
candidates = []
|
||||
for element in soup.find_all(self.priority_tags):
|
||||
if self.is_non_content_tag(element):
|
||||
continue
|
||||
if self.has_negative_class_id(element):
|
||||
continue
|
||||
score = self.calculate_content_score(element)
|
||||
candidates.append((score, element))
|
||||
|
||||
if not candidates:
|
||||
return None
|
||||
|
||||
# Sort candidates by score in descending order
|
||||
candidates.sort(key=lambda x: x[0], reverse=True)
|
||||
# Select the element with the highest score
|
||||
best_element = candidates[0][1]
|
||||
return best_element
|
||||
|
||||
def calculate_content_score(self, element: Tag) -> float:
|
||||
"""
|
||||
Calculates a score for an element based on various heuristics.
|
||||
|
||||
Args:
|
||||
element (Tag): The HTML element to score.
|
||||
|
||||
Returns:
|
||||
float: The content score of the element.
|
||||
"""
|
||||
score = 0.0
|
||||
|
||||
if self.is_priority_tag(element):
|
||||
score += 5.0
|
||||
if self.has_positive_class_id(element):
|
||||
score += 3.0
|
||||
if self.has_negative_class_id(element):
|
||||
score -= 3.0
|
||||
if self.is_high_text_density(element):
|
||||
score += 2.0
|
||||
if self.is_low_link_density(element):
|
||||
score += 2.0
|
||||
if self.has_sufficient_content(element):
|
||||
score += 2.0
|
||||
if self.has_headings(element):
|
||||
score += 3.0
|
||||
|
||||
dom_depth = self.calculate_dom_depth(element)
|
||||
score += min(dom_depth, self.max_dom_depth) * 0.5 # Adjust weight as needed
|
||||
|
||||
return score
|
||||
|
||||
def is_priority_tag(self, element: Tag) -> bool:
|
||||
"""Checks if the element is a priority tag."""
|
||||
return element.name in self.priority_tags
|
||||
|
||||
def is_non_content_tag(self, element: Tag) -> bool:
|
||||
"""Checks if the element is a non-content tag."""
|
||||
return element.name in self.non_content_tags
|
||||
|
||||
def has_negative_class_id(self, element: Tag) -> bool:
|
||||
"""Checks if the element has negative indicators in its class or id."""
|
||||
class_id = ' '.join(filter(None, [
|
||||
self.get_attr_str(element.get('class')),
|
||||
element.get('id', '')
|
||||
]))
|
||||
return bool(self.negative_patterns.search(class_id))
|
||||
|
||||
def has_positive_class_id(self, element: Tag) -> bool:
|
||||
"""Checks if the element has positive indicators in its class or id."""
|
||||
class_id = ' '.join(filter(None, [
|
||||
self.get_attr_str(element.get('class')),
|
||||
element.get('id', '')
|
||||
]))
|
||||
return bool(self.positive_patterns.search(class_id))
|
||||
|
||||
@staticmethod
|
||||
def get_attr_str(attr) -> str:
|
||||
"""Converts an attribute value to a string."""
|
||||
if isinstance(attr, list):
|
||||
return ' '.join(attr)
|
||||
elif isinstance(attr, str):
|
||||
return attr
|
||||
else:
|
||||
return ''
|
||||
|
||||
def is_high_text_density(self, element: Tag) -> bool:
|
||||
"""Determines if the element has high text density."""
|
||||
text_density = self.calculate_text_density(element)
|
||||
return text_density > self.text_density_threshold
|
||||
|
||||
def calculate_text_density(self, element: Tag) -> float:
|
||||
"""Calculates the text density of an element."""
|
||||
text_length = len(element.get_text(strip=True))
|
||||
tag_count = len(element.find_all())
|
||||
tag_count = tag_count or 1 # Prevent division by zero
|
||||
return text_length / tag_count
|
||||
|
||||
def is_low_link_density(self, element: Tag) -> bool:
|
||||
"""Determines if the element has low link density."""
|
||||
link_density = self.calculate_link_density(element)
|
||||
return link_density < self.link_density_threshold
|
||||
|
||||
def calculate_link_density(self, element: Tag) -> float:
|
||||
"""Calculates the link density of an element."""
|
||||
text = element.get_text(strip=True)
|
||||
if not text:
|
||||
return 0.0
|
||||
link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
|
||||
return len(link_text) / len(text) if text else 0.0
|
||||
|
||||
def has_sufficient_content(self, element: Tag) -> bool:
|
||||
"""Checks if the element has sufficient word count."""
|
||||
word_count = len(element.get_text(separator=' ', strip=True).split())
|
||||
return word_count >= self.min_word_count
|
||||
|
||||
def calculate_dom_depth(self, element: Tag) -> int:
|
||||
"""Calculates the depth of an element in the DOM tree."""
|
||||
depth = 0
|
||||
current_element = element
|
||||
while current_element.parent and depth < self.max_dom_depth:
|
||||
depth += 1
|
||||
current_element = current_element.parent
|
||||
return depth
|
||||
|
||||
def has_headings(self, element: Tag) -> bool:
|
||||
"""Checks if the element contains heading tags."""
|
||||
return bool(element.find(['h1', 'h2', 'h3']))
|
||||
|
||||
def clean_element(self, element: Tag) -> Tag:
|
||||
"""
|
||||
Cleans the selected element by removing unnecessary attributes and nested non-content elements.
|
||||
|
||||
Args:
|
||||
element (Tag): The HTML element to clean.
|
||||
|
||||
Returns:
|
||||
Tag: The cleaned HTML element.
|
||||
"""
|
||||
for tag in element.find_all(['script', 'style', 'aside']):
|
||||
tag.decompose()
|
||||
for tag in element.find_all():
|
||||
attrs = dict(tag.attrs)
|
||||
for attr in attrs:
|
||||
if attr in ['style', 'onclick', 'onmouseover', 'align', 'bgcolor']:
|
||||
del tag.attrs[attr]
|
||||
return element
|
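The scoring heuristic in `calculate_content_score` can be traced without a parsed document. The sketch below is a simplified stand-in, not the class itself: the function `content_score` and its flat parameters (text length, tag count, and so on) are assumptions made for illustration, while the patterns, weights and thresholds are copied from the file above.

```python
import re

# Patterns, tag set, weights and thresholds as they appear in the file above.
NEGATIVE = re.compile(r"nav|footer|header|sidebar|ads|comment", re.I)
POSITIVE = re.compile(r"content|article|main|post", re.I)
PRIORITY_TAGS = {"article", "main", "section", "div"}


def content_score(tag_name, class_id, text_len, tag_count,
                  link_text_len, word_count, has_headings, depth):
    """Hypothetical flat version of the scoring rules (inputs precomputed)."""
    score = 0.0
    if tag_name in PRIORITY_TAGS:
        score += 5.0
    if POSITIVE.search(class_id):
        score += 3.0
    if NEGATIVE.search(class_id):
        score -= 3.0
    if text_len / (tag_count or 1) > 9.0:      # text density threshold
        score += 2.0
    link_density = link_text_len / text_len if text_len else 0.0
    if link_density < 0.2:                     # link density threshold
        score += 2.0
    if word_count >= 50:                       # minimum word count
        score += 2.0
    if has_headings:
        score += 3.0
    score += min(depth, 10) * 0.5              # depth bonus, capped at 10 levels
    return score


article = content_score("article", "post-body", 1200, 20, 60, 210, True, 4)
sidebar = content_score("div", "sidebar-links", 300, 40, 240, 45, False, 6)
print(article, sidebar)
```

One thing the sketch makes visible: the patterns are matched as substrings, so a class such as `downloads` also matches the `ads` alternative. In the real class, elements with a negative class or id are skipped before scoring, so the `-3.0` branch only matters if that filter changes.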
||||
@@ -7,13 +7,17 @@ from .config import *
|
||||
from bs4 import element, NavigableString, Comment
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
|
||||
from .utils import (
|
||||
sanitize_input_encode,
|
||||
sanitize_html,
|
||||
extract_metadata,
|
||||
InvalidCSSSelectorError,
|
||||
CustomHTML2Text
|
||||
CustomHTML2Text,
|
||||
normalize_url,
|
||||
is_external_url
|
||||
|
||||
)
|
||||
|
||||
class ContentScrappingStrategy(ABC):
|
||||
@@ -33,12 +37,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
|
||||
|
||||
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
body = soup.body
|
||||
|
||||
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
|
||||
for tag in kwargs.get('excluded_tags', []) or []:
|
||||
@@ -64,6 +70,8 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
|
||||
links = {'internal': [], 'external': []}
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
|
||||
# Extract meaningful text for media files from closest parent
|
||||
def find_closest_parent_with_useful_text(tag):
|
||||
@@ -125,7 +133,11 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
image_src = img.get('src','')
|
||||
if "data:image/" in image_src:
|
||||
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
|
||||
else:
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
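The format detection in this hunk can be exercised on its own. The helper name `image_format_from_src` is an assumption for illustration; the body follows the lines above: take the MIME subtype for `data:` URIs, otherwise the file extension with the dot and any query string removed.

```python
import os


def image_format_from_src(src: str) -> str:
    """Hypothetical helper mirroring the format detection in the hunk above."""
    if "data:image/" in src:
        # "data:image/png;base64,...." -> "png"
        fmt = src.split(",")[0].split(";")[0].split("/")[1]
    else:
        fmt = os.path.splitext(src)[1].lower()
    # Drop the leading dot and anything after a "?" in the extension.
    return fmt.strip(".").split("?")[0]


print(image_format_from_src("data:image/png;base64,iVBORw0KGgo="))
print(image_format_from_src("https://example.com/a/photo.JPG?w=200"))
print(image_format_from_src("https://example.com/a/photo.jpg?v=1.2"))
```

The last call shows a limitation of applying `os.path.splitext` to a full URL: a dot inside the query string is treated as the extension separator, so the result is `2` rather than `jpg`. Parsing the URL path first would avoid that, at the cost of a little more code.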
||||
@@ -149,6 +161,8 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
score+=1
|
||||
return score
|
||||
|
||||
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
@@ -163,6 +177,19 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
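The attribute filter above mutates a BeautifulSoup element in place. As a rough illustration of the same rule on a plain dictionary, the sketch below returns a filtered copy instead; `filter_attrs` is a hypothetical name, and `IMPORTANT_ATTRS` is copied from the config hunk earlier in this diff.

```python
IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']


def filter_attrs(attrs, important_attrs, keep_data_attributes=False):
    """Keep whitelisted attributes, and optionally any data-* attribute."""
    return {
        name: value
        for name, value in attrs.items()
        if name in important_attrs
        or (keep_data_attributes and name.startswith("data-"))
    }


attrs = {
    "src": "/img/cat.png",
    "class": "hero",
    "data-lazy-src": "/img/cat-large.png",
    "onclick": "track()",
}
print(filter_attrs(attrs, IMPORTANT_ATTRS))
print(filter_attrs(attrs, IMPORTANT_ATTRS, keep_data_attributes=True))
```

Collecting the names to remove first, as the original does, serves the same purpose as building a new dictionary here: neither version deletes from a mapping while iterating over it.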
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
@@ -179,21 +206,106 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element['href']
|
||||
url_base = url.split('/')[2]
|
||||
link_data = {'href': href, 'text': element.get_text()}
|
||||
if href.startswith('http') and url_base not in href:
|
||||
links['external'].append(link_data)
|
||||
else:
|
||||
links['internal'].append(link_data)
|
||||
keep_element = True
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
elif element.name == 'img':
|
||||
return True # Always keep image elements
|
||||
|
||||
elif element.name in ['video', 'audio']:
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing images: {str(e)}")
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
@@ -210,14 +322,15 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name != 'pre':
|
||||
if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
else:
|
||||
element.unwrap()
|
||||
elif element.name != 'img':
|
||||
element.attrs = {}
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
print('Error removing unwanted attributes:', str(e))
|
||||
|
||||
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
@@ -251,9 +364,15 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
# ]
|
||||
|
||||
process_element(body)
|
||||
|
||||
# Update the links dictionary with unique links
|
||||
links['internal'] = list(internal_links_dict.values())
|
||||
links['external'] = list(external_links_dict.values())
|
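The two dictionaries feeding these lists deduplicate links by their normalized URL. The sketch below shows that idea using only the standard library. It does not reproduce `normalize_url` or `is_external_url` from `utils`, whose bodies are not part of this diff; `classify_links` is a hypothetical stand-in that resolves each href with `urljoin` and treats any link with a different host as external, which is a simplification (subdomains count as external here).

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop duplicates, split by host."""
    base_host = urlparse(page_url).netloc
    internal, external = {}, {}
    for href in hrefs:
        href = href.strip()
        if not href:  # skip empty hrefs
            continue
        absolute = urljoin(page_url, href)
        host = urlparse(absolute).netloc
        bucket = external if host and host != base_host else internal
        # setdefault keeps the first occurrence of each URL.
        bucket.setdefault(absolute, {"href": absolute})
    return list(internal.values()), list(external.values())


internal, external = classify_links(
    "https://example.com/docs/",
    [
        "/about",
        "https://example.com/about",   # duplicate of the first after resolving
        "//cdn.example.org/lib.js",    # protocol-relative
        "https://other.test/x",
        " ",
        "#top",                        # anchor link
    ],
)
print(internal)
print(external)
```

Keying the dictionaries by the resolved URL is what makes `/about` and `https://example.com/about` collapse into one entry.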
||||
|
||||
|
||||
# # Process images using ThreadPoolExecutor
|
||||
imgs = body.find_all('img')
|
||||
|
||||
with ThreadPoolExecutor() as executor:
|
||||
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
|
||||
media['images'] = [result for result in image_results if result is not None]
|
||||
@@ -273,12 +392,42 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
try:
|
||||
str(body)
|
||||
except Exception as e:
|
||||
# Reset body to the original HTML
|
||||
success = False
|
||||
body = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
# Create a new div with a special ID
|
||||
error_div = body.new_tag('div', id='crawl4ai_error_message')
|
||||
error_div.string = '''
|
||||
Crawl4AI Error: This page is not fully supported.
|
||||
|
||||
Possible reasons:
|
||||
1. The page may have restrictions that prevent crawling.
|
||||
2. The page might not be fully loaded.
|
||||
|
||||
Suggestions:
|
||||
- Try calling the crawl function with these parameters:
|
||||
magic=True,
|
||||
- Set headless=False to visualize what's happening on the page.
|
||||
|
||||
If the issue persists, please check the page's structure and any potential anti-crawling measures.
|
||||
'''
|
||||
|
||||
# Append the error div to the body
|
||||
body.body.append(error_div)
|
||||
|
||||
print("[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
|
||||
|
||||
|
||||
cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
|
||||
|
||||
h = CustomHTML2Text()
|
||||
h.ignore_links = True
|
||||
h.body_width = 0
|
||||
try:
|
||||
h = CustomHTML2Text()
|
||||
h.update_params(**kwargs.get('html2text', {}))
|
||||
markdown = h.handle(cleaned_html)
|
||||
except Exception as e:
|
||||
markdown = h.handle(sanitize_html(cleaned_html))
|
||||
@@ -289,12 +438,18 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
except Exception as e:
|
||||
print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
cleaner = ContentCleaningStrategy()
|
||||
fit_html = cleaner.clean(cleaned_html)
|
||||
fit_markdown = h.handle(fit_html)
|
||||
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
return {
|
||||
'markdown': markdown,
|
||||
'fit_markdown': fit_markdown,
|
||||
'fit_html': fit_html,
|
||||
'cleaned_html': cleaned_html,
|
||||
'success': True,
|
||||
'success': success,
|
||||
'media': media,
|
||||
'links': links,
|
||||
'metadata': meta
|
||||
|
||||
@@ -68,7 +68,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
super().__init__()
|
||||
self.provider = provider
|
||||
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
|
||||
self.api_token = api_token or PROVIDER_MODELS.get(provider, "no-token") or os.getenv("OPENAI_API_KEY")
|
||||
self.instruction = instruction
|
||||
self.extract_type = extraction_type
|
||||
self.schema = schema
|
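The new fallback chain for `api_token` changes which branch is reached for providers missing from `PROVIDER_MODELS`. The sketch below isolates that expression; `resolve_api_token` and its `provider_models` and `env` parameters are assumptions introduced so the behaviour can be checked without real environment variables.

```python
import os


def resolve_api_token(provider, api_token=None, provider_models=None, env=os.environ):
    """Hypothetical helper reproducing the `or` chain from the line above."""
    provider_models = provider_models or {}
    return (
        api_token
        or provider_models.get(provider, "no-token")
        or env.get("OPENAI_API_KEY")
    )


table = {"ollama/llama3": "no-token-needed", "groq/llama3-8b-8192": None}
env = {"OPENAI_API_KEY": "sk-env"}

print(resolve_api_token("openai/gpt-4o", "sk-explicit", table, env))
print(resolve_api_token("ollama/llama3", None, table, env))
print(resolve_api_token("unknown/model", None, table, env))
print(resolve_api_token("groq/llama3-8b-8192", None, table, env))
```

Under these assumptions, an unlisted provider now resolves to the literal `"no-token"` and never reaches the environment lookup. The `OPENAI_API_KEY` fallback is only reached when the provider is listed but its own key is unset, as in the last call.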
||||
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
|
||||
self.apply_chunking = kwargs.get("apply_chunking", True)
|
||||
self.base_url = kwargs.get("base_url", None)
|
||||
self.api_base = kwargs.get("api_base", kwargs.get("base_url", None))
|
||||
self.extra_args = kwargs.get("extra_args", {})
|
||||
if not self.apply_chunking:
|
||||
self.chunk_token_threshold = 1e9
|
||||
@@ -116,7 +117,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
self.provider,
|
||||
prompt_with_variables,
|
||||
self.api_token,
|
||||
base_url=self.base_url,
|
||||
base_url=self.api_base or self.base_url,
|
||||
extra_args = self.extra_args
|
||||
) # , json_response=self.extract_type == "schema")
|
||||
try:
|
||||
@@ -234,11 +235,12 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Initialize the strategy with clustering parameters.
|
||||
|
||||
:param semantic_filter: A keyword filter for document filtering.
|
||||
:param word_count_threshold: Minimum number of words per cluster.
|
||||
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
|
||||
:param linkage_method: The linkage method for hierarchical clustering.
|
||||
:param top_k: Number of top categories to extract.
|
||||
Args:
|
||||
semantic_filter (str): A keyword filter for document filtering.
|
||||
word_count_threshold (int): Minimum number of words per cluster.
|
||||
max_dist (float): The maximum cophenetic distance on the dendrogram to form clusters.
|
||||
linkage_method (str): The linkage method for hierarchical clustering.
|
||||
top_k (int): Number of top categories to extract.
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
@@ -257,8 +259,8 @@ class CosineStrategy(ExtractionStrategy):
|
||||
self.get_embedding_method = "direct"
|
||||
|
||||
self.device = get_device()
|
||||
import torch
|
||||
self.device = torch.device('cpu')
|
||||
# import torch
|
||||
# self.device = torch.device('cpu')
|
||||
|
||||
self.default_batch_size = calculate_batch_size(self.device)
|
||||
|
||||
@@ -271,7 +273,7 @@ class CosineStrategy(ExtractionStrategy):
|
||||
# self.get_embedding_method = "direct"
|
||||
# else:
|
||||
|
||||
self.tokenizer, self.model = load_bge_small_en_v1_5()
|
||||
self.tokenizer, self.model = load_HF_embedding_model(model_name)
|
||||
self.model.to(self.device)
|
||||
self.model.eval()
|
||||
|
||||
@@ -738,7 +740,6 @@ class JsonCssExtractionStrategy(ExtractionStrategy):
|
||||
combined_html = self.DEL.join(sections)
|
||||
return self.extract(url, combined_html, **kwargs)
|
||||
|
||||
|
||||
class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
1015
crawl4ai/html2text/__init__.py
Normal file
File diff suppressed because it is too large
Load Diff
3
crawl4ai/html2text/__main__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .cli import main
|
||||
|
||||
main()
|
||||
2
crawl4ai/html2text/_typing.py
Normal file
@@ -0,0 +1,2 @@
|
||||
class OutCallback:
|
||||
def __call__(self, s: str) -> None: ...
|
||||
330
crawl4ai/html2text/cli.py
Normal file
@@ -0,0 +1,330 @@
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
from . import HTML2Text, __version__, config
|
||||
|
||||
|
||||
def main() -> None:
|
||||
baseurl = ""
|
||||
|
||||
class bcolors:
|
||||
HEADER = "\033[95m"
|
||||
OKBLUE = "\033[94m"
|
||||
OKGREEN = "\033[92m"
|
||||
WARNING = "\033[93m"
|
||||
FAIL = "\033[91m"
|
||||
ENDC = "\033[0m"
|
||||
BOLD = "\033[1m"
|
||||
UNDERLINE = "\033[4m"
|
||||
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument(
|
||||
"--default-image-alt",
|
||||
dest="default_image_alt",
|
||||
default=config.DEFAULT_IMAGE_ALT,
|
||||
help="The default alt string for images with missing ones",
|
||||
)
|
||||
p.add_argument(
|
||||
"--pad-tables",
|
||||
dest="pad_tables",
|
||||
action="store_true",
|
||||
default=config.PAD_TABLES,
|
||||
help="pad the cells to equal column width in tables",
|
||||
)
|
||||
p.add_argument(
|
||||
"--no-wrap-links",
|
||||
dest="wrap_links",
|
||||
action="store_false",
|
||||
default=config.WRAP_LINKS,
|
||||
help="don't wrap links during conversion",
|
||||
)
|
||||
p.add_argument(
|
||||
"--wrap-list-items",
|
||||
dest="wrap_list_items",
|
||||
action="store_true",
|
||||
default=config.WRAP_LIST_ITEMS,
|
||||
help="wrap list items during conversion",
|
||||
)
|
||||
p.add_argument(
|
||||
"--wrap-tables",
|
||||
dest="wrap_tables",
|
||||
action="store_true",
|
||||
default=config.WRAP_TABLES,
|
||||
help="wrap tables",
|
||||
)
|
||||
p.add_argument(
|
||||
"--ignore-emphasis",
|
||||
dest="ignore_emphasis",
|
||||
action="store_true",
|
||||
default=config.IGNORE_EMPHASIS,
|
||||
help="don't include any formatting for emphasis",
|
||||
)
|
||||
p.add_argument(
|
||||
"--reference-links",
|
||||
dest="inline_links",
|
||||
action="store_false",
|
||||
default=config.INLINE_LINKS,
|
||||
help="use reference style links instead of inline links",
|
||||
)
|
||||
p.add_argument(
|
||||
"--ignore-links",
|
||||
dest="ignore_links",
|
||||
action="store_true",
|
||||
default=config.IGNORE_ANCHORS,
|
||||
help="don't include any formatting for links",
|
||||
)
|
||||
p.add_argument(
|
||||
"--ignore-mailto-links",
|
||||
action="store_true",
|
||||
dest="ignore_mailto_links",
|
||||
default=config.IGNORE_MAILTO_LINKS,
|
||||
help="don't include mailto: links",
|
||||
)
|
||||
p.add_argument(
|
||||
"--protect-links",
|
||||
dest="protect_links",
|
||||
action="store_true",
|
||||
default=config.PROTECT_LINKS,
|
||||
help="protect links from line breaks surrounding them with angle brackets",
|
||||
)
|
||||
p.add_argument(
|
||||
"--ignore-images",
|
||||
dest="ignore_images",
|
||||
action="store_true",
|
||||
default=config.IGNORE_IMAGES,
|
||||
help="don't include any formatting for images",
|
||||
)
|
||||
p.add_argument(
|
||||
"--images-as-html",
|
||||
dest="images_as_html",
|
||||
action="store_true",
|
||||
default=config.IMAGES_AS_HTML,
|
||||
help=(
|
||||
"Always write image tags as raw html; preserves `height`, `width` and "
|
||||
"`alt` if possible."
|
||||
),
|
||||
)
|
||||
p.add_argument(
|
||||
"--images-to-alt",
|
||||
dest="images_to_alt",
|
||||
action="store_true",
|
||||
default=config.IMAGES_TO_ALT,
|
||||
help="Discard image data, only keep alt text",
|
||||
)
|
||||
p.add_argument(
|
||||
"--images-with-size",
|
||||
dest="images_with_size",
|
||||
action="store_true",
|
||||
default=config.IMAGES_WITH_SIZE,
|
||||
help=(
|
||||
"Write image tags with height and width attrs as raw html to retain "
|
||||
"dimensions"
|
||||
),
|
||||
)
|
||||
p.add_argument(
|
||||
"-g",
|
||||
"--google-doc",
|
||||
action="store_true",
|
||||
dest="google_doc",
|
||||
default=False,
|
||||
help="convert an html-exported Google Document",
|
||||
)
|
||||
p.add_argument(
|
||||
"-d",
|
||||
"--dash-unordered-list",
|
||||
action="store_true",
|
||||
dest="ul_style_dash",
|
||||
default=False,
|
||||
help="use a dash rather than a star for unordered list items",
|
||||
)
|
||||
p.add_argument(
|
||||
"-e",
|
||||
"--asterisk-emphasis",
|
||||
action="store_true",
|
||||
dest="em_style_asterisk",
|
||||
default=False,
|
||||
help="use an asterisk rather than an underscore for emphasized text",
|
||||
)
|
||||
p.add_argument(
|
||||
"-b",
|
||||
"--body-width",
|
||||
dest="body_width",
|
||||
type=int,
|
||||
default=config.BODY_WIDTH,
|
||||
help="number of characters per output line, 0 for no wrap",
|
||||
)
|
||||
p.add_argument(
|
||||
"-i",
|
||||
"--google-list-indent",
|
||||
dest="list_indent",
|
||||
type=int,
|
||||
default=config.GOOGLE_LIST_INDENT,
|
||||
help="number of pixels Google indents nested lists",
|
||||
)
|
||||
p.add_argument(
|
||||
"-s",
|
||||
"--hide-strikethrough",
|
||||
action="store_true",
|
||||
dest="hide_strikethrough",
|
||||
default=False,
|
||||
help="hide strike-through text. only relevant when -g is specified as well",
|
||||
)
|
||||
p.add_argument(
|
||||
"--escape-all",
|
||||
action="store_true",
|
||||
dest="escape_snob",
|
||||
default=False,
|
||||
help=(
|
||||
"Escape all special characters. Output is less readable, but avoids "
|
||||
"corner case formatting issues."
|
||||
),
|
||||
)
|
||||
p.add_argument(
|
||||
"--bypass-tables",
|
||||
action="store_true",
|
||||
dest="bypass_tables",
|
||||
default=config.BYPASS_TABLES,
|
||||
help="Format tables in HTML rather than Markdown syntax.",
|
||||
)
|
||||
p.add_argument(
|
||||
"--ignore-tables",
|
||||
action="store_true",
|
||||
dest="ignore_tables",
|
||||
default=config.IGNORE_TABLES,
|
||||
help="Ignore table-related tags (table, th, td, tr) while keeping rows.",
|
||||
)
|
||||
p.add_argument(
|
||||
"--single-line-break",
|
||||
action="store_true",
|
||||
dest="single_line_break",
|
||||
default=config.SINGLE_LINE_BREAK,
|
||||
help=(
|
||||
"Use a single line break after a block element rather than two line "
|
||||
"breaks. NOTE: Requires --body-width=0"
|
||||
),
|
||||
)
|
||||
p.add_argument(
|
||||
"--unicode-snob",
|
||||
action="store_true",
|
||||
dest="unicode_snob",
|
||||
default=config.UNICODE_SNOB,
|
||||
help="Use unicode throughout document",
|
||||
)
|
||||
p.add_argument(
|
||||
"--no-automatic-links",
|
||||
action="store_false",
|
||||
dest="use_automatic_links",
|
||||
default=config.USE_AUTOMATIC_LINKS,
|
||||
help="Do not use automatic links wherever applicable",
|
||||
)
|
||||
p.add_argument(
|
||||
"--no-skip-internal-links",
|
||||
action="store_false",
|
||||
dest="skip_internal_links",
|
||||
default=config.SKIP_INTERNAL_LINKS,
|
||||
help="Do not skip internal links",
|
||||
)
|
||||
p.add_argument(
|
||||
"--links-after-para",
|
||||
action="store_true",
|
||||
dest="links_each_paragraph",
|
||||
default=config.LINKS_EACH_PARAGRAPH,
|
||||
help="Put links after each paragraph instead of document",
|
||||
)
|
||||
p.add_argument(
|
||||
"--mark-code",
|
||||
action="store_true",
|
||||
dest="mark_code",
|
||||
default=config.MARK_CODE,
|
||||
help="Mark program code blocks with [code]...[/code]",
|
||||
)
|
||||
p.add_argument(
|
||||
"--decode-errors",
|
||||
dest="decode_errors",
|
||||
default=config.DECODE_ERRORS,
|
||||
help=(
|
||||
"What to do in case of decode errors. 'ignore', 'strict' and 'replace' are "
|
||||
"acceptable values"
|
||||
),
|
||||
)
|
||||
p.add_argument(
|
||||
"--open-quote",
|
||||
dest="open_quote",
|
||||
default=config.OPEN_QUOTE,
|
||||
help="The character used to open quotes",
|
||||
)
|
||||
p.add_argument(
|
||||
"--close-quote",
|
||||
dest="close_quote",
|
||||
default=config.CLOSE_QUOTE,
|
||||
help="The character used to close quotes",
|
||||
)
|
||||
p.add_argument(
|
||||
"--version", action="version", version=".".join(map(str, __version__))
|
||||
)
|
||||
p.add_argument("filename", nargs="?")
|
||||
p.add_argument("encoding", nargs="?", default="utf-8")
|
||||
p.add_argument(
|
||||
"--include-sup-sub",
|
||||
dest="include_sup_sub",
|
||||
action="store_true",
|
||||
default=config.INCLUDE_SUP_SUB,
|
||||
help="Include the sup and sub tags",
|
||||
)
|
||||
args = p.parse_args()
|
||||
|
||||
if args.filename and args.filename != "-":
|
||||
with open(args.filename, "rb") as fp:
|
||||
data = fp.read()
|
||||
else:
|
||||
data = sys.stdin.buffer.read()
|
||||
|
||||
try:
|
||||
html = data.decode(args.encoding, args.decode_errors)
|
||||
except UnicodeDecodeError as err:
|
||||
warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
|
||||
warning += " Use the " + bcolors.OKGREEN
|
||||
warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
|
||||
print(warning)
|
||||
raise err
|
||||
|
||||
h = HTML2Text(baseurl=baseurl)
|
||||
# handle options
|
||||
if args.ul_style_dash:
|
||||
h.ul_item_mark = "-"
|
||||
if args.em_style_asterisk:
|
||||
h.emphasis_mark = "*"
|
||||
h.strong_mark = "__"
|
||||
|
||||
h.body_width = args.body_width
|
||||
h.google_list_indent = args.list_indent
|
||||
h.ignore_emphasis = args.ignore_emphasis
|
||||
h.ignore_links = args.ignore_links
|
||||
h.ignore_mailto_links = args.ignore_mailto_links
|
||||
h.protect_links = args.protect_links
|
||||
h.ignore_images = args.ignore_images
|
||||
h.images_as_html = args.images_as_html
|
||||
h.images_to_alt = args.images_to_alt
|
||||
h.images_with_size = args.images_with_size
|
||||
h.google_doc = args.google_doc
|
||||
h.hide_strikethrough = args.hide_strikethrough
|
||||
h.escape_snob = args.escape_snob
|
||||
h.bypass_tables = args.bypass_tables
|
||||
h.ignore_tables = args.ignore_tables
|
||||
h.single_line_break = args.single_line_break
|
||||
h.inline_links = args.inline_links
|
||||
h.unicode_snob = args.unicode_snob
|
||||
h.use_automatic_links = args.use_automatic_links
|
||||
h.skip_internal_links = args.skip_internal_links
|
||||
h.links_each_paragraph = args.links_each_paragraph
|
||||
h.mark_code = args.mark_code
|
||||
h.wrap_links = args.wrap_links
|
||||
h.wrap_list_items = args.wrap_list_items
|
||||
h.wrap_tables = args.wrap_tables
|
||||
h.pad_tables = args.pad_tables
|
||||
h.default_image_alt = args.default_image_alt
|
||||
h.open_quote = args.open_quote
|
||||
h.close_quote = args.close_quote
|
||||
h.include_sup_sub = args.include_sup_sub
|
||||
|
||||
sys.stdout.write(h.handle(html))
|
||||
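The `--decode-errors` value is passed to `bytes.decode` as its `errors` argument, so the three accepted values behave as they do in the standard library. A minimal illustration, assuming a made-up Latin-1 byte string that is not valid UTF-8 (the helper name `decode_with` is hypothetical):

```python
# Hypothetical input: "café" encoded as Latin-1, which is invalid UTF-8.
data = b"caf\xe9"


def decode_with(errors: str) -> str:
    """Decode the sample bytes as UTF-8 with the given error handler."""
    return data.decode("utf-8", errors)


ignored = decode_with("ignore")    # the invalid byte is dropped
replaced = decode_with("replace")  # the invalid byte becomes U+FFFD

try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True           # "strict" raises instead of guessing
```

This is why the warning printed in `main()` suggests `--decode-errors=ignore` when the default `strict` handler fails.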
172
crawl4ai/html2text/config.py
Normal file
@@ -0,0 +1,172 @@
|
||||
import re
|
||||
|
||||
# Use Unicode characters instead of their ascii pseudo-replacements
|
||||
UNICODE_SNOB = False
|
||||
|
||||
# Marker to use for marking tables for padding post processing
|
||||
TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
|
||||
# Escape all special characters. Output is less readable, but avoids
|
||||
# corner case formatting issues.
|
||||
ESCAPE_SNOB = False
|
||||
ESCAPE_BACKSLASH = False
|
||||
ESCAPE_DOT = False
|
||||
ESCAPE_PLUS = False
|
||||
ESCAPE_DASH = False
|
||||
|
||||
# Put the links after each paragraph instead of at the end.
|
||||
LINKS_EACH_PARAGRAPH = False
|
||||
|
||||
# Wrap long lines at position. 0 for no wrapping.
|
||||
BODY_WIDTH = 78
|
||||
|
||||
# Don't show internal links (href="#local-anchor") -- corresponding link
|
||||
# targets won't be visible in the plain text file anyway.
|
||||
SKIP_INTERNAL_LINKS = True
|
||||
|
||||
# Use inline, rather than reference, formatting for images and links
|
||||
INLINE_LINKS = True
|
||||
|
||||
# Protect links from line breaks by surrounding them with angle brackets (in
|
||||
# addition to their square brackets)
|
||||
PROTECT_LINKS = False
|
||||
# Wrap lines that contain links
|
||||
WRAP_LINKS = True
|
||||
|
||||
# Wrap list items.
|
||||
WRAP_LIST_ITEMS = False
|
||||
|
||||
# Wrap tables
|
||||
WRAP_TABLES = False
|
||||
|
||||
# Number of pixels Google indents nested lists
|
||||
GOOGLE_LIST_INDENT = 36
|
||||
|
||||
# Values Google and others may use to indicate bold text
|
||||
BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
|
||||
|
||||
IGNORE_ANCHORS = False
|
||||
IGNORE_MAILTO_LINKS = False
|
||||
IGNORE_IMAGES = False
|
||||
IMAGES_AS_HTML = False
|
||||
IMAGES_TO_ALT = False
|
||||
IMAGES_WITH_SIZE = False
|
||||
IGNORE_EMPHASIS = False
|
||||
MARK_CODE = False
|
||||
DECODE_ERRORS = "strict"
|
||||
DEFAULT_IMAGE_ALT = ""
|
||||
PAD_TABLES = False
|
||||
|
||||
# Convert links with same href and text to <href> format
|
||||
# if they are absolute links
|
||||
USE_AUTOMATIC_LINKS = True
|
||||
|
||||
# For checking space-only lines
|
||||
RE_SPACE = re.compile(r"\s+")
|
||||
|
||||
RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
|
||||
RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
|
||||
RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
|
||||
RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
|
||||
|
||||
# to find links in the text
|
||||
RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
|
||||
|
||||
# to find table separators
|
||||
RE_TABLE = re.compile(r" \| ")
|
||||
|
||||
RE_MD_DOT_MATCHER = re.compile(
|
||||
r"""
|
||||
^ # start of line
|
||||
(\s*\d+) # optional whitespace and a number
|
||||
(\.) # dot
|
||||
(?=\s) # lookahead assert whitespace
|
||||
""",
|
||||
re.MULTILINE | re.VERBOSE,
|
||||
)
|
||||
RE_MD_PLUS_MATCHER = re.compile(
|
||||
r"""
|
||||
^
|
||||
(\s*)
|
||||
(\+)
|
||||
(?=\s)
|
||||
""",
|
||||
flags=re.MULTILINE | re.VERBOSE,
|
||||
)
|
||||
RE_MD_DASH_MATCHER = re.compile(
|
||||
r"""
|
||||
^
|
||||
(\s*)
|
||||
(-)
|
||||
(?=\s|\-) # followed by whitespace (bullet list, or spaced out hr)
|
||||
# or another dash (header or hr)
|
||||
""",
|
||||
flags=re.MULTILINE | re.VERBOSE,
|
||||
)
|
||||
RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
|
||||
RE_MD_BACKSLASH_MATCHER = re.compile(
|
||||
r"""
|
||||
(\\) # match one backslash
|
||||
(?=[%s]) # followed by a char that requires escaping
|
||||
"""
|
||||
% re.escape(RE_SLASH_CHARS),
|
||||
flags=re.VERBOSE,
|
||||
)
|
||||
|
||||
UNIFIABLE = {
|
||||
"rsquo": "'",
|
||||
"lsquo": "'",
|
||||
"rdquo": '"',
|
||||
"ldquo": '"',
|
||||
"copy": "(C)",
|
||||
"mdash": "--",
|
||||
"nbsp": " ",
|
||||
"rarr": "->",
|
||||
"larr": "<-",
|
||||
"middot": "*",
|
||||
"ndash": "-",
|
||||
"oelig": "oe",
|
||||
"aelig": "ae",
|
||||
"agrave": "a",
|
||||
"aacute": "a",
|
||||
"acirc": "a",
|
||||
"atilde": "a",
|
||||
"auml": "a",
|
||||
"aring": "a",
|
||||
"egrave": "e",
|
||||
"eacute": "e",
|
||||
"ecirc": "e",
|
||||
"euml": "e",
|
||||
"igrave": "i",
|
||||
"iacute": "i",
|
||||
"icirc": "i",
|
||||
"iuml": "i",
|
||||
"ograve": "o",
|
||||
"oacute": "o",
|
||||
"ocirc": "o",
|
||||
"otilde": "o",
|
||||
"ouml": "o",
|
||||
"ugrave": "u",
|
||||
"uacute": "u",
|
||||
"ucirc": "u",
|
||||
"uuml": "u",
|
||||
"lrm": "",
|
||||
"rlm": "",
|
||||
}
|
||||
|
||||
# Format tables in HTML rather than Markdown syntax
|
||||
BYPASS_TABLES = False
|
||||
# Ignore table-related tags (table, th, td, tr) while keeping rows
|
||||
IGNORE_TABLES = False
|
||||
|
||||
|
||||
# Use a single line break after a block element rather than two line breaks.
|
||||
# NOTE: Requires body width setting to be 0.
|
||||
SINGLE_LINE_BREAK = False
|
||||
|
||||
|
||||
# Use double quotation marks when converting the <q> tag.
|
||||
OPEN_QUOTE = '"'
|
||||
CLOSE_QUOTE = '"'
|
||||
|
||||
# Include the <sup> and <sub> tags
|
||||
INCLUDE_SUP_SUB = False
|
||||
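As a quick check of what the ordered-list pattern in this file matches, the sketch below recompiles the same verbose expression under a local name (`ordered_dot`, chosen here for illustration) and applies the substitution string that `escape_md_section` uses for it. It is a standalone illustration, not part of the commit.

```python
import re

# Same expression as RE_MD_DOT_MATCHER, recompiled locally for the demo.
ordered_dot = re.compile(
    r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """,
    re.MULTILINE | re.VERBOSE,
)


def escape_leading_dot(text: str) -> str:
    """Insert a backslash before the dot of a line-initial 'N. ' sequence."""
    return ordered_dot.sub(r"\1\\\2", text)


escaped = escape_leading_dot("1. First\nVersion 2. is out\n3.14 is pi")
```

Only the first line changes: the second does not start with a number, and in the third the dot is not followed by whitespace.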
18
crawl4ai/html2text/elements.py
Normal file
@@ -0,0 +1,18 @@
|
||||
from typing import Dict, Optional
|
||||
|
||||
|
||||
class AnchorElement:
|
||||
__slots__ = ["attrs", "count", "outcount"]
|
||||
|
||||
def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
|
||||
self.attrs = attrs
|
||||
self.count = count
|
||||
self.outcount = outcount
|
||||
|
||||
|
||||
class ListElement:
|
||||
__slots__ = ["name", "num"]
|
||||
|
||||
def __init__(self, name: str, num: int):
|
||||
self.name = name
|
||||
self.num = num
|
||||
303
crawl4ai/html2text/utils.py
Normal file
@@ -0,0 +1,303 @@
|
||||
import html.entities
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
from . import config
|
||||
|
||||
unifiable_n = {
|
||||
html.entities.name2codepoint[k]: v
|
||||
for k, v in config.UNIFIABLE.items()
|
||||
if k != "nbsp"
|
||||
}
|
||||
|
||||
|
||||
def hn(tag: str) -> int:
|
||||
if tag[0] == "h" and len(tag) == 2:
|
||||
n = tag[1]
|
||||
if "0" < n <= "9":
|
||||
return int(n)
|
||||
return 0
|
||||
|
||||
|
||||
def dumb_property_dict(style: str) -> Dict[str, str]:
|
||||
"""
|
||||
:returns: A hash of css attributes
|
||||
"""
|
||||
return {
|
||||
x.strip().lower(): y.strip().lower()
|
||||
for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
|
||||
}
|
||||
|
||||
|
||||
def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
|
||||
"""
|
||||
:type data: str
|
||||
|
||||
:returns: A hash of css selectors, each of which contains a hash of
|
||||
css attributes.
|
||||
:rtype: dict
|
||||
"""
|
||||
# remove @import statements
|
||||
data += ";"
|
||||
importIndex = data.find("@import")
|
||||
while importIndex != -1:
|
||||
data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
|
||||
importIndex = data.find("@import")
|
||||
|
||||
# parse the css. reverted from dictionary comprehension in order to
|
||||
# support older pythons
|
||||
pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
|
||||
try:
|
||||
elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
|
||||
except ValueError:
|
||||
elements = {} # not that important
|
||||
|
||||
return elements
|
||||
|
||||
|
||||
def element_style(
|
||||
attrs: Dict[str, Optional[str]],
|
||||
style_def: Dict[str, Dict[str, str]],
|
||||
parent_style: Dict[str, str],
|
||||
) -> Dict[str, str]:
|
||||
"""
|
||||
:type attrs: dict
|
||||
:type style_def: dict
|
||||
:type parent_style: dict
|
||||
|
||||
:returns: A hash of the 'final' style attributes of the element
|
||||
:rtype: dict
|
||||
"""
|
||||
style = parent_style.copy()
|
||||
if "class" in attrs:
|
||||
assert attrs["class"] is not None
|
||||
for css_class in attrs["class"].split():
|
||||
css_style = style_def.get("." + css_class, {})
|
||||
style.update(css_style)
|
||||
if "style" in attrs:
|
||||
assert attrs["style"] is not None
|
||||
immediate_style = dumb_property_dict(attrs["style"])
|
||||
style.update(immediate_style)
|
||||
|
||||
return style
|
||||
|
||||
|
||||
def google_list_style(style: Dict[str, str]) -> str:
|
||||
"""
|
||||
Finds out whether this is an ordered or unordered list
|
||||
|
||||
:type style: dict
|
||||
|
||||
:rtype: str
|
||||
"""
|
||||
if "list-style-type" in style:
|
||||
list_style = style["list-style-type"]
|
||||
if list_style in ["disc", "circle", "square", "none"]:
|
||||
return "ul"
|
||||
|
||||
return "ol"
|
||||
|
||||
|
||||
def google_has_height(style: Dict[str, str]) -> bool:
|
||||
"""
|
||||
Check if the style of the element has the 'height' attribute
|
||||
explicitly defined
|
||||
|
||||
:type style: dict
|
||||
|
||||
:rtype: bool
|
||||
"""
|
||||
return "height" in style
|
||||
|
||||
|
||||
def google_text_emphasis(style: Dict[str, str]) -> List[str]:
|
||||
"""
|
||||
:type style: dict
|
||||
|
||||
:returns: A list of all emphasis modifiers of the element
|
||||
:rtype: list
|
||||
"""
|
||||
emphasis = []
|
||||
if "text-decoration" in style:
|
||||
emphasis.append(style["text-decoration"])
|
||||
if "font-style" in style:
|
||||
emphasis.append(style["font-style"])
|
||||
if "font-weight" in style:
|
||||
emphasis.append(style["font-weight"])
|
||||
|
||||
return emphasis
|
||||
|
||||
|
||||
def google_fixed_width_font(style: Dict[str, str]) -> bool:
|
||||
"""
|
||||
Check if the css of the current element defines a fixed width font
|
||||
|
||||
:type style: dict
|
||||
|
||||
:rtype: bool
|
||||
"""
|
||||
font_family = ""
|
||||
if "font-family" in style:
|
||||
font_family = style["font-family"]
|
||||
return "courier new" == font_family or "consolas" == font_family
|
||||
|
||||
|
||||
def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
|
||||
"""
|
||||
Extract numbering from list element attributes
|
||||
|
||||
:type attrs: dict
|
||||
|
||||
:rtype: int
|
||||
"""
|
||||
if "start" in attrs:
|
||||
assert attrs["start"] is not None
|
||||
try:
|
||||
return int(attrs["start"]) - 1
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def skipwrap(
|
||||
para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
|
||||
) -> bool:
|
||||
# If it appears to contain a link
|
||||
# don't wrap
|
||||
if not wrap_links and config.RE_LINK.search(para):
|
||||
return True
|
||||
# If the text begins with four spaces or one tab, it's a code block;
|
||||
# don't wrap
|
||||
if para[0:4] == "    " or para[0] == "\t":
|
||||
return True
|
||||
|
||||
# If the text begins with only two "--", possibly preceded by
|
||||
# whitespace, that's an emdash; so wrap.
|
||||
stripped = para.lstrip()
|
||||
if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
|
||||
return False
|
||||
|
||||
# I'm not sure what this is for; I thought it was to detect lists,
|
||||
# but there's a <br>-inside-<span> case in one of the tests that
|
||||
# also depends upon it.
|
||||
if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
|
||||
return not wrap_list_items
|
||||
|
||||
# If text contains a pipe character it is likely a table
|
||||
if not wrap_tables and config.RE_TABLE.search(para):
|
||||
return True
|
||||
|
||||
# If the text begins with a single -, *, or +, followed by a space,
|
||||
# or an integer, followed by a ., followed by a space (in either
|
||||
# case optionally preceded by whitespace), it's a list; don't wrap.
|
||||
return bool(
|
||||
config.RE_ORDERED_LIST_MATCHER.match(stripped)
|
||||
or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
|
||||
)
|
||||
|
||||
|
||||
def escape_md(text: str) -> str:
|
||||
"""
|
||||
Escapes markdown-sensitive characters within other markdown
|
||||
constructs.
|
||||
"""
|
||||
return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
|
||||
|
||||
|
||||
def escape_md_section(
|
||||
text: str,
|
||||
escape_backslash: bool = True,
|
||||
snob: bool = False,
|
||||
escape_dot: bool = True,
|
||||
escape_plus: bool = True,
|
||||
escape_dash: bool = True
|
||||
) -> str:
|
||||
"""
|
||||
Escapes markdown-sensitive characters across whole document sections.
|
||||
Each escaping operation can be controlled individually.
|
||||
"""
|
||||
if escape_backslash:
|
||||
text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
|
||||
|
||||
if snob:
|
||||
text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
|
||||
|
||||
if escape_dot:
|
||||
text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
|
||||
|
||||
if escape_plus:
|
||||
text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
|
||||
|
||||
if escape_dash:
|
||||
text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
|
||||
|
||||
return text
|
||||
|
||||
def reformat_table(lines: List[str], right_margin: int) -> List[str]:
|
||||
"""
|
||||
Given the lines of a table
|
||||
pads the cells and returns the new lines
|
||||
"""
|
||||
# find the maximum width of the columns
|
||||
max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
|
||||
max_cols = len(max_width)
|
||||
for line in lines:
|
||||
cols = [x.rstrip() for x in line.split("|")]
|
||||
num_cols = len(cols)
|
||||
|
||||
# don't drop any data if colspan attributes result in unequal lengths
|
||||
if num_cols < max_cols:
|
||||
cols += [""] * (max_cols - num_cols)
|
||||
elif max_cols < num_cols:
|
||||
max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
|
||||
max_cols = num_cols
|
||||
|
||||
max_width = [
|
||||
max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
|
||||
]
|
||||
|
||||
# reformat
|
||||
new_lines = []
|
||||
for line in lines:
|
||||
cols = [x.rstrip() for x in line.split("|")]
|
||||
if set(line.strip()) == set("-|"):
|
||||
filler = "-"
|
||||
new_cols = [
|
||||
x.rstrip() + (filler * (M - len(x.rstrip())))
|
||||
for x, M in zip(cols, max_width)
|
||||
]
|
||||
new_lines.append("|-" + "|".join(new_cols) + "|")
|
||||
else:
|
||||
filler = " "
|
||||
new_cols = [
|
||||
x.rstrip() + (filler * (M - len(x.rstrip())))
|
||||
for x, M in zip(cols, max_width)
|
||||
]
|
||||
new_lines.append("| " + "|".join(new_cols) + "|")
|
||||
return new_lines
|
||||
|
||||
|
||||
def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
|
||||
"""
|
||||
Provide padding for tables in the text
|
||||
"""
|
||||
lines = text.split("\n")
|
||||
table_buffer = [] # type: List[str]
|
||||
table_started = False
|
||||
new_lines = []
|
||||
for line in lines:
|
||||
# Toggle table started
|
||||
if config.TABLE_MARKER_FOR_PAD in line:
|
||||
table_started = not table_started
|
||||
if not table_started:
|
||||
table = reformat_table(table_buffer, right_margin)
|
||||
new_lines.extend(table)
|
||||
table_buffer = []
|
||||
new_lines.append("")
|
||||
continue
|
||||
# Process lines
|
||||
if table_started:
|
||||
table_buffer.append(line)
|
||||
else:
|
||||
new_lines.append(line)
|
||||
return "\n".join(new_lines)
|
||||
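The padding idea behind `reformat_table` can be seen in a smaller form. The sketch below is a simplified re-implementation written for illustration (the name `pad_table_rows` is hypothetical, and it omits the leading `| ` and trailing `|` that the function in the diff adds): measure the widest cell per column, then pad every cell to that width, filling separator rows with dashes.

```python
from typing import List


def pad_table_rows(lines: List[str], right_margin: int = 1) -> List[str]:
    """Pad every cell so that columns line up (simplified illustration)."""
    rows = [[cell.rstrip() for cell in line.split("|")] for line in lines]
    num_cols = max(len(row) for row in rows)
    # Short rows (for example because of colspan) get empty trailing cells.
    rows = [row + [""] * (num_cols - len(row)) for row in rows]
    widths = [
        max(len(row[i]) for row in rows) + right_margin for i in range(num_cols)
    ]
    padded = []
    for line, row in zip(lines, rows):
        # A separator row consists only of dashes and pipes; fill with dashes.
        filler = "-" if set(line.strip()) == set("-|") else " "
        padded.append(
            "|".join(cell + filler * (w - len(cell)) for cell, w in zip(row, widths))
        )
    return padded


table = pad_table_rows(["Name|Qty", "-|-", "Apple|3", "Kiwi|12"])
```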
@@ -72,10 +72,18 @@ def load_bert_base_uncased():
|
||||
return tokenizer, model
|
||||
|
||||
@lru_cache()
|
||||
def load_bge_small_en_v1_5():
|
||||
def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
|
||||
"""Load the Hugging Face model for embedding.
|
||||
|
||||
Args:
|
||||
model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
|
||||
|
||||
Returns:
|
||||
tuple: The tokenizer and model.
|
||||
"""
|
||||
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
|
||||
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
|
||||
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
|
||||
model = AutoModel.from_pretrained(model_name, resume_download=None)
|
||||
model.eval()
|
||||
model, device = set_model_device(model)
|
||||
return tokenizer, model
|
||||
|
||||
@@ -14,6 +14,8 @@ class CrawlResult(BaseModel):
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
screenshot: Optional[str] = None
|
||||
markdown: Optional[str] = None
|
||||
fit_markdown: Optional[str] = None
|
||||
fit_html: Optional[str] = None
|
||||
extracted_content: Optional[str] = None
|
||||
metadata: Optional[dict] = None
|
||||
error_message: Optional[str] = None
|
||||
|
||||
@@ -1,13 +1,12 @@
|
||||
import time
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from bs4 import BeautifulSoup, Comment, element, Tag, NavigableString
|
||||
import html2text
|
||||
import json
|
||||
import html
|
||||
import re
|
||||
import os
|
||||
import platform
|
||||
from html2text import HTML2Text
|
||||
from .html2text import HTML2Text
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS
|
||||
from .config import *
|
||||
from pathlib import Path
|
||||
@@ -182,9 +181,22 @@ def escape_json_string(s):
|
||||
class CustomHTML2Text(HTML2Text):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.ignore_links = True
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
|
||||
self.skip_internal_links = False
|
||||
self.single_line_break = False
|
||||
self.mark_code = False
|
||||
self.include_sup_sub = False
|
||||
self.body_width = 0
|
||||
self.ignore_mailto_links = True
|
||||
self.ignore_links = False
|
||||
self.escape_backslash = False
|
||||
self.escape_dot = False
|
||||
self.escape_plus = False
|
||||
self.escape_dash = False
|
||||
self.escape_snob = False
|
||||
|
||||
|
||||
def handle_tag(self, tag, attrs, start):
|
||||
if tag == 'pre':
|
||||
@@ -194,6 +206,10 @@ class CustomHTML2Text(HTML2Text):
|
||||
else:
|
||||
self.o('\n```')
|
||||
self.inside_pre = False
|
||||
elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
pass
|
||||
|
||||
|
||||
# elif tag == 'code' and not self.inside_pre:
|
||||
# if start:
|
||||
# if not self.inside_pre:
|
||||
@@ -692,8 +708,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
|
||||
@@ -964,4 +980,53 @@ def format_html(html_string):
|
||||
soup = BeautifulSoup(html_string, 'html.parser')
|
||||
return soup.prettify()
|
||||
|
||||
def normalize_url(href, base_url):
|
||||
"""Normalize URLs to ensure consistent format"""
|
||||
# Extract protocol and domain from base URL
|
||||
try:
|
||||
base_parts = base_url.split('/')
|
||||
protocol = base_parts[0]
|
||||
domain = base_parts[2]
|
||||
except IndexError:
|
||||
raise ValueError(f"Invalid base URL format: {base_url}")
|
||||
|
||||
# Handle special protocols
|
||||
special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
|
||||
if any(href.lower().startswith(proto) for proto in special_protocols):
|
||||
return href.strip()
|
||||
|
||||
# Handle anchor links
|
||||
if href.startswith('#'):
|
||||
return f"{base_url}{href}"
|
||||
|
||||
# Handle protocol-relative URLs
|
||||
if href.startswith('//'):
|
||||
return f"{protocol}{href}"
|
||||
|
||||
# Handle root-relative URLs
|
||||
if href.startswith('/'):
|
||||
return f"{protocol}//{domain}{href}"
|
||||
|
||||
# Handle relative URLs
|
||||
if not href.startswith(('http://', 'https://')):
|
||||
# Remove leading './' if present
|
||||
href = href[2:] if href.startswith('./') else href
|
||||
return f"{protocol}//{domain}/{href}"
|
||||
|
||||
return href.strip()
|
||||
|
||||
def is_external_url(url, base_domain):
|
||||
"""Determine if a URL is external"""
|
||||
special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
|
||||
if any(url.lower().startswith(proto) for proto in special_protocols):
|
||||
return True
|
||||
|
||||
try:
|
||||
# Handle URLs with protocol
|
||||
if url.startswith(('http://', 'https://')):
|
||||
url_domain = url.split('/')[2]
|
||||
return base_domain.lower() not in url_domain.lower()
|
||||
except IndexError:
|
||||
return False
|
||||
|
||||
return False
|
||||
|
||||
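For comparison, the same cases can be handled with the standard library. The sketch below is not what this commit does; it is an alternative under the assumption that RFC 3986 resolution via `urllib.parse.urljoin` is acceptable, and the names `resolve_href` and `is_external_host` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def resolve_href(href: str, base_url: str) -> str:
    """Resolve href against base_url using standard URL resolution."""
    return urljoin(base_url, href.strip())


def is_external_host(url: str, base_domain: str) -> bool:
    """Compare hostnames exactly (or as subdomains), not by substring."""
    parts = urlparse(url)
    # Non-HTTP schemes such as mailto: or tel: are treated as external.
    if parts.scheme and parts.scheme not in ("http", "https"):
        return True
    host = parts.hostname  # already lower-cased by urlparse
    if host is None:
        return False  # relative URL, so it belongs to the current site
    base = base_domain.lower()
    return not (host == base or host.endswith("." + base))


base = "https://example.com/docs/page.html"
resolved = [
    resolve_href(h, base)
    for h in ("#top", "//cdn.example.com/x.js", "/about", "../img/a.png")
]
```

One visible difference from the diff above: relative paths resolve against the directory of `base_url` rather than the site root, and a host such as `notexample.com` is not mistaken for `example.com`.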
157
docs/details/extraction.md
Normal file
@@ -0,0 +1,157 @@
|
||||
### Extraction Strategies
|
||||
|
||||
#### 1. LLMExtractionStrategy
|
||||
```python
|
||||
LLMExtractionStrategy(
|
||||
# Core Parameters
|
||||
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
|
||||
api_token: Optional[str] = None, # API token for the provider
|
||||
instruction: str = None, # Custom instruction for extraction
|
||||
schema: Dict = None, # Pydantic model schema for structured extraction
|
||||
extraction_type: str = "block", # Type of extraction: "block" or "schema"
|
||||
|
||||
# Chunking Parameters
|
||||
chunk_token_threshold: int = CHUNK_TOKEN_THRESHOLD, # Maximum tokens per chunk
|
||||
overlap_rate: float = OVERLAP_RATE, # Overlap between chunks
|
||||
word_token_rate: float = WORD_TOKEN_RATE, # Conversion rate from words to tokens
|
||||
apply_chunking: bool = True, # Whether to apply text chunking
|
||||
|
||||
# API Configuration
|
||||
base_url: str = None, # Base URL for API calls
|
||||
api_base: str = None, # Alternative base URL
|
||||
extra_args: Dict = {}, # Additional provider-specific arguments
|
||||
|
||||
verbose: bool = False # Enable verbose logging
|
||||
)
|
||||
```
|
||||
|
||||
Usage Example:
|
||||
```python
|
||||
from pydantic import BaseModel

class NewsArticle(BaseModel):
|
||||
title: str
|
||||
content: str
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
api_token="your-token",
|
||||
schema=NewsArticle.schema(),
|
||||
instruction="Extract news article content with title and main text"
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
|
||||
```
|
||||
|
||||
#### 2. JsonCssExtractionStrategy
|
||||
```python
|
||||
JsonCssExtractionStrategy(
|
||||
schema: Dict[str, Any], # Schema defining extraction rules
|
||||
verbose: bool = False # Enable verbose logging
|
||||
)
|
||||
|
||||
# Schema Structure
|
||||
schema = {
|
||||
"name": str, # Name of the extraction schema
|
||||
"baseSelector": str, # CSS selector for base elements
|
||||
"fields": [
|
||||
{
|
||||
"name": str, # Field name
|
||||
"selector": str, # CSS selector
|
||||
"type": str, # Field type: "text", "attribute", "html", "regex", "nested", "list", "nested_list"
|
||||
"attribute": str, # For type="attribute"
|
||||
"pattern": str, # For type="regex"
|
||||
"transform": str, # Optional: "lowercase", "uppercase", "strip"
|
||||
"default": Any, # Default value if extraction fails
|
||||
"fields": List[Dict], # For nested/list types
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Usage Example:
|
||||
```python
|
||||
schema = {
|
||||
"name": "News Articles",
|
||||
"baseSelector": "article.news-item",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h1",
|
||||
"type": "text",
|
||||
"transform": "strip"
|
||||
},
|
||||
{
|
||||
"name": "date",
|
||||
"selector": ".date",
|
||||
"type": "attribute",
|
||||
"attribute": "datetime"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
strategy = JsonCssExtractionStrategy(schema)
|
||||
result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
|
||||
```
|
||||
|
||||
#### 3. CosineStrategy
|
||||
```python
|
||||
CosineStrategy(
|
||||
# Content Filtering
|
||||
semantic_filter: str = None, # Keyword filter for document filtering
|
||||
word_count_threshold: int = 10, # Minimum words per cluster
|
||||
sim_threshold: float = 0.3, # Similarity threshold for filtering
|
||||
|
||||
# Clustering Parameters
|
||||
max_dist: float = 0.2, # Maximum distance for clustering
|
||||
linkage_method: str = 'ward', # Clustering linkage method
|
||||
top_k: int = 3, # Number of top categories to extract
|
||||
|
||||
# Model Configuration
|
||||
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
|
||||
|
||||
verbose: bool = False # Enable verbose logging
|
||||
)
|
||||
```
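To make `sim_threshold` concrete, here is a toy sketch of similarity-based filtering. It assumes hand-written two-dimensional vectors in place of real embeddings, and the function names are hypothetical; the actual strategy's internals are not shown in this document.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query: Sequence[float],
    documents: List[Sequence[float]],
    sim_threshold: float = 0.3,
) -> List[int]:
    """Return indices of documents at least sim_threshold similar to query."""
    return [
        i
        for i, doc in enumerate(documents)
        if cosine_similarity(query, doc) >= sim_threshold
    ]


# Toy vectors standing in for embeddings (assumption for illustration).
kept = filter_by_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```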
|
||||
|
||||
### Chunking Strategies
|
||||
|
||||
#### 1. RegexChunking
|
||||
```python
|
||||
RegexChunking(
|
||||
patterns: List[str] = None # List of regex patterns for splitting text
|
||||
# Default pattern: [r'\n\n']
|
||||
)
|
||||
```
|
||||
|
||||
Usage Example:
|
||||
```python
|
||||
chunker = RegexChunking(patterns=[r'\n\n', r'\.\s+']) # Split on double newlines and sentences
|
||||
chunks = chunker.chunk(text)
|
||||
```
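A rough sketch of how pattern-based splitting can work, assuming each pattern is applied in turn to the pieces produced by the previous one. The function name `regex_chunk` and the dropping of empty pieces are assumptions made for this illustration, not the library's confirmed behavior.

```python
import re
from typing import List, Optional


def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> List[str]:
    """Split text by each pattern in sequence, dropping empty pieces."""
    patterns = patterns or [r"\n\n"]
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Assumption: whitespace-only pieces are not useful as chunks.
    return [c for c in chunks if c.strip()]


pieces = regex_chunk("A. B.\n\nC", patterns=[r"\n\n", r"\.\s+"])
```

Note that the final period of `"B."` survives because `\.\s+` only matches a period followed by whitespace.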
|
||||
|
||||
#### 2. SlidingWindowChunking
|
||||
```python
|
||||
SlidingWindowChunking(
|
||||
window_size: int = 100, # Size of the window in words
|
||||
step: int = 50, # Number of words to slide the window
|
||||
)
|
||||
```
|
||||
|
||||
Usage Example:
|
||||
```python
|
||||
chunker = SlidingWindowChunking(window_size=200, step=100)
|
||||
chunks = chunker.chunk(text) # Creates overlapping chunks of 200 words, moving 100 words at a time
|
||||
```
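The window and step semantics can be sketched in a few lines. This version is an illustration under stated assumptions: words are whitespace-separated tokens, and a trailing partial window is dropped, which may differ from the library's handling. The name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Return windows of window_size words, advancing step words each time."""
    words = text.split()
    if not words:
        return []
    if len(words) <= window_size:
        return [" ".join(words)]
    return [
        " ".join(words[start : start + window_size])
        for start in range(0, len(words) - window_size + 1, step)
    ]


windows = sliding_window_chunks("a b c d e f", window_size=4, step=2)
```

With `step` smaller than `window_size`, consecutive windows share `window_size - step` words.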
|
||||
|
||||
#### 3. OverlappingWindowChunking
|
||||
```python
|
||||
OverlappingWindowChunking(
|
||||
window_size: int = 1000, # Size of each chunk in words
|
||||
overlap: int = 100 # Number of words to overlap between chunks
|
||||
)
|
||||
```
|
||||
|
||||
Usage Example:
|
||||
```python
|
||||
chunker = OverlappingWindowChunking(window_size=500, overlap=50)
|
||||
chunks = chunker.chunk(text) # Creates 500-word chunks with 50-word overlap
|
||||
```
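Where the sliding window is parameterized by its step, this strategy is parameterized by the overlap itself. A minimal sketch, assuming whitespace-separated words and a final chunk that may be shorter than `window_size` (the function name `overlapping_chunks` is hypothetical):

```python
from typing import List


def overlapping_chunks(text: str, window_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of window_size words sharing overlap words."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # The next chunk starts overlap words before the current one ends.
        start = end - overlap
    return chunks


parts = overlapping_chunks("a b c d e f g", window_size=4, overlap=1)
```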
|
||||
175
docs/details/feature_lists.md
Normal file
@@ -0,0 +1,175 @@
|
||||
# Features
|
||||
|
||||
## Current Features
|
||||
1. Async-first architecture for high-performance web crawling
|
||||
2. Built-in anti-bot detection bypass ("magic mode")
|
||||
3. Multiple browser engine support (Chromium, Firefox, WebKit)
|
||||
4. Smart session management with automatic cleanup
|
||||
5. Automatic content cleaning and relevance scoring
|
||||
6. Built-in markdown generation with formatting preservation
|
||||
7. Intelligent image scoring and filtering
|
||||
8. Automatic popup and overlay removal
|
||||
9. Smart wait conditions (CSS/JavaScript based)
|
||||
10. Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
|
||||
11. Schema-based structured data extraction
|
||||
12. Automated iframe content processing
|
||||
13. Intelligent link categorization (internal/external)
|
||||
14. Multiple chunking strategies for large content
|
||||
15. Real-time HTML cleaning and sanitization
|
||||
16. Automatic screenshot capabilities
|
||||
17. Social media link filtering
|
||||
18. Semantic similarity-based content clustering
|
||||
19. Human behavior simulation for anti-bot bypass
|
||||
20. Proxy support with authentication
|
||||
21. Automatic resource cleanup
|
||||
22. Custom CSS selector-based extraction
|
||||
23. Automatic content relevance scoring ("fit" content)
|
||||
24. Recursive website crawling capabilities
|
||||
25. Flexible hook system for customization
|
||||
26. Built-in caching system
|
||||
27. Domain-based content filtering
|
||||
28. Dynamic content handling with JavaScript execution
|
||||
29. Automatic media content extraction and classification
|
||||
30. Metadata extraction and processing
|
||||
31. Customizable HTML to Markdown conversion
|
||||
32. Token-aware content chunking for LLM processing
|
||||
33. Automatic response header and status code handling
|
||||
34. Browser fingerprint customization
|
||||
35. Multiple extraction strategies (LLM, CSS, Cosine, XPath)
|
||||
36. Automatic error image generation for failed screenshots
|
||||
37. Smart content overlap handling for large texts
|
||||
38. Built-in rate limiting for batch processing
|
||||
39. Automatic cookie handling
|
||||
40. Browser console logging and debugging capabilities
|
||||
|
||||
## Technical Features
|
||||
• Browser Management
|
||||
- Asynchronous browser control
|
||||
- Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- Headless mode support
|
||||
- Browser cleanup and resource management
|
||||
- Custom browser arguments and configuration
|
||||
- Context management with `__aenter__` and `__aexit__`
|
||||
|
||||
• Session Handling
|
||||
- Session management with TTL (Time To Live)
|
||||
- Session reuse capabilities
|
||||
- Session cleanup for expired sessions
|
||||
- Session-based context preservation
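The session items above (TTL, reuse, cleanup of expired sessions) can be pictured with a small registry. This is a toy sketch, not the crawler's implementation: the class `SessionStore`, its method names, and the 30-minute default are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SessionStore:
    """Toy session registry with TTL-based expiry (illustrative only)."""

    def __init__(
        self,
        ttl_seconds: float = 1800.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # session_id -> (session object, time of last use)
        self._sessions: Dict[str, Tuple[Any, float]] = {}

    def get_or_create(self, session_id: str, factory: Callable[[], Any]) -> Any:
        """Reuse a live session or create one, refreshing its last-use time."""
        entry = self._sessions.get(session_id)
        value = entry[0] if entry is not None else factory()
        self._sessions[session_id] = (value, self._clock())
        return value

    def cleanup_expired(self) -> int:
        """Drop sessions idle for longer than the TTL; return how many."""
        now = self._clock()
        expired = [
            sid
            for sid, (_, last_used) in self._sessions.items()
            if now - last_used > self.ttl_seconds
        ]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Injecting the clock keeps the expiry logic testable without sleeping.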
|
||||
|
||||
• Stealth Features
|
||||
- Playwright stealth configuration
|
||||
- Navigator properties override
|
||||
- WebDriver detection evasion
|
||||
- Chrome app simulation
|
||||
- Plugin simulation
|
||||
- Language preferences simulation
|
||||
- Hardware concurrency simulation
|
||||
- Media codecs simulation
|
||||
|
||||
• Network Features
|
||||
- Proxy support with authentication
|
||||
- Custom headers management
|
||||
- Cookie handling
|
||||
- Response header capture
|
||||
- Status code tracking
|
||||
- Network idle detection
|
||||
|
||||
• Page Interaction
|
||||
- Smart wait functionality for multiple conditions
|
||||
- CSS selector-based waiting
|
||||
- JavaScript condition waiting
|
||||
- Custom JavaScript execution
|
||||
- User interaction simulation (mouse/keyboard)
|
||||
- Page scrolling
|
||||
- Timeout management
|
||||
- Load state monitoring
|
||||
|
||||
• Content Processing
|
||||
- HTML content extraction
|
||||
- Iframe processing and content extraction
|
||||
- Delayed content retrieval
|
||||
- Content caching
|
||||
- Cache file management
|
||||
- HTML cleaning and processing
|
||||
|
||||
• Image Handling
|
||||
- Screenshot capabilities (full page)
|
||||
- Base64 encoding of screenshots
|
||||
- Image dimension updating
|
||||
- Image filtering (size/visibility)
|
||||
- Error image generation
|
||||
- Natural width/height preservation
|
||||
|
||||
• Overlay Management
|
||||
- Popup removal
|
||||
- Cookie notice removal
|
||||
- Newsletter dialog removal
|
||||
- Modal removal
|
||||
- Fixed position element removal
|
||||
- Z-index based overlay detection
|
||||
- Visibility checking
|
||||
|
||||
• Hook System
|
||||
- Browser creation hooks
|
||||
- User agent update hooks
|
||||
- Execution start hooks
|
||||
- Navigation hooks (before/after goto)
|
||||
- HTML retrieval hooks
|
||||
- HTML return hooks
|
||||
|
||||
• Error Handling
|
||||
- Browser error catching
|
||||
- Network error handling
|
||||
- Timeout handling
|
||||
- Screenshot error recovery
|
||||
- Invalid selector handling
|
||||
- General exception management
|
||||
|
||||
• Performance Features
|
||||
- Concurrent URL processing
|
||||
- Semaphore-based rate limiting
|
||||
- Async gathering of results
|
||||
- Resource cleanup
|
||||
- Memory management
|
||||
|
||||
• Debug Features
|
||||
- Console logging
|
||||
- Page error logging
|
||||
- Verbose mode
|
||||
- Error message generation
|
||||
- Warning system
|
||||
|
||||
• Security Features
|
||||
- Certificate error handling
|
||||
- Sandbox configuration
|
||||
- GPU handling
|
||||
- CSP (Content Security Policy) compliant waiting
|
||||
|
||||
• Configuration
|
||||
- User agent customization
|
||||
- Viewport configuration
|
||||
- Timeout configuration
|
||||
- Browser type selection
|
||||
- Proxy configuration
|
||||
- Header configuration
|
||||
|
||||
• Data Models
|
||||
- Pydantic model for responses
|
||||
- Type hints throughout code
|
||||
- Structured response format
|
||||
- Optional response fields
|
||||
|
||||
• File System Integration
|
||||
- Cache directory management
|
||||
- File path handling
|
||||
- Cache metadata storage
|
||||
- File read/write operations
|
||||
|
||||
• Metadata Handling
|
||||
- Response headers capture
|
||||
- Status code tracking
|
||||
- Cache metadata
|
||||
- Session tracking
|
||||
- Timestamp management
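
The "Semaphore-based rate limiting" and "Async gathering of results" items under Performance Features can be illustrated with a small standalone sketch. This is not the library's implementation; `gather_with_limit` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for a real crawl call.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_with_limit(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> List[str]:
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(url: str) -> str:
        # Each task waits here until a semaphore slot is free.
        async with semaphore:
            return await fetch(url)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(limited(u) for u in urls))


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a crawl call; it only echoes the URL.
    await asyncio.sleep(0.01)
    return f"fetched:{url}"


results = asyncio.run(
    gather_with_limit(["a", "b", "c", "d"], fake_fetch, max_concurrent=2)
)
print(results)
```

The semaphore caps how many requests run at once, while `gather` keeps the results aligned with the input list.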
|
||||
|
||||
150
docs/details/features.md
Normal file
@@ -0,0 +1,150 @@
|
||||
### 1. Basic Web Crawling
|
||||
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    print(result.markdown)      # Get clean markdown content
    print(result.html)          # Get raw HTML
    print(result.cleaned_html)  # Get cleaned HTML
```
|
||||
|
||||
### 2. Browser Control Options
|
||||
- Multiple Browser Support
|
||||
```python
|
||||
# Choose between different browser engines
|
||||
crawler = AsyncWebCrawler(browser_type="firefox") # or "chromium", "webkit"
|
||||
crawler = AsyncWebCrawler(headless=False) # For visible browser
|
||||
```
|
||||
|
||||
- Proxy Configuration
|
||||
```python
|
||||
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
|
||||
# Or with authentication
|
||||
crawler = AsyncWebCrawler(proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
})
|
||||
```
|
||||
|
||||
### 3. Content Selection & Filtering
|
||||
- CSS Selector Support
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
css_selector=".main-content" # Extract specific content
|
||||
)
|
||||
```
|
||||
|
||||
- Content Filtering Options
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Minimum words per block
|
||||
excluded_tags=['form', 'header'], # Tags to exclude
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_external_images=True # Remove external images
|
||||
)
|
||||
```
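
To make the effect of `word_count_threshold` concrete, here is a minimal sketch of a word-count filter over text blocks. It assumes a block is kept when it has at least the threshold number of whitespace-separated words; whether the library compares with `>=` or `>` is not stated in this document, and `filter_blocks` is a hypothetical helper.

```python
from typing import List


def filter_blocks(blocks: List[str], word_count_threshold: int = 10) -> List[str]:
    """Keep only blocks with at least word_count_threshold words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


blocks = [
    "Home | About | Contact",  # short navigation text
    "This paragraph has enough words to count as real content for the reader.",
]
kept = filter_blocks(blocks, word_count_threshold=10)
print(kept)
```

Short fragments such as menus and button labels tend to fall below the threshold, which is why a small value already removes much of the page chrome.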
|
||||
|
||||
### 4. Dynamic Content Handling
|
||||
- JavaScript Execution
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight)" # Execute custom JS
|
||||
)
|
||||
```
|
||||
|
||||
- Wait Conditions
|
||||
```python
result = await crawler.arun(
    url="https://example.com",
    wait_for="css:.my-element",  # Wait for an element
    # Alternatively, wait for a JavaScript condition (pass only one wait_for per call):
    # wait_for="js:() => document.readyState === 'complete'"
)
```
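
The `css:` and `js:` prefixes used in `wait_for` suggest a simple dispatch on the start of the string. The sketch below shows one way such a value could be split into a kind and a payload. It is an illustration only, not the library's parsing code, and `parse_wait_for` is a hypothetical name.

```python
from typing import Tuple


def parse_wait_for(wait_for: str) -> Tuple[str, str]:
    """Split a wait_for string into (kind, payload) based on its prefix."""
    value = wait_for.strip()
    if value.startswith("css:"):
        return "css", value[len("css:"):].strip()
    if value.startswith("js:"):
        return "js", value[len("js:"):].strip()
    # This sketch requires an explicit prefix; a real implementation may
    # choose to guess the kind instead.
    raise ValueError(f"wait_for needs a 'css:' or 'js:' prefix: {wait_for!r}")


print(parse_wait_for("css:.my-element"))
print(parse_wait_for("js:() => document.readyState === 'complete'"))
```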
|
||||
|
||||
### 5. Anti-Bot Protection Handling
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
simulate_user=True, # Simulate human behavior
|
||||
override_navigator=True, # Mask automation signals
|
||||
magic=True # Enable all anti-detection features
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Session Management
|
||||
```python
|
||||
session_id = "my_session"
|
||||
result1 = await crawler.arun(url="https://example.com/page1", session_id=session_id)
|
||||
result2 = await crawler.arun(url="https://example.com/page2", session_id=session_id)
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
### 7. Media Handling
|
||||
- Screenshot Capture
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
screenshot=True
|
||||
)
|
||||
base64_screenshot = result.screenshot
|
||||
```
|
||||
|
||||
- Media Extraction
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(result.media['images']) # List of images
|
||||
print(result.media['videos']) # List of videos
|
||||
print(result.media['audios']) # List of audio files
|
||||
```
|
||||
|
||||
### 8. Structured Data Extraction
|
||||
- CSS-based Extraction
|
||||
```python
|
||||
schema = {
|
||||
"name": "News Articles",
|
||||
"baseSelector": "article",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1", "type": "text"},
|
||||
{"name": "date", "selector": ".date", "type": "text"}
|
||||
]
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=extraction_strategy
|
||||
)
|
||||
structured_data = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
- LLM-based Extraction (Multiple Providers)
|
||||
```python
|
||||
class NewsArticle(BaseModel):
|
||||
title: str
|
||||
summary: str
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron", # or "openai/gpt-4", "huggingface/..."
|
||||
api_token="your-token",
|
||||
schema=NewsArticle.schema(),
|
||||
instruction="Extract news article details..."
|
||||
)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### 9. Content Cleaning & Processing
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
remove_overlay_elements=True, # Remove popups/modals
|
||||
process_iframes=True, # Process iframe content
|
||||
)
|
||||
print(result.fit_markdown) # Get most relevant content
|
||||
print(result.fit_html) # Get most relevant HTML content
|
||||
```
|
||||
457
docs/details/features_details.md
Normal file
@@ -0,0 +1,457 @@
|
||||
|
||||
|
||||
### 1. Basic Web Crawling
|
||||
Basic web crawling provides the foundation for extracting content from websites. The library supports both simple single-page crawling and recursive website crawling.
|
||||
|
||||
```python
# Simple page crawling
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    print(result.html)          # Raw HTML
    print(result.markdown)      # Cleaned markdown
    print(result.cleaned_html)  # Cleaned HTML

# Recursive website crawling
class SimpleWebsiteScraper:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler

    async def scrape(self, start_url: str, max_depth: int):
        # scrape_recursive is not shown in this excerpt and must be implemented separately
        results = await self.scrape_recursive(start_url, max_depth)
        return results

# Usage
async with AsyncWebCrawler() as crawler:
    scraper = SimpleWebsiteScraper(crawler)
    results = await scraper.scrape("https://example.com", max_depth=2)
```
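
The recursive step itself is not defined in the snippet above. As a rough illustration of depth-limited crawling, the following sketch walks a link graph breadth-first. It is independent of the library: `crawl_breadth_first`, `get_links`, and the in-memory `SITE` graph are assumptions made for this example, with `get_links` standing in for a crawl that returns a page's internal links.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Dict, List

# Hypothetical in-memory site: page -> links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": [],
    "/c": ["/d"],
    "/d": [],
}


async def get_links(url: str) -> List[str]:
    # Stand-in for a real crawl; returns the links of a known page.
    return SITE.get(url, [])


async def crawl_breadth_first(
    start_url: str,
    fetch_links: Callable[[str], Awaitable[List[str]]],
    max_depth: int,
) -> Dict[str, int]:
    """Visit pages up to max_depth and return {url: depth_first_seen}."""
    visited: Dict[str, int] = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited[url] = depth
        for link in await fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited


pages = asyncio.run(crawl_breadth_first("/", get_links, max_depth=2))
print(pages)
```

A queue with explicit depths avoids unbounded recursion and makes the depth cutoff easy to reason about.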
|
||||
|
||||
### 2. Browser Control Options
|
||||
The library provides extensive control over browser behavior, allowing customization of browser type, headless mode, and proxy settings.
|
||||
|
||||
```python
|
||||
# Browser Type Selection
|
||||
async with AsyncWebCrawler(
|
||||
browser_type="firefox", # Options: "chromium", "firefox", "webkit"
|
||||
headless=False, # For visible browser
|
||||
verbose=True # Enable logging
|
||||
) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Proxy Configuration
|
||||
async with AsyncWebCrawler(
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
},
|
||||
headers={
|
||||
"User-Agent": "Custom User Agent",
|
||||
"Accept-Language": "en-US,en;q=0.9"
|
||||
}
|
||||
) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
### 3. Content Selection & Filtering
|
||||
The library offers multiple ways to select and filter content, from CSS selectors to word count thresholds.
|
||||
|
||||
```python
|
||||
# CSS Selector and Content Filtering
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
css_selector="article.main-content", # Extract specific content
|
||||
word_count_threshold=10, # Minimum words per block
|
||||
excluded_tags=['form', 'header'], # Tags to exclude
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_domains=["pinterest.com", "facebook.com"] # Exclude specific domains
|
||||
)
|
||||
|
||||
# Custom HTML to Text Options
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
html2text={
|
||||
"escape_dot": False,
|
||||
"links_each_paragraph": True,
|
||||
"protect_links": True
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Dynamic Content Handling
|
||||
The library provides sophisticated handling of dynamic content with JavaScript execution and wait conditions.
|
||||
|
||||
```python
|
||||
# JavaScript Execution and Wait Conditions
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code=[
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
"document.querySelector('.load-more').click();"
|
||||
],
|
||||
wait_for="css:.dynamic-content", # Wait for element
|
||||
delay_before_return_html=2.0 # Wait after JS execution
|
||||
)
|
||||
|
||||
# Smart Wait Conditions
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
wait_for="""() => {
|
||||
return document.querySelectorAll('.item').length > 10;
|
||||
}""",
|
||||
page_timeout=60000 # 60 seconds timeout
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Advanced Link Analysis
|
||||
The library provides comprehensive link analysis capabilities, distinguishing between internal and external links, with options for filtering and processing.
|
||||
|
||||
```python
|
||||
# Basic Link Analysis
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Access internal and external links
|
||||
for internal_link in result.links['internal']:
|
||||
print(f"Internal: {internal_link['href']} - {internal_link['text']}")
|
||||
|
||||
for external_link in result.links['external']:
|
||||
print(f"External: {external_link['href']} - {external_link['text']}")
|
||||
|
||||
# Advanced Link Filtering
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
exclude_external_links=True, # Remove all external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["pinterest.com"] # Specific domains to exclude
|
||||
)
|
||||
```
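
The internal/external split can be approximated with the standard library alone. The sketch below is not the library's link classifier. It assumes that "internal" means "same host as the page", so subdomains count as external, and `classify_links` is a hypothetical helper.

```python
from typing import Dict, List
from urllib.parse import urljoin, urlparse


def classify_links(base_url: str, hrefs: List[str]) -> Dict[str, List[str]]:
    """Resolve hrefs against base_url and split them by host."""
    base_host = urlparse(base_url).netloc.lower()
    links: Dict[str, List[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # handles relative and //host paths
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        key = "internal" if parsed.netloc.lower() == base_host else "external"
        if absolute not in links[key]:  # drop duplicates, keep order
            links[key].append(absolute)
    return links


found = classify_links(
    "https://example.com/blog/",
    ["/about", "post-1", "//cdn.example.org/lib.js",
     "https://external.com/x", "mailto:me@example.com", "/about"],
)
print(found)
```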
|
||||
|
||||
### 6. Anti-Bot Protection Handling
|
||||
The library includes sophisticated anti-detection mechanisms to handle websites with bot protection.
|
||||
|
||||
```python
|
||||
# Basic Anti-Detection
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
simulate_user=True, # Simulate human behavior
|
||||
override_navigator=True # Override navigator properties
|
||||
)
|
||||
|
||||
# Advanced Anti-Detection with Magic Mode
|
||||
async with AsyncWebCrawler(headless=False) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True, # Enable all anti-detection features
|
||||
remove_overlay_elements=True, # Remove popups/modals automatically
|
||||
# Custom navigator properties
|
||||
js_code="""
|
||||
Object.defineProperty(navigator, 'webdriver', {
|
||||
get: () => undefined
|
||||
});
|
||||
"""
|
||||
)
|
||||
```
|
||||
|
||||
### 7. Session Management
|
||||
Session management allows maintaining state across multiple requests and handling cookies.
|
||||
|
||||
```python
|
||||
# Basic Session Management
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "my_session"
|
||||
|
||||
# Login
|
||||
login_result = await crawler.arun(
|
||||
url="https://example.com/login",
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('form').submit();"
|
||||
)
|
||||
|
||||
# Use same session for subsequent requests
|
||||
protected_result = await crawler.arun(
|
||||
url="https://example.com/protected",
|
||||
session_id=session_id
|
||||
)
|
||||
|
||||
# Clean up session
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
|
||||
# Advanced Session with Custom Cookies
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
session_id="custom_session",
|
||||
cookies=[{
|
||||
"name": "sessionId",
|
||||
"value": "abc123",
|
||||
"domain": "example.com"
|
||||
}]
|
||||
)
|
||||
```
|
||||
|
||||
### 8. Screenshot and Media Handling
|
||||
The library provides comprehensive media handling capabilities, including screenshots and media content extraction.
|
||||
|
||||
```python
|
||||
import base64  # needed to decode result.screenshot below

# Screenshot Capture
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
screenshot=True,
|
||||
screenshot_wait_for=2.0 # Wait before taking screenshot
|
||||
)
|
||||
|
||||
# Save screenshot
|
||||
if result.screenshot:
|
||||
with open("screenshot.png", "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
|
||||
# Media Extraction
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Process images with metadata
|
||||
for image in result.media['images']:
|
||||
print(f"Image: {image['src']}")
|
||||
print(f"Alt text: {image['alt']}")
|
||||
print(f"Context: {image['desc']}")
|
||||
print(f"Relevance score: {image['score']}")
|
||||
|
||||
# Process videos and audio
|
||||
for video in result.media['videos']:
|
||||
print(f"Video: {video['src']}")
|
||||
for audio in result.media['audios']:
|
||||
print(f"Audio: {audio['src']}")
|
||||
```
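
Since `result.screenshot` is a Base64 string, saving it comes down to decoding and writing bytes. The helper below is a small standalone sketch; `save_screenshot` is a hypothetical name, and the demo payload is only a PNG signature plus filler, not a real image.

```python
import base64
import os
import tempfile


def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a Base64 string and write it to path; return bytes written."""
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)


# Demo payload: PNG signature followed by filler bytes (not a valid image).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
fake_screenshot = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")

out_path = os.path.join(tempfile.mkdtemp(), "screenshot.png")
written = save_screenshot(fake_screenshot, out_path)
print(written, out_path)
```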
|
||||
|
||||
### 9. Structured Data Extraction & Chunking
|
||||
The library supports multiple strategies for structured data extraction and content chunking.
|
||||
|
||||
```python
|
||||
# LLM-based Extraction
|
||||
class NewsArticle(BaseModel):
|
||||
title: str
|
||||
content: str
|
||||
author: str
|
||||
|
||||
extraction_strategy = LLMExtractionStrategy(
|
||||
provider='openai/gpt-4',
|
||||
api_token="your-token",
|
||||
schema=NewsArticle.schema(),
|
||||
instruction="Extract news article details",
|
||||
chunk_token_threshold=1000,
|
||||
overlap_rate=0.1
|
||||
)
|
||||
|
||||
# CSS-based Extraction
|
||||
schema = {
|
||||
"name": "Product Listing",
|
||||
"baseSelector": ".product-card",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": ".price",
|
||||
"type": "text",
|
||||
"transform": "strip"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
css_strategy = JsonCssExtractionStrategy(schema)
|
||||
|
||||
# Text Chunking
|
||||
from crawl4ai.chunking_strategy import OverlappingWindowChunking
|
||||
|
||||
chunking_strategy = OverlappingWindowChunking(
|
||||
window_size=1000,
|
||||
overlap=100
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy
|
||||
)
|
||||
```
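
To show what `window_size` and `overlap` mean in practice, here is a minimal sketch of overlapping-window chunking over a list of words. It illustrates the general technique and is not the code behind `OverlappingWindowChunking`; the function name and the word-level granularity are assumptions.

```python
from typing import List


def overlapping_windows(words: List[str], window_size: int, overlap: int) -> List[List[str]]:
    """Split words into windows of window_size that share overlap words."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")

    step = window_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the last window already reaches the end
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = overlapping_windows(words, window_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `window_size - overlap` words after the previous one, so consecutive chunks repeat the boundary words and context is not lost at the cut.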
|
||||
|
||||
|
||||
### 10. Content Cleaning & Processing
|
||||
The library provides extensive content cleaning and processing capabilities, ensuring high-quality output in various formats.
|
||||
|
||||
```python
|
||||
# Basic Content Cleaning
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
remove_overlay_elements=True, # Remove popups/modals
|
||||
process_iframes=True, # Process iframe content
|
||||
word_count_threshold=10 # Minimum words per block
|
||||
)
|
||||
|
||||
print(result.cleaned_html) # Clean HTML
|
||||
print(result.fit_html) # Most relevant HTML content
|
||||
print(result.fit_markdown) # Most relevant markdown content
|
||||
|
||||
# Advanced Content Processing
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
excluded_tags=['form', 'header', 'footer', 'nav'],
|
||||
html2text={
|
||||
"escape_dot": False,
|
||||
"body_width": 0,
|
||||
"protect_links": True,
|
||||
"unicode_snob": True,
|
||||
"ignore_links": False,
|
||||
"ignore_images": False,
|
||||
"ignore_emphasis": False,
|
||||
"bypass_tables": False,
|
||||
"ignore_tables": False
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
### Advanced Usage Patterns
|
||||
|
||||
#### 1. Combining Multiple Features
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
browser_type="chromium",
|
||||
headless=False,
|
||||
verbose=True
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
# Anti-bot measures
|
||||
magic=True,
|
||||
simulate_user=True,
|
||||
|
||||
# Content selection
|
||||
css_selector="article.main",
|
||||
word_count_threshold=10,
|
||||
|
||||
# Dynamic content handling
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="css:.dynamic-content",
|
||||
|
||||
# Content filtering
|
||||
exclude_external_links=True,
|
||||
exclude_social_media_links=True,
|
||||
|
||||
# Media handling
|
||||
screenshot=True,
|
||||
process_iframes=True,
|
||||
|
||||
# Content cleaning
|
||||
remove_overlay_elements=True
|
||||
)
|
||||
```
|
||||
|
||||
#### 2. Custom Extraction Pipeline
|
||||
```python
|
||||
# Define custom schemas and strategies
|
||||
class Article(BaseModel):
|
||||
title: str
|
||||
content: str
|
||||
date: str
|
||||
|
||||
# CSS extraction for initial content
|
||||
css_schema = {
|
||||
"name": "Article Extraction",
|
||||
"baseSelector": "article",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1", "type": "text"},
|
||||
{"name": "content", "selector": ".content", "type": "html"},
|
||||
{"name": "date", "selector": ".date", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
# LLM processing for semantic analysis
|
||||
llm_strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
api_token="your-token",
|
||||
schema=Article.schema(),
|
||||
instruction="Extract and clean article content"
|
||||
)
|
||||
|
||||
# Chunking strategy for large content
|
||||
chunking = OverlappingWindowChunking(window_size=1000, overlap=100)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# First pass: Extract structure
|
||||
css_result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=JsonCssExtractionStrategy(css_schema)
|
||||
)
|
||||
|
||||
# Second pass: Semantic processing
|
||||
llm_result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=llm_strategy,
|
||||
chunking_strategy=chunking
|
||||
)
|
||||
```
|
||||
|
||||
#### 3. Website Crawling with Custom Processing
|
||||
```python
from typing import Dict
from urllib.parse import urlparse

class CustomWebsiteCrawler:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler
        self.results = {}

    def _is_valid_link(self, href: str) -> bool:
        # Placeholder check (an assumption): accept only http(s) URLs
        return urlparse(href).scheme in ("http", "https")

    async def process_page(self, url: str) -> Dict:
        result = await self.crawler.arun(
            url=url,
            magic=True,
            word_count_threshold=10,
            exclude_external_links=True,
            process_iframes=True,
            remove_overlay_elements=True
        )

        # Process internal links
        internal_links = [
            link['href'] for link in result.links['internal']
            if self._is_valid_link(link['href'])
        ]

        # Extract media
        media_urls = [img['src'] for img in result.media['images']]

        return {
            'content': result.markdown,
            'links': internal_links,
            'media': media_urls,
            'metadata': result.metadata
        }

    async def crawl_website(self, start_url: str, max_depth: int = 2):
        visited = set()
        queue = [(start_url, 0)]

        while queue:
            url, depth = queue.pop(0)
            if depth > max_depth or url in visited:
                continue

            visited.add(url)
            self.results[url] = await self.process_page(url)

            # Enqueue newly discovered internal links one level deeper
            for link in self.results[url]['links']:
                if link not in visited:
                    queue.append((link, depth + 1))

        return self.results
```
|
||||
|
||||
282
docs/details/input_output.md
Normal file
@@ -0,0 +1,282 @@
|
||||
### AsyncWebCrawler Constructor Parameters
|
||||
```python
|
||||
AsyncWebCrawler(
|
||||
# Core Browser Settings
|
||||
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
|
||||
headless: bool = True, # Whether to run browser in headless mode
|
||||
verbose: bool = False, # Enable verbose logging
|
||||
|
||||
# Cache Settings
|
||||
always_by_pass_cache: bool = False, # Always bypass cache regardless of run settings
|
||||
base_directory: str = str(Path.home()), # Base directory for cache storage
|
||||
|
||||
# Network Settings
|
||||
proxy: str = None, # Simple proxy URL (e.g., "http://proxy.example.com:8080")
|
||||
proxy_config: Dict = None, # Advanced proxy settings with auth: {"server": str, "username": str, "password": str}
|
||||
|
||||
# Browser Behavior
|
||||
sleep_on_close: bool = False, # Wait before closing browser
|
||||
|
||||
# Other Settings passed to AsyncPlaywrightCrawlerStrategy
|
||||
user_agent: str = None, # Custom user agent string
|
||||
headers: Dict[str, str] = {}, # Custom HTTP headers
|
||||
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
|
||||
)
|
||||
```
|
||||
|
||||
### arun() Method Parameters
|
||||
```python
|
||||
arun(
|
||||
# Core Parameters
|
||||
url: str, # Required: URL to crawl
|
||||
|
||||
# Content Selection
|
||||
css_selector: str = None, # CSS selector to extract specific content
|
||||
word_count_threshold: int = MIN_WORD_THRESHOLD, # Minimum words for content blocks
|
||||
|
||||
# Cache Control
|
||||
bypass_cache: bool = False, # Bypass cache for this request
|
||||
|
||||
# Session Management
|
||||
session_id: str = None, # Session identifier for persistent browsing
|
||||
|
||||
# Screenshot Options
|
||||
screenshot: bool = False, # Take page screenshot
|
||||
screenshot_wait_for: float = None, # Wait time before screenshot
|
||||
|
||||
# Content Processing
|
||||
process_iframes: bool = False, # Process iframe content
|
||||
remove_overlay_elements: bool = False, # Remove popups/modals
|
||||
|
||||
# Anti-Bot/Detection
|
||||
simulate_user: bool = False, # Simulate human-like behavior
|
||||
override_navigator: bool = False, # Override navigator properties
|
||||
magic: bool = False, # Enable all anti-detection features
|
||||
|
||||
# Content Filtering
|
||||
excluded_tags: List[str] = None, # HTML tags to exclude
|
||||
exclude_external_links: bool = False, # Remove external links
|
||||
exclude_social_media_links: bool = False, # Remove social media links
|
||||
exclude_external_images: bool = False, # Remove external images
|
||||
exclude_social_media_domains: List[str] = None, # Additional social media domains to exclude
|
||||
remove_forms: bool = False, # Remove all form elements
|
||||
|
||||
# JavaScript Handling
|
||||
js_code: Union[str, List[str]] = None, # JavaScript to execute
|
||||
js_only: bool = False, # Only execute JavaScript without reloading page
|
||||
wait_for: str = None, # Wait condition (CSS selector or JS function)
|
||||
|
||||
# Page Loading
|
||||
page_timeout: int = 60000, # Page load timeout in milliseconds
|
||||
delay_before_return_html: float = None, # Wait before returning HTML
|
||||
|
||||
# Debug Options
|
||||
log_console: bool = False, # Log browser console messages
|
||||
|
||||
# Content Format Control
|
||||
only_text: bool = False, # Extract only text content
|
||||
keep_data_attributes: bool = False, # Keep data-* attributes in HTML
|
||||
|
||||
# Markdown Options
|
||||
include_links_on_markdown: bool = False, # Include links in markdown output
|
||||
html2text: Dict = {}, # HTML to text conversion options
|
||||
|
||||
# Extraction Strategy
|
||||
extraction_strategy: ExtractionStrategy = None, # Strategy for structured data extraction
|
||||
|
||||
# Advanced Browser Control
|
||||
user_agent: str = None, # Override user agent for this request
|
||||
)
|
||||
```
|
||||
|
||||
### Extraction Strategy Parameters
|
||||
```python
|
||||
# JsonCssExtractionStrategy
|
||||
{
|
||||
"name": str, # Name of extraction schema
|
||||
"baseSelector": str, # Base CSS selector
|
||||
"fields": [
|
||||
{
|
||||
"name": str, # Field name
|
||||
"selector": str, # CSS selector
|
||||
"type": str, # Data type ("text", etc.)
|
||||
"transform": str = None # Optional transformation
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# LLMExtractionStrategy
|
||||
{
|
||||
"provider": str, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
|
||||
"api_token": str, # API token
|
||||
"schema": dict, # Pydantic model schema
|
||||
"extraction_type": str, # Type of extraction ("schema", etc.)
|
||||
"instruction": str, # Extraction instruction
|
||||
"extra_args": dict = None, # Additional provider-specific arguments
|
||||
"extra_headers": dict = None # Additional HTTP headers
|
||||
}
|
||||
```
|
||||
|
||||
### HTML to Text Conversion Options (html2text parameter)
|
||||
```python
|
||||
{
|
||||
"escape_dot": bool = True, # Escape dots in text
|
||||
# Other html2text library options
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### CrawlResult Fields
|
||||
|
||||
```python
|
||||
class CrawlResult(BaseModel):
|
||||
# Basic Information
|
||||
url: str # The crawled URL
|
||||
# Example: "https://example.com"
|
||||
|
||||
success: bool # Whether the crawl was successful
|
||||
# Example: True/False
|
||||
|
||||
status_code: Optional[int] # HTTP status code
|
||||
# Example: 200, 404, 500
|
||||
|
||||
# Content Fields
|
||||
html: str # Raw HTML content
|
||||
# Example: "<html><body>...</body></html>"
|
||||
|
||||
cleaned_html: Optional[str] # HTML after cleaning and processing
|
||||
# Example: "<article><p>Clean content...</p></article>"
|
||||
|
||||
fit_html: Optional[str] # Most relevant HTML content after content cleaning strategy
|
||||
# Example: "<div><p>Most relevant content...</p></div>"
|
||||
|
||||
markdown: Optional[str] # HTML converted to markdown
|
||||
# Example: "# Title\n\nContent paragraph..."
|
||||
|
||||
fit_markdown: Optional[str] # Most relevant content in markdown
|
||||
# Example: "# Main Article\n\nKey content..."
|
||||
|
||||
# Media Content
|
||||
media: Dict[str, List[Dict]] = {} # Extracted media information
|
||||
# Example: {
|
||||
# "images": [
|
||||
# {
|
||||
# "src": "https://example.com/image.jpg",
|
||||
# "alt": "Image description",
|
||||
# "desc": "Contextual description",
|
||||
# "score": 5, # Relevance score
|
||||
# "type": "image"
|
||||
# }
|
||||
# ],
|
||||
# "videos": [
|
||||
# {
|
||||
# "src": "https://example.com/video.mp4",
|
||||
# "alt": "Video title",
|
||||
# "type": "video",
|
||||
# "description": "Video context"
|
||||
# }
|
||||
# ],
|
||||
# "audios": [
|
||||
# {
|
||||
# "src": "https://example.com/audio.mp3",
|
||||
# "alt": "Audio title",
|
||||
# "type": "audio",
|
||||
# "description": "Audio context"
|
||||
# }
|
||||
# ]
|
||||
# }
|
||||
|
||||
# Link Information
|
||||
links: Dict[str, List[Dict]] = {} # Extracted links
|
||||
# Example: {
|
||||
# "internal": [
|
||||
# {
|
||||
# "href": "https://example.com/page",
|
||||
# "text": "Link text",
|
||||
# "title": "Link title"
|
||||
# }
|
||||
# ],
|
||||
# "external": [
|
||||
# {
|
||||
# "href": "https://external.com",
|
||||
# "text": "External link text",
|
||||
# "title": "External link title"
|
||||
# }
|
||||
# ]
|
||||
# }
|
||||
|
||||
# Extraction Results
|
||||
extracted_content: Optional[str] # Content from extraction strategy
|
||||
# Example for JsonCssExtractionStrategy:
|
||||
# '[{"title": "Article 1", "date": "2024-03-20"}, ...]'
|
||||
# Example for LLMExtractionStrategy:
|
||||
# '{"entities": [...], "relationships": [...]}'
|
||||
|
||||
# Additional Information
|
||||
metadata: Optional[dict] = None # Page metadata
|
||||
# Example: {
|
||||
# "title": "Page Title",
|
||||
# "description": "Meta description",
|
||||
# "keywords": ["keyword1", "keyword2"],
|
||||
# "author": "Author Name",
|
||||
# "published_date": "2024-03-20"
|
||||
# }
|
||||
|
||||
screenshot: Optional[str] = None # Base64 encoded screenshot
|
||||
# Example: "iVBORw0KGgoAAAANSUhEUgAA..."
|
||||
|
||||
error_message: Optional[str] = None # Error message if crawl failed
|
||||
# Example: "Failed to load page: timeout"
|
||||
|
||||
session_id: Optional[str] = None # Session identifier
|
||||
# Example: "session_123456"
|
||||
|
||||
response_headers: Optional[dict] = None # HTTP response headers
|
||||
# Example: {
|
||||
# "content-type": "text/html",
|
||||
# "server": "nginx/1.18.0",
|
||||
# "date": "Wed, 20 Mar 2024 12:00:00 GMT"
|
||||
# }
|
||||
```
|
||||
|
||||
### Common Usage Patterns:
|
||||
|
||||
1. Basic Content Extraction:
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(result.markdown) # Clean, readable content
|
||||
print(result.cleaned_html) # Cleaned HTML
|
||||
```
|
||||
|
||||
2. Media Analysis:
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
for image in result.media["images"]:
|
||||
if image["score"] > 3: # High-relevance images
|
||||
print(f"High-relevance image: {image['src']}")
|
||||
```
|
||||
|
||||
3. Link Analysis:
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
internal_links = [link["href"] for link in result.links["internal"]]
|
||||
external_links = [link["href"] for link in result.links["external"]]
|
||||
```
|
||||
|
||||
4. Structured Data Extraction:
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=my_strategy
|
||||
)
|
||||
structured_data = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
5. Error Handling:
|
||||
```python
result = await crawler.arun(url="https://example.com")
if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
|
||||
|
||||
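Patterns 4 and 5 can be combined, since `extracted_content` is optional and is only worth parsing after a successful crawl. The sketch below is one way to do that; `parse_extracted` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in that only mimics the attribute names of the result class so the example runs without a crawler:

```python
import json
from types import SimpleNamespace
from typing import Any, Optional


def parse_extracted(result: Any) -> Optional[Any]:
    """Return the parsed extraction output, or None if it is unavailable."""
    if not result.success:
        print(f"Crawl failed: {result.error_message}")
        return None
    if not result.extracted_content:
        return None  # no extraction strategy was used, or it produced nothing
    try:
        return json.loads(result.extracted_content)
    except json.JSONDecodeError as exc:
        print(f"Extraction output was not valid JSON: {exc}")
        return None


# Stand-in for a crawl result (illustration only, not the real class).
fake = SimpleNamespace(
    success=True,
    error_message=None,
    extracted_content='[{"title": "Article 1", "date": "2024-03-20"}]',
)
print(parse_extracted(fake))
```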
67
docs/details/realworld_examples.md
Normal file
@@ -0,0 +1,67 @@
|
||||
1. **E-commerce Product Monitor**
|
||||
- Scraping product details from multiple e-commerce sites
|
||||
- Price tracking with structured data extraction
|
||||
- Handling dynamic content and anti-bot measures
|
||||
- Features: JsonCssExtraction, session management, anti-bot
|
||||
|
||||
2. **News Aggregator & Summarizer**
|
||||
- Crawling news websites
|
||||
- Content extraction and summarization
|
||||
- Topic classification
|
||||
- Features: LLMExtraction, CosineStrategy, content cleaning
|
||||
|
||||
3. **Academic Paper Research Assistant**
|
||||
- Crawling research papers from academic sites
|
||||
- Extracting citations and references
|
||||
- Building knowledge graphs
|
||||
- Features: structured extraction, link analysis, chunking
|
||||
|
||||
4. **Social Media Content Analyzer**
|
||||
- Handling JavaScript-heavy sites
|
||||
- Dynamic content loading
|
||||
- Sentiment analysis integration
|
||||
- Features: dynamic content handling, session management
|
||||
|
||||
5. **Real Estate Market Analyzer**
|
||||
- Scraping property listings
|
||||
- Processing image galleries
|
||||
- Geolocation data extraction
|
||||
- Features: media handling, structured data extraction
|
||||
|
||||
6. **Documentation Site Generator**
|
||||
- Recursive website crawling
|
||||
- Markdown generation
|
||||
- Link validation
|
||||
- Features: website crawling, content cleaning
|
||||
|
||||
7. **Job Board Aggregator**
|
||||
- Handling pagination
|
||||
- Structured job data extraction
|
||||
- Filtering and categorization
|
||||
- Features: session management, JsonCssExtraction
|
||||
|
||||
8. **Recipe Database Builder**
|
||||
- Schema-based extraction
|
||||
- Image processing
|
||||
- Ingredient parsing
|
||||
- Features: structured extraction, media handling
|
||||
|
||||
9. **Travel Blog Content Analyzer**
|
||||
- Location extraction
|
||||
- Image and map processing
|
||||
- Content categorization
|
||||
- Features: CosineStrategy, media handling
|
||||
|
||||
10. **Technical Documentation Scraper**
|
||||
- API documentation extraction
|
||||
- Code snippet processing
|
||||
- Version tracking
|
||||
- Features: content cleaning, structured extraction
|
||||
|
||||
Each example will include:
|
||||
- Problem description
|
||||
- Technical requirements
|
||||
- Complete implementation
|
||||
- Error handling
|
||||
- Output processing
|
||||
- Performance considerations
|
||||
@@ -47,8 +47,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
|
||||
"!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
|
||||
"!pip install crawl4ai\n",
|
||||
"!pip install nest-asyncio\n",
|
||||
"!playwright install"
|
||||
]
|
||||
@@ -714,7 +713,7 @@
|
||||
"provenance": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
|
||||
@@ -10,7 +10,7 @@ import time
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
from typing import Dict
|
||||
from typing import Dict, List
|
||||
from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
@@ -379,6 +379,18 @@ async def crawl_custom_browser_type():
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
async def crawl_with_user_simultion():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "YOUR-URL-HERE"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            simulate_user=True,  # Causes a series of random mouse movements and clicks to simulate user interaction
            override_navigator=True  # Overrides the navigator object to make it look like a real user
        )

        print(result.markdown)
|
||||
|
||||
async def speed_comparison():
|
||||
# print("\n--- Speed Comparison ---")
|
||||
# print("Firecrawl (simulated):")
|
||||
@@ -444,6 +456,57 @@ async def speed_comparison():
|
||||
print("If you run these tests in an environment with better network conditions,")
|
||||
print("you may observe an even more significant speed advantage for Crawl4AI.")
|
||||
|
||||
async def generate_knowledge_graph():
    class Entity(BaseModel):
        name: str
        description: str

    class Relationship(BaseModel):
        entity1: Entity
        entity2: Entity
        description: str
        relation_type: str

    class KnowledgeGraph(BaseModel):
        entities: List[Entity]
        relationships: List[Relationship]

    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',  # Or any other provider, including Ollama and open source models
        api_token=os.getenv('OPENAI_API_KEY'),  # In case of Ollama just pass "no-token"
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            # magic=True
        )
        # print(result.extracted_content)
        with open(os.path.join(__location__, "kb.json"), "w") as f:
            f.write(result.extracted_content)
|
||||
|
||||
async def fit_markdown_remove_overlay():
    async with AsyncWebCrawler(headless=False) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
            remove_overlay_elements=True,
            screenshot=True
        )
        # Save markdown to file
        with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
            f.write(result.fit_markdown)

    print("Done")
|
||||
|
||||
|
||||
async def main():
|
||||
await simple_crawl()
|
||||
await simple_example_with_running_js_code()
|
||||
@@ -455,7 +518,7 @@ async def main():
|
||||
# LLM extraction examples
|
||||
await extract_structured_data_using_llm()
|
||||
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||
await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
|
||||
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
|
||||
# You can always pass custom headers to the extraction strategy
|
||||
|
||||
45
docs/md_v1/mkdocs.yml
Normal file
@@ -0,0 +1,45 @@
|
||||
site_name: Crawl4AI Documentation
|
||||
site_description: 🔥🕷️ Crawl4AI, Open-source LLM Friendly Web Crawler & Scraper
|
||||
site_url: https://docs.crawl4ai.com
|
||||
repo_url: https://github.com/unclecode/crawl4ai
|
||||
repo_name: unclecode/crawl4ai
|
||||
docs_dir: docs/md
|
||||
nav:
|
||||
- Home: index.md
|
||||
- First Steps:
|
||||
- Introduction: introduction.md
|
||||
- Installation: installation.md
|
||||
- Quick Start: quickstart.md
|
||||
- Examples:
|
||||
- Intro: examples/index.md
|
||||
- Structured Data Extraction: examples/json_css_extraction.md
|
||||
- LLM Extraction: examples/llm_extraction.md
|
||||
- JS Execution & CSS Filtering: examples/js_execution_css_filtering.md
|
||||
- Hooks & Auth: examples/hooks_auth.md
|
||||
- Summarization: examples/summarization.md
|
||||
- Research Assistant: examples/research_assistant.md
|
||||
- Full Details of Using Crawler:
|
||||
- Crawl Request Parameters: full_details/crawl_request_parameters.md
|
||||
- Crawl Result Class: full_details/crawl_result_class.md
|
||||
- Session Based Crawling: full_details/session_based_crawling.md
|
||||
- Advanced Features: full_details/advanced_features.md
|
||||
- Advanced JsonCssExtraction: full_details/advanced_jsoncss_extraction.md
|
||||
- Chunking Strategies: full_details/chunking_strategies.md
|
||||
- Extraction Strategies: full_details/extraction_strategies.md
|
||||
- Miscellaneous:
|
||||
- Change Log: changelog.md
|
||||
- Contact: contact.md
|
||||
|
||||
theme:
|
||||
name: terminal
|
||||
palette: dark
|
||||
|
||||
# Add the css/extra.css
|
||||
extra_css:
|
||||
- assets/styles.css
|
||||
- assets/highlight.css
|
||||
- assets/dmvendor.css
|
||||
|
||||
extra_javascript:
|
||||
- assets/highlight.min.js
|
||||
- assets/highlight_init.js
|
||||
@@ -147,8 +147,8 @@ async def main():
|
||||
url="https://openai.com/api/pricing/",
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
provider="openai/gpt-4o", # Or use open source model like "ollama/nemotron"
|
||||
api_token=os.getenv("OPENAI_API_KEY"), # Pass "no-token" if using Ollama
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
@@ -196,11 +196,11 @@ In modern web applications, content is often loaded dynamically without changing
|
||||
|
||||
Here's what makes this approach powerful:
|
||||
|
||||
1. **Session Preservation**: By using a `session_id`, we can maintain the state of our crawling session across multiple interactions with the page. This is crucial for navigating through dynamically loaded content.
|
||||
1.**Session Preservation**: By using a `session_id`, we can maintain the state of our crawling session across multiple interactions with the page. This is crucial for navigating through dynamically loaded content.
|
||||
|
||||
2. **Asynchronous JavaScript Execution**: We can execute custom JavaScript to trigger content loading or navigation. In this example, we'll click a "Load More" button to fetch the next page of commits.
|
||||
2.**Asynchronous JavaScript Execution**: We can execute custom JavaScript to trigger content loading or navigation. In this example, we'll click a "Load More" button to fetch the next page of commits.
|
||||
|
||||
3. **Dynamic Content Waiting**: The `wait_for` parameter allows us to specify a condition that must be met before considering the page load complete. This ensures we don't extract data before the new content is fully loaded.
|
||||
3.**Dynamic Content Waiting**: The `wait_for` parameter allows us to specify a condition that must be met before considering the page load complete. This ensures we don't extract data before the new content is fully loaded.
|
||||
|
||||
Let's see how this works with a real-world example: crawling multiple pages of commits on a GitHub repository. The URL doesn't change as we load more commits, so we'll use these advanced techniques to navigate and extract data.
|
||||
|
||||
223
docs/md_v2/advanced/content-processing.md
Normal file
@@ -0,0 +1,223 @@
|
||||
# Content Processing
|
||||
|
||||
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
|
||||
|
||||
## Content Cleaning
|
||||
|
||||
### Understanding Clean Content
|
||||
When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
|
||||
|
||||
1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
|
||||
2. **Content Relevance**: Identifies and preserves meaningful content blocks
|
||||
3. **Layout Analysis**: Understands page structure to identify main content areas
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Remove blocks with fewer words
|
||||
excluded_tags=['form', 'nav'], # Remove specific HTML tags
|
||||
remove_overlay_elements=True # Remove popups/modals
|
||||
)
|
||||
|
||||
# Get clean content
|
||||
print(result.cleaned_html) # Cleaned HTML
|
||||
print(result.markdown) # Clean markdown version
|
||||
```
|
||||
|
||||
### Fit Markdown: Smart Content Extraction
|
||||
One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
|
||||
|
||||
#### How Fit Markdown Works
|
||||
- Analyzes content density and distribution
|
||||
- Identifies content patterns and structures
|
||||
- Removes boilerplate content (headers, footers, sidebars)
|
||||
- Preserves the most relevant content blocks
|
||||
- Maintains content hierarchy and formatting
|
||||
|
||||
#### Perfect For:
|
||||
- Blog posts and articles
|
||||
- News content
|
||||
- Documentation pages
|
||||
- Any page with a clear main content area
|
||||
|
||||
#### Not Recommended For:
|
||||
- E-commerce product listings
|
||||
- Search results pages
|
||||
- Social media feeds
|
||||
- Pages with multiple equal-weight content sections
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Get the most relevant content
|
||||
main_content = result.fit_markdown
|
||||
|
||||
# Compare with regular markdown
|
||||
all_content = result.markdown
|
||||
|
||||
print(f"Fit Markdown Length: {len(main_content)}")
|
||||
print(f"Regular Markdown Length: {len(all_content)}")
|
||||
```
|
||||
|
||||
#### Example Use Case
|
||||
```python
async def extract_article_content(url: str) -> str:
    """Extract main article content from a blog or news site."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)

        # fit_markdown will focus on the article content,
        # excluding navigation, ads, and other distractions
        return result.fit_markdown
```
|
||||
|
||||
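Because `fit_markdown` is not recommended for every page type, it can help to fall back to the regular markdown when the fitted version comes back empty or much shorter than expected. The following is a rough sketch of that idea; `best_markdown` and the `min_ratio` threshold are hypothetical and would need tuning for real content:

```python
from types import SimpleNamespace
from typing import Any


def best_markdown(result: Any, min_ratio: float = 0.1) -> str:
    """Prefer fit_markdown, but fall back to the full markdown when the
    fitted version is missing or suspiciously short."""
    full = result.markdown or ""
    fitted = getattr(result, "fit_markdown", None) or ""
    if not fitted:
        return full
    # A fitted result far smaller than the page may have dropped real content.
    if full and len(fitted) / len(full) < min_ratio:
        return full
    return fitted


# Stand-in result object (illustration only).
demo = SimpleNamespace(markdown="x" * 100, fit_markdown="y" * 50)
print(len(best_markdown(demo)))
```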
## Media Processing
|
||||
|
||||
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
|
||||
|
||||
### Image Processing
|
||||
The library handles various image scenarios, including:
|
||||
- Regular images
|
||||
- Lazy-loaded images
|
||||
- Background images
|
||||
- Responsive images
|
||||
- Image metadata and context
|
||||
|
||||
```python
result = await crawler.arun(url="https://example.com")

for image in result.media["images"]:
    # Each image includes rich metadata
    print(f"Source: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Context: {image['context']}")  # Surrounding text
    print(f"Relevance score: {image['score']}")  # 0-10 score
```
|
||||
|
||||
### Handling Lazy-Loaded Content
|
||||
Crawl4AI already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
wait_for="css:img[data-src]", # Wait for lazy images
|
||||
delay_before_return_html=2.0 # Additional wait time
|
||||
)
|
||||
```
|
||||
|
||||
### Video and Audio Content
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
# Process videos
for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")

# Process audio
for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")
```
|
||||
|
||||
## Link Analysis
|
||||
|
||||
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
- Social media links
|
||||
- Navigation links
|
||||
- Content links
|
||||
|
||||
```python
result = await crawler.arun(url="https://example.com")

# Analyze internal links
for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.

# Analyze external links
for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")
```
|
||||
|
||||
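To make the internal/external distinction concrete, here is a simplified sketch of a same-domain check built on the standard library. It is not Crawl4AI's implementation: `classify_link` is a hypothetical helper that compares hostnames only (ignoring a leading `www.`), so subdomains such as `blog.example.com` would count as external here.

```python
from urllib.parse import urljoin, urlparse


def _host(url: str) -> str:
    """Hostname without a leading 'www.' (empty string if there is none)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(href: str, page_url: str) -> str:
    """Label a link 'internal' or 'external' relative to page_url."""
    # urljoin resolves relative, anchor-only, and protocol-relative hrefs.
    absolute = urljoin(page_url, href)
    return "internal" if _host(absolute) == _host(page_url) else "external"


print(classify_link("/about", "https://example.com/blog"))
print(classify_link("//cdn.other.com/x.js", "https://example.com"))
```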
### Smart Link Filtering
|
||||
Control which links are included in the results:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["ads.example.com"] # Exclude specific domains
|
||||
)
|
||||
```
|
||||
|
||||
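The parameters above filter links during the crawl. If a crawl has already run without them, a similar filter can be applied afterwards to the link dictionaries. This is a sketch only: `drop_domains` is a hypothetical helper, and its matching rule (exact host or any subdomain) is an assumption rather than the rule the library applies.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def drop_domains(links: List[Dict], excluded: Iterable[str]) -> List[Dict]:
    """Return links whose host is not an excluded domain or a subdomain of one."""
    blocked = tuple(domain.lower() for domain in excluded)

    def is_blocked(href: str) -> bool:
        host = (urlparse(href).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in blocked)

    return [link for link in links if not is_blocked(link.get("href", ""))]


sample_links = [
    {"href": "https://www.facebook.com/page"},
    {"href": "https://example.org/post"},
]
print(drop_domains(sample_links, ["facebook.com"]))
```

With a real result this would be applied as `drop_domains(result.links["external"], ["facebook.com", "twitter.com"])`.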
## Metadata Extraction
|
||||
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
metadata = result.metadata
|
||||
print(f"Title: {metadata['title']}")
|
||||
print(f"Description: {metadata['description']}")
|
||||
print(f"Keywords: {metadata['keywords']}")
|
||||
print(f"Author: {metadata['author']}")
|
||||
print(f"Published Date: {metadata['published_date']}")
|
||||
print(f"Modified Date: {metadata['modified_date']}")
|
||||
print(f"Language: {metadata['language']}")
|
||||
```
|
||||
|
||||
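Not every page provides every field, and `metadata` itself is declared as optional on the result object, so direct indexing can raise an error. A defensive sketch, assuming the field names used in the example above; `summarize_metadata` is a hypothetical helper:

```python
from typing import Any, Dict, Optional

# Field names taken from the example above; a page may supply only some of them.
FIELDS = ("title", "description", "keywords", "author",
          "published_date", "modified_date", "language")


def summarize_metadata(metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return only the known fields that are present and non-empty."""
    if not metadata:
        return {}
    return {key: metadata[key] for key in FIELDS if metadata.get(key)}


print(summarize_metadata({"title": "Page Title", "author": ""}))
```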
## Best Practices
|
||||
|
||||
1. **Use Fit Markdown for Articles**
|
||||
```python
|
||||
# Perfect for blog posts, news articles, documentation
|
||||
content = result.fit_markdown
|
||||
```
|
||||
|
||||
2. **Handle Media Appropriately**
|
||||
```python
|
||||
# Filter by relevance score
|
||||
relevant_images = [
|
||||
img for img in result.media["images"]
|
||||
if img['score'] > 5
|
||||
]
|
||||
```
|
||||
|
||||
3. **Combine Link Analysis with Content**
|
||||
```python
|
||||
# Get content links with context
|
||||
content_links = [
|
||||
link for link in result.links["internal"]
|
||||
if link['type'] == 'content'
|
||||
]
|
||||
```
|
||||
|
||||
4. **Clean Content with Purpose**
|
||||
```python
|
||||
# Customize cleaning based on your needs
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=20, # Adjust based on content type
|
||||
keep_data_attributes=False, # Remove data attributes
|
||||
process_iframes=True # Include iframe content
|
||||
)
|
||||
```
|
||||
110
docs/md_v2/advanced/hooks-auth.md
Normal file
@@ -0,0 +1,110 @@
|
||||
# Hooks & Auth for AsyncWebCrawler
|
||||
|
||||
Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
|
||||
|
||||
## Example: Using Crawler Hooks with AsyncWebCrawler
|
||||
|
||||
Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
|
||||
|
||||
1. Configure the browser when it's created.
|
||||
2. Add custom headers before navigating to the URL.
|
||||
3. Log the current URL after navigation.
|
||||
4. Perform actions after JavaScript execution.
|
||||
5. Log the length of the HTML before returning it.
|
||||
|
||||
### Hook Definitions
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser

async def on_browser_created(browser: Browser):
    print("[HOOK] on_browser_created")
    # Example customization: set browser viewport size
    context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = await context.new_page()

    # Example customization: logging in to a hypothetical website
    await page.goto('https://example.com/login')
    await page.fill('input[name="username"]', 'testuser')
    await page.fill('input[name="password"]', 'password123')
    await page.click('button[type="submit"]')
    await page.wait_for_selector('#welcome')

    # Add a custom cookie
    await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])

    await page.close()
    await context.close()

async def before_goto(page: Page):
    print("[HOOK] before_goto")
    # Example customization: add custom headers
    await page.set_extra_http_headers({'X-Test-Header': 'test'})

async def after_goto(page: Page):
    print("[HOOK] after_goto")
    # Example customization: log the URL
    print(f"Current URL: {page.url}")

async def on_execution_started(page: Page):
    print("[HOOK] on_execution_started")
    # Example customization: perform actions after JS execution
    await page.evaluate("console.log('Custom JS executed')")

async def before_return_html(page: Page, html: str):
    print("[HOOK] before_return_html")
    # Example customization: log the HTML length
    print(f"HTML length: {len(html)}")
    return page
```
|
||||
|
||||
### Using the Hooks with the AsyncWebCrawler
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def main():
    print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")

    crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
    crawler_strategy.set_hook('on_browser_created', on_browser_created)
    crawler_strategy.set_hook('before_goto', before_goto)
    crawler_strategy.set_hook('after_goto', after_goto)
    crawler_strategy.set_hook('on_execution_started', on_execution_started)
    crawler_strategy.set_hook('before_return_html', before_return_html)

    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
        )

    print("📦 Crawler Hooks result:")
    print(result)

asyncio.run(main())
```
|
||||
|
||||
### Explanation
|
||||
|
||||
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
|
||||
- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
|
||||
- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
|
||||
- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
|
||||
- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
|
||||
|
||||
### Additional Ideas
|
||||
|
||||
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
|
||||
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
|
||||
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
|
||||
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
|
||||
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
|
||||
|
||||
By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.
|
||||
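As an illustration of the content verification idea, the sketch below builds an `after_goto` hook that records pages where an expected element is missing. `make_content_check` and `FakePage` are hypothetical names; the fake page stands in for a Playwright `Page` (only `url` and `query_selector` are used) so the example runs without a browser.

```python
import asyncio
from typing import Callable, List


def make_content_check(selector: str, missing: List[str]) -> Callable:
    """Build an after_goto hook that notes pages lacking `selector`."""
    async def after_goto(page):
        # query_selector resolves to None when nothing matches the selector.
        if await page.query_selector(selector) is None:
            print(f"[HOOK] after_goto: '{selector}' not found on {page.url}")
            missing.append(page.url)
    return after_goto


# Stand-in for a Playwright Page (illustration only).
class FakePage:
    url = "https://example.com"

    async def query_selector(self, selector: str):
        return None  # pretend the element is absent


missing_pages: List[str] = []
asyncio.run(make_content_check("#welcome", missing_pages)(FakePage()))
print(missing_pages)
```

In a real crawl the hook would be registered the same way as the others, for example `crawler_strategy.set_hook('after_goto', make_content_check('#welcome', missing_pages))`.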
0
docs/md_v2/advanced/hooks.md
Normal file
52
docs/md_v2/advanced/magic-mode.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# Magic Mode & Anti-Bot Protection
|
||||
|
||||
Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.
|
||||
|
||||
## Magic Mode
|
||||
|
||||
The easiest way to bypass anti-bot protections:
|
||||
|
||||
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all anti-detection features
    )
```
|
||||
|
||||
Magic Mode automatically:
|
||||
- Masks browser automation signals
|
||||
- Simulates human-like behavior
|
||||
- Overrides navigator properties
|
||||
- Handles cookie consent popups
|
||||
- Manages browser fingerprinting
|
||||
- Randomizes timing patterns
|
||||
|
||||
## Manual Anti-Bot Options
|
||||
|
||||
While Magic Mode is recommended, you can also configure individual anti-detection features:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
simulate_user=True, # Simulate human behavior
|
||||
override_navigator=True # Mask automation signals
|
||||
)
|
||||
```
|
||||
|
||||
Note: When `magic=True` is used, you don't need to set these individual options.
|
||||
|
||||
## Example: Handling Protected Sites
|
||||
|
||||
```python
async def crawl_protected_site(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout for protection checks
        )

        return result.markdown if result.success else None
```
|
||||
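One possible variation, sketched under assumptions, is to try a plain crawl first and only retry with Magic Mode when it fails. Whether that ordering suits a given site is a judgment call. `crawl_with_fallback` and `FakeCrawler` are hypothetical; the fake crawler only imitates the `arun` call and the `success`/`markdown` attributes so the example runs offline.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Optional


async def crawl_with_fallback(crawler: Any, url: str) -> Optional[str]:
    """Try a plain crawl first; retry once with Magic Mode if it fails."""
    result = await crawler.arun(url=url)
    if not result.success:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,
            page_timeout=60000,
        )
    return result.markdown if result.success else None


# Stand-in crawler (illustration only): fails unless magic=True is passed.
class FakeCrawler:
    async def arun(self, url: str, **kwargs):
        ok = kwargs.get("magic", False)
        return SimpleNamespace(success=ok, markdown="# Page" if ok else None)


print(asyncio.run(crawl_with_fallback(FakeCrawler(), "https://example.com")))
```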
Some files were not shown because too many files have changed in this diff.