Compare commits

20 Commits

Author SHA1 Message Date
UncleCode
0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
2024-11-08 19:02:28 +08:00
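The filter side of this commit (pattern and domain filters combined in a chain that tracks statistics) can be illustrated with a minimal sketch. All names below (`glob_filter`, `domain_filter`, `SimpleFilterChain`) are hypothetical stand-ins chosen for illustration; they are not the classes added by the commit, whose actual signatures are not shown here.

```python
import fnmatch
from dataclasses import dataclass
from typing import Callable, List, Set
from urllib.parse import urlparse

# Hypothetical type alias: a filter is any callable that accepts or rejects a URL.
UrlFilter = Callable[[str], bool]


def glob_filter(*patterns: str) -> UrlFilter:
    """Accept a URL if it matches at least one glob pattern (case-sensitive)."""
    return lambda url: any(fnmatch.fnmatchcase(url, p) for p in patterns)


def domain_filter(allowed: Set[str]) -> UrlFilter:
    """Accept a URL only if its host is in the allowed set."""
    return lambda url: urlparse(url).hostname in allowed


@dataclass
class SimpleFilterChain:
    """Apply filters in order and count how many URLs passed or were rejected."""
    filters: List[UrlFilter]
    passed: int = 0
    rejected: int = 0

    def apply(self, url: str) -> bool:
        # all() short-circuits, so later filters are skipped after the first rejection.
        ok = all(f(url) for f in self.filters)
        if ok:
            self.passed += 1
        else:
            self.rejected += 1
        return ok


chain = SimpleFilterChain([glob_filter("*/blog/*"), domain_filter({"example.com"})])
for candidate in (
    "https://example.com/blog/post-1",
    "https://other.org/blog/post-1",
    "https://example.com/about",
):
    chain.apply(candidate)
```

Ordering cheap filters first keeps the chain inexpensive, since a rejected URL never reaches the later filters.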
UncleCode
bae4665949 feat(scraper): Enhance URL filtering and scoring systems
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.

- Quick Start is created and added
2024-11-08 18:45:12 +08:00
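The scorer side of the commit message (keyword relevance, path depth, and a composite that combines strategies) might look roughly like the following. This is a sketch under stated assumptions: the function names, the weighted-sum combination, and the depth-decay formula are all illustrative choices, not the commit's implementation.

```python
from typing import Callable, List, Tuple
from urllib.parse import urlparse

# Hypothetical type alias: a scorer maps a URL to a number, higher meaning "crawl sooner".
Scorer = Callable[[str], float]


def keyword_scorer(keywords: List[str]) -> Scorer:
    """Fraction of the keywords that appear in the URL (case-insensitive)."""
    def score(url: str) -> float:
        lowered = url.lower()
        hits = sum(1 for k in keywords if k.lower() in lowered)
        return hits / len(keywords) if keywords else 0.0
    return score


def path_depth_scorer(optimal_depth: int) -> Scorer:
    """1.0 at the optimal path depth, decaying as the depth moves away from it."""
    def score(url: str) -> float:
        depth = len([part for part in urlparse(url).path.split("/") if part])
        return 1.0 / (1.0 + abs(depth - optimal_depth))
    return score


def composite_scorer(weighted: List[Tuple[Scorer, float]]) -> Scorer:
    """Combine scorers as a weighted sum (one possible combination strategy)."""
    return lambda url: sum(weight * scorer(url) for scorer, weight in weighted)


score = composite_scorer([
    (keyword_scorer(["python", "docs"]), 0.5),
    (path_depth_scorer(optimal_depth=2), 0.5),
])
ranked = sorted(
    ["https://example.com/a/b/c/d", "https://example.com/docs/python"],
    key=score,
    reverse=True,
)
```

A crawler could use such a score as the priority when deciding which queued URL to fetch next.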
UncleCode
d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration
- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links)
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
2024-11-08 15:57:23 +08:00
UncleCode
3d1c9a8434 Reviewing the BFS strategy. 2024-11-07 18:54:53 +08:00
UncleCode
be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. 2024-11-06 21:09:47 +08:00
UncleCode
06b21dcc50 Update .gitignore to include new directories for issues and documentation 2024-11-06 18:44:03 +08:00
UncleCode
0f0f60527d Merge pull request #172 from aravindkarnam/scraper
Scraper
2024-11-06 07:00:44 +01:00
Aravind Karnam
8105fd178e Removed stubs for remove_from_future_crawls, since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url), since retry with exponential backoff via tenacity takes care of it. 2024-10-17 15:42:43 +05:30
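The commit delegates retries to tenacity. To show the underlying idea without depending on that library, here is a standard-library-only sketch of retry with exponential backoff. The helper name `retry_with_backoff` and its parameters are hypothetical, and this is not how tenacity is wired into the scraper.

```python
import asyncio
import random


async def retry_with_backoff(fetch, url, attempts=3, base_delay=0.01):
    """Await fetch(url), retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on each attempt; random jitter spreads out retries.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))


# Usage with a fake fetcher that fails twice before succeeding.
calls = []


async def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(retry_with_backoff(flaky_fetch, "https://example.com"))
```

Because the retry loop lives inside the fetch path, a separate retry queue is unnecessary, which matches the reasoning given in the commit message.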
Aravind Karnam
ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches
2. Moved visited.add(url) to before the task is put in the queue rather than after the crawl is completed. This ensures that duplicate crawls don't happen when the same URL is found at a different depth and gets queued again because the first crawl has not yet completed and the visited set has not been updated.
3. Renamed the yield_results attribute to stream, since that name is commonly used in other AI libraries for intermediate results.
2024-10-17 12:25:17 +05:30
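The first two points of this commit can be sketched together: `asyncio.wait` with `FIRST_COMPLETED` lets results be yielded as each task finishes, and marking a URL as visited at enqueue time prevents a second task for the same URL. The function `bfs_stream` and the `fetch_links` callback are hypothetical names for illustration, not the strategy's real interface.

```python
import asyncio


async def bfs_stream(start, fetch_links, max_depth=2):
    """Yield (url, depth) as each crawl finishes instead of waiting for a whole batch."""
    visited = {start}  # mark on enqueue, not on completion

    async def crawl(url, depth):
        return url, depth, await fetch_links(url)

    pending = {asyncio.create_task(crawl(start, 0))}
    while pending:
        # Returns as soon as at least one task is done; the rest stay pending.
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            url, depth, links = task.result()
            yield url, depth
            if depth >= max_depth:
                continue
            for link in links:
                if link not in visited:
                    # Added before scheduling, so a second discovery of the same
                    # link (from another page or depth) is ignored.
                    visited.add(link)
                    pending.add(asyncio.create_task(crawl(link, depth + 1)))


# Usage with an in-memory link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}


async def fake_fetch_links(url):
    await asyncio.sleep(0)
    return graph[url]


async def collect():
    return [url async for url, _ in bfs_stream("a", fake_fetch_links, max_depth=3)]


seen = asyncio.run(collect())
```

With `gather`, nothing would be yielded until every task in the batch had finished; with `wait`, the consumer sees each page as soon as it is ready.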
Aravind Karnam
de28b59aca removed unused imports 2024-10-16 22:36:48 +05:30
Aravind Karnam
04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy 2024-10-16 22:34:54 +05:30
Aravind Karnam
2943feeecf 1. Added a flag to yield each crawl result as it becomes ready, with the final scraper result as another option
2. Removed ascrape_many method, as I'm currently not focusing on it in the first cut of scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
2024-10-16 22:05:29 +05:30
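The robots.txt error handling mentioned in point 3 could take a shape like the one below, using the standard library's `urllib.robotparser`. The commit does not say what the fallback is when robots.txt cannot be fetched or parsed; allowing the crawl in that case is an assumption made only for this sketch, and both helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Return a parser for a robots.txt body, or None if the body is unavailable."""
    if robots_txt is None:  # the fetch failed upstream
        return None
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def can_crawl(parser, url, user_agent="*"):
    """Check a URL against the rules, tolerating a missing robots.txt."""
    # Assumption: with no usable robots.txt, crawling is allowed.
    if parser is None:
        return True
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
parser = build_robot_parser(rules)
allowed_public = can_crawl(parser, "https://example.com/public/page")
allowed_private = can_crawl(parser, "https://example.com/private/page")
allowed_without_rules = can_crawl(build_robot_parser(None), "https://example.com/private/page")
```

A stricter crawler could return `False` in the fallback branch instead; which default is appropriate depends on the politeness policy.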
Aravind Karnam
8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter 2024-10-16 15:59:37 +05:30
aravind
159bd875bd Merge pull request #5 from aravindkarnam/main
Merging 0.3.6
2024-10-16 10:41:22 +05:30
Aravind Karnam
d743adac68 Fixed some bugs in robots.txt processing 2024-10-03 15:58:57 +05:30
Aravind Karnam
7fe220dbd5 1. Introduced a bool flag in the ascrape method to switch between sequential and concurrent processing
2. Introduced a dictionary for depth tracking across various tasks
3. Removed the redundant crawled_urls variable. The returned object now builds its list from the visited set instead.
2024-10-03 11:17:11 +05:30
aravind
65e013d9d1 Merge pull request #3 from aravindkarnam/main
Merging latest changes from main branch
2024-10-03 09:52:12 +05:30
Aravind Karnam
7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt 2024-09-19 12:34:12 +05:30
aravind
78f26ac263 Merge pull request #2 from aravindkarnam/staging
Staging
2024-09-18 18:16:23 +05:30
Aravind Karnam
44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy 2024-09-09 13:13:34 +05:30
146 changed files with 2524 additions and 11192 deletions

2
.gitignore vendored

@@ -206,4 +206,6 @@ git_issues.py
git_issues.md
.tests/
.issues/
.docs/
.issues/


@@ -1,212 +1,5 @@
# Changelog
## [v0.3.73] - 2024-10-24
### Added
- Smart overlay removal system in AsyncPlaywrightCrawlerStrategy:
  - Automatic removal of popups, modals, and cookie notices
  - Detection and removal of fixed/sticky position elements
  - Cleaning of empty block elements
  - Configurable via `remove_overlay_elements` parameter
- Enhanced screenshot capabilities:
  - Added `screenshot_wait_for` parameter to control timing
  - Improved screenshot handling with existing page context
  - Better error handling with fallback error images
- New URL normalization utilities:
  - `normalize_url` function for consistent URL formatting
  - `is_external_url` function for better link classification
- Custom base directory support for cache storage:
  - New `base_directory` parameter in AsyncWebCrawler
  - Allows specifying alternative locations for `.crawl4ai` folder
### Enhanced
- Link handling improvements:
  - Better duplicate link detection
  - Enhanced internal/external link classification
  - Improved handling of special URL protocols
  - Support for anchor links and protocol-relative URLs
- Configuration refinements:
  - Streamlined social media domain list
  - More focused external content filtering
- LLM extraction strategy:
  - Added support for separate API base URL via `api_base` parameter
  - Better handling of base URLs in configuration
### Fixed
- Screenshot functionality:
  - Resolved issues with screenshot timing and context
  - Improved error handling and recovery
- Link processing:
  - Fixed URL normalization edge cases
  - Better handling of invalid URLs
  - Improved error messages for link processing failures
### Developer Notes
- The overlay removal system uses advanced JavaScript injection for better compatibility
- URL normalization handles special cases like mailto:, tel:, and protocol-relative URLs
- Screenshot system now reuses existing page context for better performance
- Link processing maintains separate dictionaries for internal and external links to ensure uniqueness
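
The URL handling described in this release (special schemes such as mailto: and tel:, protocol-relative URLs, internal versus external classification) can be illustrated with the standard library. The helpers below are hypothetical sketches named `normalize_link` and `is_external`; they are not the library's `normalize_url` and `is_external_url`, whose signatures are not shown in this changelog.

```python
from urllib.parse import urljoin, urlparse

# Schemes that identify non-crawlable links; the exact list is an assumption.
SPECIAL_SCHEMES = ("mailto:", "tel:", "javascript:", "data:")


def normalize_link(href, base_url):
    """Resolve href against base_url; special-scheme links are returned unchanged."""
    href = href.strip()
    if href.lower().startswith(SPECIAL_SCHEMES):
        return href
    # urljoin handles relative paths, anchors, and protocol-relative URLs ("//host/path").
    return urljoin(base_url, href)


def is_external(url, base_url):
    """True when the URL has a host and that host differs from the base URL's host."""
    host = urlparse(url).hostname
    return host is not None and host != urlparse(base_url).hostname


base = "https://example.com/docs/"
links = [normalize_link(h, base) for h in ("//cdn.example.net/a.js", "../about", "mailto:hi@example.com")]
internal = {u for u in links if urlparse(u).hostname and not is_external(u, base)}
external = {u for u in links if is_external(u, base)}
```

Collecting links into sets (or dictionaries keyed by URL, as the developer note describes) keeps each normalized link unique.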
## [v0.3.72] - 2024-10-22
### Added
- New `ContentCleaningStrategy` class:
  - Smart content extraction based on text density and element scoring
  - Automatic removal of boilerplate content
  - DOM tree analysis for better content identification
  - Configurable thresholds for content detection
- Advanced proxy support:
  - Added `proxy_config` option for authenticated proxy connections
  - Support for username/password in proxy configuration
- New content output formats:
  - `fit_markdown`: Optimized markdown output with main content focus
  - `fit_html`: Clean HTML with only essential content
### Enhanced
- Image source detection:
  - Support for multiple image source attributes (`src`, `data-src`, `srcset`, etc.)
  - Automatic fallback through potential source attributes
  - Smart handling of srcset attribute
- External content handling:
  - Made external link exclusion optional (disabled by default)
  - Improved detection and handling of social media links
  - Better control over external image filtering
### Fixed
- Image extraction reliability with multiple source attribute checks
- External link and image handling logic for better accuracy
### Developer Notes
- The new `ContentCleaningStrategy` uses configurable thresholds for customization
- Proxy configuration now supports more complex authentication scenarios
- Content extraction process now provides both regular and optimized outputs
## [v0.3.72] - 2024-10-20
### Fixed
- Added support for parsing Base64 encoded images in WebScrappingStrategy
### Added
- Forked and integrated a customized version of the html2text library for more control over Markdown generation
- New configuration options for controlling external content:
- Ability to exclude all external links
- Option to specify domains to exclude (default includes major social media platforms)
- Control over excluding external images
### Changed
- Improved Markdown generation process:
- Added fine-grained control over character escaping in Markdown output
- Enhanced handling of code blocks and pre-formatted text
- Updated `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500)
- Enhanced flexibility in `CosineStrategy` with a more generic `load_HF_embedding_model` function
### Improved
- Optimized content scraping and processing for better efficiency
- Enhanced error handling and logging in various components
### Developer Notes
- The customized html2text library is now located within the crawl4ai package
- New configuration options are available in the `config.py` file for external content handling
- The `WebScrappingStrategy` class has been updated to accommodate new external content exclusion options
## [v0.3.71] - 2024-10-19
### Added
- New chunking strategies:
- `OverlappingWindowChunking`: Allows for overlapping chunks of text, useful for maintaining context between chunks.
- Enhanced `SlidingWindowChunking`: Improved to handle edge cases and last chunks more effectively.
### Changed
- Updated `CHUNK_TOKEN_THRESHOLD` in config to 2048 tokens (2^11) for better compatibility with most LLM models.
- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
- Enhanced flexibility in `CosineStrategy`:
- Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
### Fixed
- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
### Developer Notes
- Added more comprehensive docstrings to chunking strategies for better code documentation.
- Removed hardcoded device setting in `CosineStrategy`, now using the automatically detected device.
- Added a new example in `quickstart_async.py` for generating a knowledge graph from crawled content.
These updates aim to provide more flexibility in text processing, improve performance, and enhance the overall capabilities of the crawl4ai library. The new chunking strategies, in particular, offer more options for handling large texts in various scenarios.
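
To make the overlapping-window idea concrete, here is a minimal word-based sketch. The function `overlapping_chunks` and its parameters are hypothetical and chosen for illustration; the library's `OverlappingWindowChunking` class may differ in interface and in how it tokenizes text.

```python
def overlapping_chunks(text, window_size, overlap):
    """Split text into word windows where consecutive windows share `overlap` words."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    words = text.split()
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = overlapping_chunks("a b c d e f g", window_size=4, overlap=2)
```

The early `break` avoids emitting a trailing chunk that is fully contained in the previous one, which is one way to handle the "last chunk" edge case the changelog mentions.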
## [v0.3.71] - 2024-10-18
### Changes
1. **Version Update**:
- Updated version number from 0.3.7 to 0.3.71.
2. **Crawler Enhancements**:
- Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
- Improved context creation with additional options:
- Enabled `accept_downloads` and `java_script_enabled`.
- Added a cookie to enable cookies by default.
3. **Error Handling Improvements**:
- Enhanced error messages in AsyncWebCrawler's `arun` method.
- Updated error reporting format for better visibility and consistency.
4. **Performance Optimization**:
- Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
### Documentation
- Updated quickstart notebook:
- Changed installation command to use the released package instead of GitHub repository.
- Updated kernel display name.
### Developer Notes
- Minor code refactoring and cleanup.
## [v0.3.7] - 2024-10-17
### New Features
1. **Enhanced Browser Stealth**:
- Implemented `playwright_stealth` for improved bot detection avoidance.
- Added `StealthConfig` for fine-tuned control over stealth parameters.
2. **User Simulation**:
- New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
3. **Navigator Override**:
- Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
4. **Improved iframe Handling**:
- New `process_iframes` parameter to extract and integrate iframe content into the main page.
5. **Flexible Browser Selection**:
- Support for choosing between Chromium, Firefox, and WebKit browsers.
6. **Include Links in Markdown**:
- Added support for including links in Markdown content by defining a new flag, `include_links_on_markdown`, in the `crawl` method.
### Improvements
1. **Better Error Handling**:
- Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
- Added console message and error logging for better debugging.
2. **Image Processing Enhancements**:
- Improved image dimension updating and filtering logic.
3. **Crawling Flexibility**:
- Added support for custom viewport sizes.
- Implemented delayed content retrieval with `delay_before_return_html` parameter.
4. **Performance Optimization**:
- Adjusted default semaphore count for parallel crawling.
### Bug Fixes
- Fixed an issue where the HTML content could be empty after processing.
### Examples
- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
### Developer Notes
- Refactored code for better maintainability and readability.
- Updated browser launch arguments for improved compatibility and performance.
## [v0.3.6] - 2024-10-12
### 1. Improved Crawling Control


@@ -8,14 +8,16 @@
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
## New in 0.3.72 ✨
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
- 📄 Fit markdown generation for extracting main article content.
- 🪄 Magic mode for comprehensive anti-bot detection bypass.
- 🌐 Enhanced multi-browser support with seamless switching (Chromium, Firefox, WebKit)
- 📚 New chunking strategies (Sliding window, Overlapping window, Flexible size control)
- 💾 Improved caching system for better performance
- ⚡ Optimized batch processing with automatic rate limiting
## New update 0.3.6
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🖼️ Improved image processing with lazy-loading detection
- 🔧 Custom page timeout parameter for better control over crawling behavior
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive page analysis
- ⏱️ Flexible timeout and delayed content retrieval options
## Try it Now!
@@ -28,28 +30,22 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
- 🆓 Completely free and open-source
- 🚀 Blazing fast performance, outperforming many paid services
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of pages with enhanced error handling
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScripts before crawling
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support for precise data extraction
- 📝 Passes instructions/keywords to refine extraction
- 🔒 Proxy support with authentication for enhanced access
- 🔄 Session management for complex multi-page crawling
- 🌐 Asynchronous architecture for improved performance
- 🖼️ Improved image processing with lazy-loading detection
- 🕰️ Enhanced handling of delayed content loading
- 🔑 Custom headers support for LLM interactions
- 🖼️ iframe content extraction for comprehensive analysis
- ⏱️ Flexible timeout and delayed content retrieval options
- 🔒 Proxy support for enhanced privacy and access
- 🔄 Session management for complex multi-page crawling scenarios
- 🌐 Asynchronous architecture for improved performance and scalability
## Installation 🛠️


@@ -3,7 +3,7 @@
from .async_webcrawler import AsyncWebCrawler
from .models import CrawlResult
__version__ = "0.3.72"
__version__ = "0.3.6"
__all__ = [
"AsyncWebCrawler",


@@ -1,558 +0,0 @@
import asyncio
import base64
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os
from playwright.async_api import async_playwright, Page, Browser, Error
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from playwright_stealth import stealth_async
class AsyncCrawlResponse(BaseModel):
html: str
response_headers: Dict[str, str]
status_code: int
screenshot: Optional[str] = None
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
class Config:
arbitrary_types_allowed = True
class AsyncCrawlerStrategy(ABC):
@abstractmethod
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
pass
@abstractmethod
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
pass
@abstractmethod
async def take_screenshot(self, url: str) -> str:
pass
@abstractmethod
def update_user_agent(self, user_agent: str):
pass
@abstractmethod
def set_hook(self, hook_type: str, hook: Callable):
pass
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
self.use_cached_html = use_cached_html
self.user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
self.proxy = kwargs.get("proxy")
self.headless = kwargs.get("headless", True)
self.browser_type = kwargs.get("browser_type", "chromium")
self.headers = kwargs.get("headers", {})
self.sessions = {}
self.session_ttl = 1800
self.js_code = js_code
self.verbose = kwargs.get("verbose", False)
self.playwright = None
self.browser = None
self.hooks = {
'on_browser_created': None,
'on_user_agent_updated': None,
'on_execution_started': None,
'before_goto': None,
'after_goto': None,
'before_return_html': None,
'before_retrieve_html': None
}
async def __aenter__(self):
await self.start()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
await self.close()
async def start(self):
if self.playwright is None:
self.playwright = await async_playwright().start()
if self.browser is None:
browser_args = {
"headless": self.headless,
"args": [
"--disable-gpu",
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-blink-features=AutomationControlled",
"--disable-infobars",
"--window-position=0,0",
"--ignore-certificate-errors",
"--ignore-certificate-errors-spki-list",
# "--headless=new", # Use the new headless mode
]
}
# Add proxy settings if a proxy is specified
if self.proxy:
proxy_settings = ProxySettings(server=self.proxy)
browser_args["proxy"] = proxy_settings
# Select the appropriate browser based on the browser_type
if self.browser_type == "firefox":
self.browser = await self.playwright.firefox.launch(**browser_args)
elif self.browser_type == "webkit":
self.browser = await self.playwright.webkit.launch(**browser_args)
else:
self.browser = await self.playwright.chromium.launch(**browser_args)
await self.execute_hook('on_browser_created', self.browser)
async def close(self):
if self.browser:
await self.browser.close()
self.browser = None
if self.playwright:
await self.playwright.stop()
self.playwright = None
def __del__(self):
if self.browser or self.playwright:
asyncio.get_event_loop().run_until_complete(self.close())
def set_hook(self, hook_type: str, hook: Callable):
if hook_type in self.hooks:
self.hooks[hook_type] = hook
else:
raise ValueError(f"Invalid hook type: {hook_type}")
async def execute_hook(self, hook_type: str, *args):
hook = self.hooks.get(hook_type)
if hook:
if asyncio.iscoroutinefunction(hook):
return await hook(*args)
else:
return hook(*args)
return args[0] if args else None
def update_user_agent(self, user_agent: str):
self.user_agent = user_agent
def set_custom_headers(self, headers: Dict[str, str]):
self.headers = headers
async def kill_session(self, session_id: str):
if session_id in self.sessions:
context, page, _ = self.sessions[session_id]
await page.close()
await context.close()
del self.sessions[session_id]
def _cleanup_expired_sessions(self):
current_time = time.time()
expired_sessions = [
sid for sid, (_, _, last_used) in self.sessions.items()
if current_time - last_used > self.session_ttl
]
for sid in expired_sessions:
asyncio.create_task(self.kill_session(sid))
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
wait_for = wait_for.strip()
if wait_for.startswith('js:'):
# Explicitly specified JavaScript
js_code = wait_for[3:].strip()
return await self.csp_compliant_wait(page, js_code, timeout)
elif wait_for.startswith('css:'):
# Explicitly specified CSS selector
css_selector = wait_for[4:].strip()
try:
await page.wait_for_selector(css_selector, timeout=timeout)
except Error as e:
if 'Timeout' in str(e):
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
else:
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
else:
# Auto-detect based on content
if wait_for.startswith('()') or wait_for.startswith('function'):
# It's likely a JavaScript function
return await self.csp_compliant_wait(page, wait_for, timeout)
else:
# Assume it's a CSS selector first
try:
await page.wait_for_selector(wait_for, timeout=timeout)
except Error as e:
if 'Timeout' in str(e):
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
else:
# If it's not a timeout error, it might be an invalid selector
# Let's try to evaluate it as a JavaScript function as a fallback
try:
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
except Error:
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
"It should be either a valid CSS selector, a JavaScript function, "
"or explicitly prefixed with 'js:' or 'css:'.")
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
wrapper_js = f"""
async () => {{
const userFunction = {user_wait_function};
const startTime = Date.now();
while (true) {{
if (await userFunction()) {{
return true;
}}
if (Date.now() - startTime > {timeout}) {{
throw new Error('Timeout waiting for condition');
}}
await new Promise(resolve => setTimeout(resolve, 100));
}}
}}
"""
try:
await page.evaluate(wrapper_js)
except TimeoutError:
raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
except Exception as e:
raise RuntimeError(f"Error in wait condition: {str(e)}")
async def process_iframes(self, page):
# Find all iframes
iframes = await page.query_selector_all('iframe')
for i, iframe in enumerate(iframes):
try:
# Add a unique identifier to the iframe
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
# Get the frame associated with this iframe
frame = await iframe.content_frame()
if frame:
# Wait for the frame to load
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
# Extract the content of the iframe's body
iframe_content = await frame.evaluate('() => document.body.innerHTML')
# Generate a unique class name for this iframe
class_name = f'extracted-iframe-content-{i}'
# Replace the iframe with a div containing the extracted content
_iframe = iframe_content.replace('`', '\\`')
await page.evaluate(f"""
() => {{
const iframe = document.getElementById('iframe-{i}');
const div = document.createElement('div');
div.innerHTML = `{_iframe}`;
div.className = '{class_name}';
iframe.replaceWith(div);
}}
""")
else:
print(f"Warning: Could not access content frame for iframe {i}")
except Exception as e:
print(f"Error processing iframe {i}: {str(e)}")
# Return the page object
return page
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
response_headers = {}
status_code = None
self._cleanup_expired_sessions()
session_id = kwargs.get("session_id")
if session_id:
context, page, _ = self.sessions.get(session_id, (None, None, None))
if not context:
context = await self.browser.new_context(
user_agent=self.user_agent,
viewport={"width": 1920, "height": 1080},
proxy={"server": self.proxy} if self.proxy else None
)
await context.set_extra_http_headers(self.headers)
page = await context.new_page()
self.sessions[session_id] = (context, page, time.time())
else:
context = await self.browser.new_context(
user_agent=self.user_agent,
viewport={"width": 1920, "height": 1080},
proxy={"server": self.proxy} if self.proxy else None
)
await context.set_extra_http_headers(self.headers)
if kwargs.get("override_navigator", False):
# Inject scripts to override navigator properties
await context.add_init_script("""
// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
window.navigator.chrome = {
runtime: {},
// Add other properties if necessary
};
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
});
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
Object.defineProperty(document, 'hidden', {
get: () => false
});
Object.defineProperty(document, 'visibilityState', {
get: () => 'visible'
});
""")
page = await context.new_page()
try:
if self.verbose:
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
if self.use_cached_html:
cache_file_path = os.path.join(
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
)
if os.path.exists(cache_file_path):
html = ""
with open(cache_file_path, "r") as f:
html = f.read()
# retrieve response headers and status code from cache
with open(cache_file_path + ".meta", "r") as f:
meta = json.load(f)
response_headers = meta.get("response_headers", {})
status_code = meta.get("status_code")
response = AsyncCrawlResponse(
html=html, response_headers=response_headers, status_code=status_code
)
return response
if not kwargs.get("js_only", False):
await self.execute_hook('before_goto', page)
response = await page.goto("about:blank")
await stealth_async(page)
response = await page.goto(
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
)
# await stealth_async(page)
# response = await page.goto("about:blank")
# await stealth_async(page)
# await page.evaluate(f"window.location.href = '{url}'")
await self.execute_hook('after_goto', page)
# Get status code and headers
status_code = response.status
response_headers = response.headers
else:
status_code = 200
response_headers = {}
await page.wait_for_selector('body')
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
if js_code:
if isinstance(js_code, str):
await page.evaluate(js_code)
elif isinstance(js_code, list):
for js in js_code:
await page.evaluate(js)
await page.wait_for_load_state('networkidle')
# Check for on execution event
await self.execute_hook('on_execution_started', page)
if kwargs.get("simulate_user", False):
# Simulate user interactions
await page.mouse.move(100, 100)
await page.mouse.down()
await page.mouse.up()
await page.keyboard.press('ArrowDown')
# Handle the wait_for parameter
wait_for = kwargs.get("wait_for")
if wait_for:
try:
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
except Exception as e:
raise RuntimeError(f"Wait condition failed: {str(e)}")
# Update image dimensions
update_image_dimensions_js = """
() => {
return new Promise((resolve) => {
const filterImage = (img) => {
// Filter out images that are too small
if (img.width < 100 && img.height < 100) return false;
// Filter out images that are not visible
const rect = img.getBoundingClientRect();
if (rect.width === 0 || rect.height === 0) return false;
// Filter out images with certain class names (e.g., icons, thumbnails)
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
// Filter out images with certain patterns in their src (e.g., placeholder images)
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
return true;
};
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
let imagesLeft = images.length;
if (imagesLeft === 0) {
resolve();
return;
}
const checkImage = (img) => {
if (img.complete && img.naturalWidth !== 0) {
img.setAttribute('width', img.naturalWidth);
img.setAttribute('height', img.naturalHeight);
imagesLeft--;
if (imagesLeft === 0) resolve();
}
};
images.forEach(img => {
checkImage(img);
if (!img.complete) {
img.onload = () => {
checkImage(img);
};
img.onerror = () => {
imagesLeft--;
if (imagesLeft === 0) resolve();
};
}
});
// Fallback timeout of 5 seconds
setTimeout(() => resolve(), 5000);
});
}
"""
await page.evaluate(update_image_dimensions_js)
# Wait a bit for any onload events to complete
await page.wait_for_timeout(100)
# Process iframes
if kwargs.get("process_iframes", False):
page = await self.process_iframes(page)
await self.execute_hook('before_retrieve_html', page)
# Check if delay_before_return_html is set then wait for that time
delay_before_return_html = kwargs.get("delay_before_return_html")
if delay_before_return_html:
await asyncio.sleep(delay_before_return_html)
html = await page.content()
await self.execute_hook('before_return_html', page, html)
# Check if kwargs has screenshot=True then take screenshot
screenshot_data = None
if kwargs.get("screenshot"):
screenshot_data = await self.take_screenshot(url)
if self.verbose:
print(f"[LOG] ✅ Crawled {url} successfully!")
if self.use_cached_html:
cache_file_path = os.path.join(
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
)
with open(cache_file_path, "w", encoding="utf-8") as f:
f.write(html)
# store response headers and status code in cache
with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
json.dump({
"response_headers": response_headers,
"status_code": status_code
}, f)
async def get_delayed_content(delay: float = 5.0) -> str:
if self.verbose:
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
await asyncio.sleep(delay)
return await page.content()
response = AsyncCrawlResponse(
html=html,
response_headers=response_headers,
status_code=status_code,
screenshot=screenshot_data,
get_delayed_content=get_delayed_content
)
return response
except Error as e:
raise Error(f"Failed to crawl {url}: {str(e)}")
finally:
if not session_id:
await page.close()
await context.close()
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
semaphore = asyncio.Semaphore(semaphore_count)
async def crawl_with_semaphore(url):
async with semaphore:
return await self.crawl(url, **kwargs)
tasks = [crawl_with_semaphore(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [result if not isinstance(result, Exception) else str(result) for result in results]
async def take_screenshot(self, url: str, wait_time=1000) -> str:
async with await self.browser.new_context(user_agent=self.user_agent) as context:
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Wait for a specified time (default is 1 second)
await page.wait_for_timeout(wait_time)
screenshot = await page.screenshot(full_page=True)
return base64.b64encode(screenshot).decode('utf-8')
except Exception as e:
error_message = f"Failed to take screenshot: {str(e)}"
print(error_message)
# Generate an error image
img = Image.new('RGB', (800, 600), color='black')
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
buffered = BytesIO()
img.save(buffered, format="JPEG")
return base64.b64encode(buffered.getvalue()).decode('utf-8')
finally:
await page.close()
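
Note on `crawl_many`: the method bounds concurrency with an `asyncio.Semaphore` and collects failures through `asyncio.gather(..., return_exceptions=True)`. The following is a minimal, self-contained sketch of that pattern, not the project's implementation. `fake_crawl`, `crawl_many_sketch`, and the example URLs are hypothetical stand-ins introduced only for illustration; no browser is involved.

```python
import asyncio
from typing import List


async def fake_crawl(url: str) -> str:
    """Hypothetical stand-in for a real crawl; raises for URLs containing 'bad'."""
    await asyncio.sleep(0.01)  # simulate network latency
    if "bad" in url:
        raise RuntimeError(f"Failed to crawl {url}")
    return f"<html>{url}</html>"


async def crawl_many_sketch(urls: List[str], semaphore_count: int = 5) -> List[str]:
    # At most `semaphore_count` crawls hold the semaphore at any moment.
    semaphore = asyncio.Semaphore(semaphore_count)

    async def crawl_with_semaphore(url: str) -> str:
        async with semaphore:
            return await fake_crawl(url)

    tasks = [crawl_with_semaphore(url) for url in urls]
    # return_exceptions=True keeps one failure from cancelling the whole batch;
    # results come back in the same order as `urls`.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # As in the diff, exceptions are flattened to their string form.
    return [r if not isinstance(r, Exception) else str(r) for r in results]


def run_demo() -> List[str]:
    urls = ["https://a.example", "https://bad.example", "https://c.example"]
    return asyncio.run(crawl_many_sketch(urls, semaphore_count=2))
```

One consequence of flattening exceptions to strings is that callers cannot tell a failed crawl from a successful one by type alone. Returning the exception objects unchanged would be an alternative, at the cost of a mixed return type.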

View File

@@ -1,35 +1,17 @@
import asyncio
import base64
import time
import base64, time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os
from playwright.async_api import async_playwright, Page, Browser, Error
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from .utils import sanitize_input_encode, calculate_semaphore_count
import json, uuid
import hashlib
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from playwright_stealth import StealthConfig, stealth_async
stealth_config = StealthConfig(
webdriver=True,
chrome_app=True,
chrome_csi=True,
chrome_load_times=True,
chrome_runtime=True,
navigator_languages=True,
navigator_plugins=True,
navigator_permissions=True,
webgl_vendor=True,
outerdimensions=True,
navigator_hardware_concurrency=True,
media_codecs=True,
)
class AsyncCrawlResponse(BaseModel):
html: str
@@ -51,7 +33,7 @@ class AsyncCrawlerStrategy(ABC):
pass
@abstractmethod
async def take_screenshot(self, **kwargs) -> str:
async def take_screenshot(self, url: str) -> str:
pass
@abstractmethod
@@ -65,15 +47,10 @@ class AsyncCrawlerStrategy(ABC):
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
self.use_cached_html = use_cached_html
self.user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
self.proxy = kwargs.get("proxy")
self.proxy_config = kwargs.get("proxy_config")
self.headless = kwargs.get("headless", True)
self.browser_type = kwargs.get("browser_type", "chromium")
self.browser_type = kwargs.get("browser_type", "chromium") # New parameter
self.headers = kwargs.get("headers", {})
self.sessions = {}
self.session_ttl = 1800
@@ -81,7 +58,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self.verbose = kwargs.get("verbose", False)
self.playwright = None
self.browser = None
self.sleep_on_close = kwargs.get("sleep_on_close", False)
self.hooks = {
'on_browser_created': None,
'on_user_agent_updated': None,
@@ -107,14 +83,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"headless": self.headless,
"args": [
"--disable-gpu",
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-blink-features=AutomationControlled",
"--disable-infobars",
"--window-position=0,0",
"--ignore-certificate-errors",
"--ignore-certificate-errors-spki-list",
# "--headless=new", # Use the new headless mode
"--disable-setuid-sandbox",
"--no-sandbox",
]
}
@@ -122,9 +93,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if self.proxy:
proxy_settings = ProxySettings(server=self.proxy)
browser_args["proxy"] = proxy_settings
elif self.proxy_config:
proxy_settings = ProxySettings(server=self.proxy_config.get("server"), username=self.proxy_config.get("username"), password=self.proxy_config.get("password"))
browser_args["proxy"] = proxy_settings
# Select the appropriate browser based on the browser_type
if self.browser_type == "firefox":
@@ -137,8 +106,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
await self.execute_hook('on_browser_created', self.browser)
async def close(self):
if self.sleep_on_close:
await asyncio.sleep(0.5)
if self.browser:
await self.browser.close()
self.browser = None
@@ -180,10 +147,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
def _cleanup_expired_sessions(self):
current_time = time.time()
expired_sessions = [
sid for sid, (_, _, last_used) in self.sessions.items()
if current_time - last_used > self.session_ttl
]
expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items()
if current_time - last_used > self.session_ttl]
for sid in expired_sessions:
asyncio.create_task(self.kill_session(sid))
@@ -223,8 +188,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
except Error:
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
"It should be either a valid CSS selector, a JavaScript function, "
"or explicitly prefixed with 'js:' or 'css:'.")
"It should be either a valid CSS selector, a JavaScript function, "
"or explicitly prefixed with 'js:' or 'css:'.")
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
wrapper_js = f"""
@@ -289,7 +254,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
print(f"Error processing iframe {i}: {str(e)}")
# Return the page object
return page
return page
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
response_headers = {}
@@ -302,70 +268,25 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if not context:
context = await self.browser.new_context(
user_agent=self.user_agent,
viewport={"width": 1920, "height": 1080},
proxy={"server": self.proxy} if self.proxy else None,
accept_downloads=True,
java_script_enabled=True
proxy={"server": self.proxy} if self.proxy else None
)
await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
await context.set_extra_http_headers(self.headers)
page = await context.new_page()
self.sessions[session_id] = (context, page, time.time())
else:
context = await self.browser.new_context(
user_agent=self.user_agent,
viewport={"width": 1920, "height": 1080},
proxy={"server": self.proxy} if self.proxy else None
user_agent=self.user_agent,
proxy={"server": self.proxy} if self.proxy else None
)
await context.set_extra_http_headers(self.headers)
if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
# Inject scripts to override navigator properties
await context.add_init_script("""
// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
window.navigator.chrome = {
runtime: {},
// Add other properties if necessary
};
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
});
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
Object.defineProperty(document, 'hidden', {
get: () => false
});
Object.defineProperty(document, 'visibilityState', {
get: () => 'visible'
});
""")
page = await context.new_page()
# await stealth_async(page) #, stealth_config)
# Add console message and error logging
if kwargs.get("log_console", False):
page.on("console", lambda msg: print(f"Console: {msg.text}"))
page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
try:
if self.verbose:
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
if self.use_cached_html:
cache_file_path = os.path.join(
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
)
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
if os.path.exists(cache_file_path):
html = ""
with open(cache_file_path, "r") as f:
@@ -375,21 +296,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
meta = json.load(f)
response_headers = meta.get("response_headers", {})
status_code = meta.get("status_code")
response = AsyncCrawlResponse(
html=html, response_headers=response_headers, status_code=status_code
)
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
return response
if not kwargs.get("js_only", False):
await self.execute_hook('before_goto', page)
response = await page.goto(
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
)
# response = await page.goto("about:blank")
# await page.evaluate(f"window.location.href = '{url}'")
response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
await self.execute_hook('after_goto', page)
# Get status code and headers
@@ -399,30 +311,37 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
status_code = 200
response_headers = {}
await page.wait_for_selector('body')
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
if js_code:
if isinstance(js_code, str):
await page.evaluate(js_code)
r = await page.evaluate(js_code)
elif isinstance(js_code, list):
for js in js_code:
await page.evaluate(js)
# await page.wait_for_timeout(100)
await page.wait_for_load_state('networkidle')
# Check for on execution event
# Check for on execution even
await self.execute_hook('on_execution_started', page)
if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
# Simulate user interactions
await page.mouse.move(100, 100)
await page.mouse.down()
await page.mouse.up()
await page.keyboard.press('ArrowDown')
# Handle the wait_for parameter
# New code to handle the wait_for parameter
# Example usage:
# await crawler.crawl(
# url,
# js_code="// some JavaScript code",
# wait_for="""() => {
# return document.querySelector('#my-element') !== null;
# }"""
# )
# Example of using a CSS selector:
# await crawler.crawl(
# url,
# wait_for="#my-element"
# )
wait_for = kwargs.get("wait_for")
if wait_for:
try:
@@ -430,7 +349,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
except Exception as e:
raise RuntimeError(f"Wait condition failed: {str(e)}")
# Update image dimensions
# Check if kwargs has screenshot=True then take screenshot
screenshot_data = None
if kwargs.get("screenshot"):
screenshot_data = await self.take_screenshot(url)
# New code to update image dimensions
update_image_dimensions_js = """
() => {
return new Promise((resolve) => {
@@ -482,8 +407,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
});
// Fallback timeout of 5 seconds
// setTimeout(() => resolve(), 5000);
resolve();
setTimeout(() => resolve(), 5000);
});
}
"""
@@ -502,29 +426,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if delay_before_return_html:
await asyncio.sleep(delay_before_return_html)
# Check for remove_overlay_elements parameter
if kwargs.get("remove_overlay_elements", False):
await self.remove_overlay_elements(page)
html = await page.content()
await self.execute_hook('before_return_html', page, html)
# Check if kwargs has screenshot=True then take screenshot
screenshot_data = None
if kwargs.get("screenshot"):
# Check we have screenshot_wait_for parameter, if we have simply wait for that time
screenshot_wait_for = kwargs.get("screenshot_wait_for")
if screenshot_wait_for:
await asyncio.sleep(screenshot_wait_for)
screenshot_data = await self.take_screenshot(page)
if self.verbose:
print(f"[LOG] ✅ Crawled {url} successfully!")
if self.use_cached_html:
cache_file_path = os.path.join(
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
)
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
with open(cache_file_path, "w", encoding="utf-8") as f:
f.write(html)
# store response headers and status code in cache
@@ -534,6 +443,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"status_code": status_code
}, f)
async def get_delayed_content(delay: float = 5.0) -> str:
if self.verbose:
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
@@ -549,14 +459,63 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
)
return response
except Error as e:
raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
# finally:
# if not session_id:
# await page.close()
# await context.close()
raise Error(f"Failed to crawl {url}: {str(e)}")
finally:
if not session_id:
await page.close()
# try:
# html = await _crawl()
# return sanitize_input_encode(html)
# except Error as e:
# raise Error(f"Failed to crawl {url}: {str(e)}")
# except Exception as e:
# raise Exception(f"Failed to crawl {url}: {str(e)}")
async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
"""
Execute JavaScript code in a specific session and optionally wait for a condition.
:param session_id: The ID of the session to execute the JS code in.
:param js_code: The JavaScript code to execute.
:param wait_for_js: JavaScript condition to wait for after execution.
:param wait_for_css: CSS selector to wait for after execution.
:return: AsyncCrawlResponse containing the page's HTML and other information.
:raises ValueError: If the session does not exist.
"""
if not session_id:
raise ValueError("Session ID must be provided")
if session_id not in self.sessions:
raise ValueError(f"No active session found for session ID: {session_id}")
context, page, last_used = self.sessions[session_id]
try:
await page.evaluate(js_code)
if wait_for_js:
await page.wait_for_function(wait_for_js)
if wait_for_css:
await page.wait_for_selector(wait_for_css)
# Get the updated HTML content
html = await page.content()
# Get response headers and status code (assuming these are available)
response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
# Update the last used time for this session
self.sessions[session_id] = (context, page, time.time())
return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
except Error as e:
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
semaphore = asyncio.Semaphore(semaphore_count)
async def crawl_with_semaphore(url):
@@ -567,156 +526,27 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
results = await asyncio.gather(*tasks, return_exceptions=True)
return [result if not isinstance(result, Exception) else str(result) for result in results]
async def remove_overlay_elements(self, page: Page) -> None:
"""
Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.
Args:
page (Page): The Playwright page instance
"""
remove_overlays_js = """
async () => {
// Function to check if element is visible
const isVisible = (elem) => {
const style = window.getComputedStyle(elem);
return style.display !== 'none' &&
style.visibility !== 'hidden' &&
style.opacity !== '0';
};
async def take_screenshot(self, url: str, wait_time = 1000) -> str:
async with await self.browser.new_context(user_agent=self.user_agent) as context:
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Wait for a specified time (default is 1 second)
await page.wait_for_timeout(wait_time)
screenshot = await page.screenshot(full_page=True)
return base64.b64encode(screenshot).decode('utf-8')
except Exception as e:
error_message = f"Failed to take screenshot: {str(e)}"
print(error_message)
// Common selectors for popups and overlays
const commonSelectors = [
// Close buttons first
'button[class*="close" i]', 'button[class*="dismiss" i]',
'button[aria-label*="close" i]', 'button[title*="close" i]',
'a[class*="close" i]', 'span[class*="close" i]',
# Generate an error image
img = Image.new('RGB', (800, 600), color='black')
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
// Cookie notices
'[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
'[class*="cookie-consent" i]', '[id*="cookie-consent" i]',
// Newsletter/subscription dialogs
'[class*="newsletter" i]', '[class*="subscribe" i]',
// Generic popups/modals
'[class*="popup" i]', '[class*="modal" i]',
'[class*="overlay" i]', '[class*="dialog" i]',
'[role="dialog"]', '[role="alertdialog"]'
];
// Try to click close buttons first
for (const selector of commonSelectors.slice(0, 6)) {
const closeButtons = document.querySelectorAll(selector);
for (const button of closeButtons) {
if (isVisible(button)) {
try {
button.click();
await new Promise(resolve => setTimeout(resolve, 100));
} catch (e) {
console.log('Error clicking button:', e);
}
}
}
}
// Remove remaining overlay elements
const removeOverlays = () => {
// Find elements with high z-index
const allElements = document.querySelectorAll('*');
for (const elem of allElements) {
const style = window.getComputedStyle(elem);
const zIndex = parseInt(style.zIndex);
const position = style.position;
if (
isVisible(elem) &&
(zIndex > 999 || position === 'fixed' || position === 'absolute') &&
(
elem.offsetWidth > window.innerWidth * 0.5 ||
elem.offsetHeight > window.innerHeight * 0.5 ||
style.backgroundColor.includes('rgba') ||
parseFloat(style.opacity) < 1
)
) {
elem.remove();
}
}
// Remove elements matching common selectors
for (const selector of commonSelectors) {
const elements = document.querySelectorAll(selector);
elements.forEach(elem => {
if (isVisible(elem)) {
elem.remove();
}
});
}
};
// Remove overlay elements
removeOverlays();
// Remove any fixed/sticky position elements at the top/bottom
const removeFixedElements = () => {
const elements = document.querySelectorAll('*');
elements.forEach(elem => {
const style = window.getComputedStyle(elem);
if (
(style.position === 'fixed' || style.position === 'sticky') &&
isVisible(elem)
) {
elem.remove();
}
});
};
removeFixedElements();
// Remove empty block elements as: div, p, span, etc.
const removeEmptyBlockElements = () => {
const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
blockElements.forEach(elem => {
if (elem.innerText.trim() === '') {
elem.remove();
}
});
};
// Remove margin-right and padding-right from body (often added by modal scripts)
document.body.style.marginRight = '0px';
document.body.style.paddingRight = '0px';
document.body.style.overflow = 'auto';
// Wait a bit for any animations to complete
await new Promise(resolve => setTimeout(resolve, 100));
}
"""
try:
await page.evaluate(remove_overlays_js)
await page.wait_for_timeout(500) # Wait for any animations to complete
except Exception as e:
if self.verbose:
print(f"Warning: Failed to remove overlay elements: {str(e)}")
async def take_screenshot(self, page: Page) -> str:
try:
# The page is already loaded, just take the screenshot
screenshot = await page.screenshot(full_page=True)
return base64.b64encode(screenshot).decode('utf-8')
except Exception as e:
error_message = f"Failed to take screenshot: {str(e)}"
print(error_message)
# Generate an error image
img = Image.new('RGB', (800, 600), color='black')
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
buffered = BytesIO()
img.save(buffered, format="JPEG")
return base64.b64encode(buffered.getvalue()).decode('utf-8')
finally:
await page.close()
buffered = BytesIO()
img.save(buffered, format="JPEG")
return base64.b64encode(buffered.getvalue()).decode('utf-8')
finally:
await page.close()
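
Note on the HTML cache: when `use_cached_html` is set, `crawl` stores the page under an MD5 hash of the URL and writes headers and status code to a sibling `.meta` file. The following is a rough sketch of that layout, assuming a caller-supplied cache directory rather than `~/.crawl4ai/cache`. The helper names (`cache_path_for`, `write_cache`, `read_cache`, `demo_roundtrip`) are hypothetical, and the check for a missing `.meta` file is an added assumption, since the corresponding lines are not visible in the hunk.

```python
import hashlib
import json
import os
import tempfile
from typing import Dict, Optional, Tuple


def cache_path_for(url: str, cache_dir: str) -> str:
    # The file name is the hex MD5 digest of the URL, as in the diff.
    return os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())


def write_cache(url: str, html: str, response_headers: Dict[str, str],
                status_code: Optional[int], cache_dir: str) -> None:
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path_for(url, cache_dir)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    # Headers and status code go to a sibling ".meta" JSON file.
    with open(path + ".meta", "w", encoding="utf-8") as f:
        json.dump({"response_headers": response_headers,
                   "status_code": status_code}, f)


def read_cache(url: str, cache_dir: str
               ) -> Optional[Tuple[str, Dict[str, str], Optional[int]]]:
    path = cache_path_for(url, cache_dir)
    if not os.path.exists(path):
        return None  # cache miss
    with open(path, "r", encoding="utf-8") as f:
        html = f.read()
    meta: dict = {}
    # Assumption: tolerate a missing .meta file instead of failing.
    if os.path.exists(path + ".meta"):
        with open(path + ".meta", "r", encoding="utf-8") as f:
            meta = json.load(f)
    return html, meta.get("response_headers", {}), meta.get("status_code")


def demo_roundtrip() -> Optional[Tuple[str, Dict[str, str], Optional[int]]]:
    with tempfile.TemporaryDirectory() as cache_dir:
        write_cache("https://example.com", "<html></html>",
                    {"content-type": "text/html"}, 200, cache_dir)
        return read_cache("https://example.com", cache_dir)
```

MD5 serves here only as a stable file-name hash, not as a security measure. The diff reads the cached HTML without an explicit `encoding`, while it writes with `encoding="utf-8"`; the sketch uses UTF-8 on both sides so the round trip does not depend on the platform default.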

View File

@@ -23,15 +23,13 @@ class AsyncWebCrawler:
self,
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
always_by_pass_cache: bool = False,
base_directory: str = str(Path.home()),
**kwargs,
):
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
**kwargs
)
self.always_by_pass_cache = always_by_pass_cache
# self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
self.ready = False
@@ -135,8 +133,8 @@ class AsyncWebCrawler:
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
async def arun_many(
self,
@@ -188,8 +186,7 @@ class AsyncWebCrawler:
try:
t1 = time.time()
scrapping_strategy = WebScrappingStrategy()
# result = await scrapping_strategy.ascrap(
result = scrapping_strategy.scrap(
result = await scrapping_strategy.ascrap(
url,
html,
word_count_threshold=word_count_threshold,
@@ -198,7 +195,6 @@ class AsyncWebCrawler:
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
**kwargs,
)
if verbose:
print(
@@ -214,8 +210,6 @@ class AsyncWebCrawler:
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
fit_html = sanitize_input_encode(result.get("fit_html", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
@@ -262,8 +256,6 @@ class AsyncWebCrawler:
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
fit_markdown=fit_markdown,
fit_html= fit_html,
media=media,
links=links,
metadata=metadata,

View File

@@ -84,12 +84,6 @@ class TopicSegmentationChunking(ChunkingStrategy):
# Fixed-length word chunks
class FixedLengthWordChunking(ChunkingStrategy):
def __init__(self, chunk_size=100, **kwargs):
"""
Initialize the fixed-length word chunking strategy with the given chunk size.
Args:
chunk_size (int): The size of each chunk in words.
"""
self.chunk_size = chunk_size
def chunk(self, text: str) -> list:
@@ -99,64 +93,14 @@ class FixedLengthWordChunking(ChunkingStrategy):
# Sliding window chunking
class SlidingWindowChunking(ChunkingStrategy):
def __init__(self, window_size=100, step=50, **kwargs):
"""
Initialize the sliding window chunking strategy with the given window size and
step size.
Args:
window_size (int): The size of the sliding window in words.
step (int): The step size for sliding the window in words.
"""
self.window_size = window_size
self.step = step
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
for i in range(0, len(words) - self.window_size + 1, self.step):
chunk = ' '.join(words[i:i + self.window_size])
chunks.append(chunk)
# Handle the last chunk if it doesn't align perfectly
if i + self.window_size < len(words):
chunks.append(' '.join(words[-self.window_size:]))
for i in range(0, len(words), self.step):
chunks.append(' '.join(words[i:i + self.window_size]))
return chunks
class OverlappingWindowChunking(ChunkingStrategy):
def __init__(self, window_size=1000, overlap=100, **kwargs):
"""
Initialize the overlapping window chunking strategy with the given window size and
overlap size.
Args:
window_size (int): The size of the window in words.
overlap (int): The size of the overlap between consecutive chunks in words.
"""
self.window_size = window_size
self.overlap = overlap
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
start = 0
while start < len(words):
end = start + self.window_size
chunk = ' '.join(words[start:end])
chunks.append(chunk)
if end >= len(words):
break
start = end - self.overlap
return chunks
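
Note on the chunking hunk: the two window strategies differ mainly in how they treat the tail of the text. The sketch below restates the longer variants from the hunk as plain functions so the behaviour can be checked in isolation. The function names are hypothetical, and the `overlap >= window_size` guard is an added assumption that does not appear in the diff; without it the `while` loop would never advance.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks: List[str] = []
    last_start = 0
    # Emit only full windows, advancing by `step` words each time.
    for last_start in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[last_start:last_start + window_size]))
    # If the final full window stopped short of the end, add one more
    # window aligned to the end so no trailing words are dropped.
    if last_start + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


def overlapping_window_chunks(text: str, window_size: int = 1000, overlap: int = 100) -> List[str]:
    # Assumption (not in the diff): reject overlaps that would stall the loop.
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # The next window re-reads the last `overlap` words of this one.
        start = end - overlap
    return chunks
```

For the seven-word text `"a b c d e f g"` with `window_size=3` and `step=3`, the sliding variant yields `["a b c", "d e f", "e f g"]`. The shorter loop in the same hunk, `range(0, len(words), self.step)`, would instead end with a one-word chunk `"g"`, which may or may not be desirable depending on how downstream consumers handle short chunks.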

View File

@@ -4,23 +4,24 @@ from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env file
# Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
DEFAULT_PROVIDER = "openai/gpt-4o-mini"
DEFAULT_PROVIDER = "openai/gpt-4-turbo"
MODEL_REPO_BRANCH = "new-release-0.0.2"
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
"ollama/llama3": "no-token-needed", # Ollama models do not need an API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
"openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
}
# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
CHUNK_TOKEN_THRESHOLD = 500
OVERLAP_RATE = 0.1
WORD_TOKEN_RATE = 1.3
@@ -28,20 +29,6 @@ WORD_TOKEN_RATE = 1.3
MIN_WORD_THRESHOLD = 1
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']
ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
SOCIAL_MEDIA_DOMAINS = [
'facebook.com',
'twitter.com',
'x.com',
'linkedin.com',
'instagram.com',
'pinterest.com',
'tiktok.com',
'snapchat.com',
'reddit.com',
]
# Threshold for the Image extraction - Range is 1 to 6
# Images are scored on a point-based system so they can be filtered by usefulness. Points are assigned
# to each image based on the following aspects.

View File

@@ -1,196 +0,0 @@
from bs4 import BeautifulSoup, Tag
import re
from typing import Optional
class ContentCleaningStrategy:
def __init__(self):
# Precompile regex patterns for performance
self.negative_patterns = re.compile(r'nav|footer|header|sidebar|ads|comment', re.I)
self.positive_patterns = re.compile(r'content|article|main|post', re.I)
self.priority_tags = {'article', 'main', 'section', 'div'}
self.non_content_tags = {'nav', 'footer', 'header', 'aside'}
# Thresholds
self.text_density_threshold = 9.0
self.min_word_count = 50
self.link_density_threshold = 0.2
self.max_dom_depth = 10 # To prevent excessive DOM traversal
def clean(self, clean_html: str) -> str:
"""
Main function that takes cleaned HTML and returns super cleaned HTML.
Args:
clean_html (str): The cleaned HTML content.
Returns:
str: The super cleaned HTML containing only the main content.
"""
try:
if not clean_html or not isinstance(clean_html, str):
return ''
soup = BeautifulSoup(clean_html, 'html.parser')
main_content = self.extract_main_content(soup)
if main_content:
super_clean_element = self.clean_element(main_content)
return str(super_clean_element)
else:
return ''
except Exception:
# Handle exceptions silently or log them as needed
return ''
def extract_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
"""
Identifies and extracts the main content element from the HTML.
Args:
soup (BeautifulSoup): The parsed HTML soup.
Returns:
Optional[Tag]: The Tag object containing the main content, or None if not found.
"""
candidates = []
for element in soup.find_all(self.priority_tags):
if self.is_non_content_tag(element):
continue
if self.has_negative_class_id(element):
continue
score = self.calculate_content_score(element)
candidates.append((score, element))
if not candidates:
return None
# Sort candidates by score in descending order
candidates.sort(key=lambda x: x[0], reverse=True)
# Select the element with the highest score
best_element = candidates[0][1]
return best_element
def calculate_content_score(self, element: Tag) -> float:
"""
Calculates a score for an element based on various heuristics.
Args:
element (Tag): The HTML element to score.
Returns:
float: The content score of the element.
"""
score = 0.0
if self.is_priority_tag(element):
score += 5.0
if self.has_positive_class_id(element):
score += 3.0
if self.has_negative_class_id(element):
score -= 3.0
if self.is_high_text_density(element):
score += 2.0
if self.is_low_link_density(element):
score += 2.0
if self.has_sufficient_content(element):
score += 2.0
if self.has_headings(element):
score += 3.0
dom_depth = self.calculate_dom_depth(element)
score += min(dom_depth, self.max_dom_depth) * 0.5 # Adjust weight as needed
return score
def is_priority_tag(self, element: Tag) -> bool:
"""Checks if the element is a priority tag."""
return element.name in self.priority_tags
def is_non_content_tag(self, element: Tag) -> bool:
"""Checks if the element is a non-content tag."""
return element.name in self.non_content_tags
def has_negative_class_id(self, element: Tag) -> bool:
"""Checks if the element has negative indicators in its class or id."""
class_id = ' '.join(filter(None, [
self.get_attr_str(element.get('class')),
element.get('id', '')
]))
return bool(self.negative_patterns.search(class_id))
def has_positive_class_id(self, element: Tag) -> bool:
"""Checks if the element has positive indicators in its class or id."""
class_id = ' '.join(filter(None, [
self.get_attr_str(element.get('class')),
element.get('id', '')
]))
return bool(self.positive_patterns.search(class_id))
@staticmethod
def get_attr_str(attr) -> str:
"""Converts an attribute value to a string."""
if isinstance(attr, list):
return ' '.join(attr)
elif isinstance(attr, str):
return attr
else:
return ''
def is_high_text_density(self, element: Tag) -> bool:
"""Determines if the element has high text density."""
text_density = self.calculate_text_density(element)
return text_density > self.text_density_threshold
def calculate_text_density(self, element: Tag) -> float:
"""Calculates the text density of an element."""
text_length = len(element.get_text(strip=True))
tag_count = len(element.find_all())
tag_count = tag_count or 1 # Prevent division by zero
return text_length / tag_count
def is_low_link_density(self, element: Tag) -> bool:
"""Determines if the element has low link density."""
link_density = self.calculate_link_density(element)
return link_density < self.link_density_threshold
def calculate_link_density(self, element: Tag) -> float:
"""Calculates the link density of an element."""
text = element.get_text(strip=True)
if not text:
return 0.0
link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
return len(link_text) / len(text) if text else 0.0
def has_sufficient_content(self, element: Tag) -> bool:
"""Checks if the element has sufficient word count."""
word_count = len(element.get_text(strip=True).split())
return word_count >= self.min_word_count
def calculate_dom_depth(self, element: Tag) -> int:
"""Calculates the depth of an element in the DOM tree."""
depth = 0
current_element = element
while current_element.parent and depth < self.max_dom_depth:
depth += 1
current_element = current_element.parent
return depth
def has_headings(self, element: Tag) -> bool:
"""Checks if the element contains heading tags."""
return bool(element.find(['h1', 'h2', 'h3']))
def clean_element(self, element: Tag) -> Tag:
"""
Cleans the selected element by removing unnecessary attributes and nested non-content elements.
Args:
element (Tag): The HTML element to clean.
Returns:
Tag: The cleaned HTML element.
"""
for tag in element.find_all(['script', 'style', 'aside']):
tag.decompose()
for tag in element.find_all():
attrs = dict(tag.attrs)
for attr in attrs:
if attr in ['style', 'onclick', 'onmouseover', 'align', 'bgcolor']:
del tag.attrs[attr]
return element
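
The weights in `calculate_content_score` above amount to a simple additive model. The sketch below restates that arithmetic on plain booleans so it runs without BeautifulSoup. The `Signals` dataclass, the free function `content_score`, and the default `max_dom_depth=10` are illustrative assumptions, not names or values taken from this diff.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical container for the per-element checks used by the scorer."""
    priority_tag: bool = False
    positive_class_id: bool = False
    negative_class_id: bool = False
    high_text_density: bool = False
    low_link_density: bool = False
    sufficient_content: bool = False
    has_headings: bool = False
    dom_depth: int = 0


def content_score(s: Signals, max_dom_depth: int = 10) -> float:
    """Add up the same weights the diff applies, one heuristic at a time."""
    score = 0.0
    if s.priority_tag:
        score += 5.0
    if s.positive_class_id:
        score += 3.0
    if s.negative_class_id:
        score -= 3.0
    if s.high_text_density:
        score += 2.0
    if s.low_link_density:
        score += 2.0
    if s.sufficient_content:
        score += 2.0
    if s.has_headings:
        score += 3.0
    # Depth contributes 0.5 per level, capped at max_dom_depth (assumed value).
    score += min(s.dom_depth, max_dom_depth) * 0.5
    return score


example = content_score(Signals(priority_tag=True, has_headings=True, dom_depth=4))
```

Under these weights, a priority tag that contains headings and sits four levels deep scores 5.0 + 3.0 + 2.0 = 10.0.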

View File

@@ -7,17 +7,13 @@ from .config import *
from bs4 import element, NavigableString, Comment
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
from .content_cleaning_strategy import ContentCleaningStrategy
from .utils import (
sanitize_input_encode,
sanitize_html,
extract_metadata,
InvalidCSSSelectorError,
CustomHTML2Text,
normalize_url,
is_external_url
CustomHTML2Text
)
class ContentScrappingStrategy(ABC):
@@ -37,14 +33,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
success = True
if not html:
return None
soup = BeautifulSoup(html, 'html.parser')
body = soup.body
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
for tag in kwargs.get('excluded_tags', []) or []:
@@ -70,8 +64,6 @@ class WebScrappingStrategy(ContentScrappingStrategy):
links = {'internal': [], 'external': []}
media = {'images': [], 'videos': [], 'audios': []}
internal_links_dict = {}
external_links_dict = {}
# Extract meaningful text for media files from closest parent
def find_closest_parent_with_useful_text(tag):
@@ -133,11 +125,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
image_width = img.get('width')
width_value, width_unit = parse_dimension(image_width)
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
image_src = img.get('src','')
if "data:image/" in image_src:
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
else:
image_format = os.path.splitext(img.get('src',''))[1].lower()
image_format = os.path.splitext(img.get('src',''))[1].lower()
# Remove . from format
image_format = image_format.strip('.').split('?')[0]
score = 0
@@ -161,8 +149,6 @@ class WebScrappingStrategy(ContentScrappingStrategy):
score+=1
return score
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
return None
score = score_image_for_usefulness(img, url, index, total_images)
@@ -177,19 +163,6 @@ class WebScrappingStrategy(ContentScrappingStrategy):
'type': 'image'
}
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
attrs_to_remove = []
for attr in element.attrs:
if attr not in important_attrs:
if keep_data_attributes:
if not attr.startswith('data-'):
attrs_to_remove.append(attr)
else:
attrs_to_remove.append(attr)
for attr in attrs_to_remove:
del element[attr]
def process_element(element: element.PageElement) -> bool:
try:
if isinstance(element, NavigableString):
@@ -206,106 +179,21 @@ class WebScrappingStrategy(ContentScrappingStrategy):
return False
keep_element = False
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
exclude_social_media_domains = list(set(exclude_social_media_domains))
try:
if element.name == 'a' and element.get('href'):
href = element.get('href', '').strip()
if not href: # Skip empty hrefs
return False
url_base = url.split('/')[2]
# Normalize the URL
try:
normalized_href = normalize_url(href, url)
except ValueError as e:
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
return False
link_data = {
'href': normalized_href,
'text': element.get_text().strip(),
'title': element.get('title', '').strip()
}
# Check for duplicates and add to appropriate dictionary
is_external = is_external_url(normalized_href, url_base)
if is_external:
if normalized_href not in external_links_dict:
external_links_dict[normalized_href] = link_data
else:
if normalized_href not in internal_links_dict:
internal_links_dict[normalized_href] = link_data
keep_element = True
# Handle external link exclusions
if is_external:
if kwargs.get('exclude_external_links', False):
element.decompose()
return False
elif kwargs.get('exclude_social_media_links', False):
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
element.decompose()
return False
elif kwargs.get('exclude_domains', []):
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
except Exception as e:
raise Exception(f"Error processing links: {str(e)}")
if element.name == 'a' and element.get('href'):
href = element['href']
url_base = url.split('/')[2]
link_data = {'href': href, 'text': element.get_text()}
if href.startswith('http') and url_base not in href:
links['external'].append(link_data)
else:
links['internal'].append(link_data)
keep_element = True
try:
if element.name == 'img':
potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
src = element.get('src', '')
while not src and potential_sources:
src = element.get(potential_sources.pop(0), '')
if not src:
element.decompose()
return False
# If it is srcset pick up the first image
if 'srcset' in element.attrs:
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
# Check flag if we should remove external images
if kwargs.get('exclude_external_images', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if url_base not in src_url_base:
element.decompose()
return False
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
src_url_base = src.split('/')[2]
url_base = url.split('/')[2]
if any(domain in src for domain in exclude_social_media_domains):
element.decompose()
return False
# Handle exclude domains
if kwargs.get('exclude_domains', []):
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
element.decompose()
return False
return True # Always keep image elements
except Exception as e:
raise Exception(f"Error processing images: {str(e)}")
# Check if flag to remove all forms is set
if kwargs.get('remove_forms', False) and element.name == 'form':
element.decompose()
return False
if element.name in ['video', 'audio']:
elif element.name == 'img':
return True # Always keep image elements
elif element.name in ['video', 'audio']:
media[f"{element.name}s"].append({
'src': element.get('src'),
'alt': element.get('alt'),
@@ -322,15 +210,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
})
return True # Always keep video and audio elements
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
if kwargs.get('only_text', False):
element.replace_with(element.get_text())
try:
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
except Exception as e:
print('Error removing unwanted attributes:', str(e))
if element.name != 'pre':
if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
if kwargs.get('only_text', False):
element.replace_with(element.get_text())
else:
element.unwrap()
elif element.name != 'img':
element.attrs = {}
# Process children
for child in list(element.children):
@@ -364,15 +251,9 @@ class WebScrappingStrategy(ContentScrappingStrategy):
# ]
process_element(body)
# Update the links dictionary with unique links
links['internal'] = list(internal_links_dict.values())
links['external'] = list(external_links_dict.values())
# # Process images using ThreadPoolExecutor
imgs = body.find_all('img')
with ThreadPoolExecutor() as executor:
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
media['images'] = [result for result in image_results if result is not None]
@@ -392,42 +273,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
if base64_pattern.match(src):
# Replace base64 data with empty string
img['src'] = base64_pattern.sub('', src)
try:
str(body)
except Exception as e:
# Reset body to the original HTML
success = False
body = BeautifulSoup(html, 'html.parser')
# Create a new div with a special ID
error_div = body.new_tag('div', id='crawl4ai_error_message')
error_div.string = '''
Crawl4AI Error: This page is not fully supported.
Possible reasons:
1. The page may have restrictions that prevent crawling.
2. The page might not be fully loaded.
Suggestions:
- Try calling the crawl function with these parameters:
magic=True,
- Set headless=False to visualize what's happening on the page.
If the issue persists, please check the page's structure and any potential anti-crawling measures.
'''
# Append the error div to the body
body.body.append(error_div)
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
h = CustomHTML2Text()
h.ignore_links = True
h.body_width = 0
try:
h = CustomHTML2Text()
h.update_params(**kwargs.get('html2text', {}))
markdown = h.handle(cleaned_html)
except Exception as e:
markdown = h.handle(sanitize_html(cleaned_html))
@@ -438,18 +289,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
except Exception as e:
print('Error extracting metadata:', str(e))
meta = {}
cleaner = ContentCleaningStrategy()
fit_html = cleaner.clean(cleaned_html)
fit_markdown = h.handle(fit_html)
cleaned_html = sanitize_html(cleaned_html)
return {
'markdown': markdown,
'fit_markdown': fit_markdown,
'fit_html': fit_html,
'cleaned_html': cleaned_html,
'success': success,
'success': True,
'media': media,
'links': links,
'metadata': meta
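
One side of the link-handling hunk above normalizes each `href`, classifies it as internal or external, and keeps only the first occurrence of each URL. The sketch below illustrates that idea with the standard library only. The `normalize_url` and `is_external_url` helpers imported from `.utils` are not shown in this diff, so `collect_links` is a hypothetical stand-in that uses `urljoin` and a host comparison instead, and may differ from what those helpers actually do.

```python
from urllib.parse import urljoin, urlparse


def collect_links(page_url: str, hrefs: list) -> dict:
    """Resolve hrefs against page_url, split by host, and drop duplicates."""
    base_host = urlparse(page_url).netloc
    internal, external = {}, {}
    for href in hrefs:
        href = href.strip()
        if not href:  # skip empty hrefs, as the diff does
            continue
        absolute = urljoin(page_url, href)
        host = urlparse(absolute).netloc
        # Assumption: a link is external when it has a host that differs from the page's.
        bucket = external if host and host != base_host else internal
        # setdefault keeps the first entry seen for each URL.
        bucket.setdefault(absolute, {'href': absolute})
    return {
        'internal': list(internal.values()),
        'external': list(external.values()),
    }


links = collect_links(
    "https://example.com/docs/",
    ["/about", "/about", "https://other.org/x", " ", "page.html"],
)
```

Keying the dictionaries by the resolved URL means duplicates are dropped while the order of first appearance is preserved.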

View File

@@ -68,7 +68,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
"""
super().__init__()
self.provider = provider
self.api_token = api_token or PROVIDER_MODELS.get(provider, "no-token") or os.getenv("OPENAI_API_KEY")
self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
self.instruction = instruction
self.extract_type = extraction_type
self.schema = schema
@@ -80,7 +80,6 @@ class LLMExtractionStrategy(ExtractionStrategy):
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
self.apply_chunking = kwargs.get("apply_chunking", True)
self.base_url = kwargs.get("base_url", None)
self.api_base = kwargs.get("api_base", kwargs.get("base_url", None))
self.extra_args = kwargs.get("extra_args", {})
if not self.apply_chunking:
self.chunk_token_threshold = 1e9
@@ -117,7 +116,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
self.provider,
prompt_with_variables,
self.api_token,
base_url=self.api_base or self.base_url,
base_url=self.base_url,
extra_args = self.extra_args
) # , json_response=self.extract_type == "schema")
try:
@@ -235,12 +234,11 @@ class CosineStrategy(ExtractionStrategy):
"""
Initialize the strategy with clustering parameters.
Args:
semantic_filter (str): A keyword filter for document filtering.
word_count_threshold (int): Minimum number of words per cluster.
max_dist (float): The maximum cophenetic distance on the dendrogram to form clusters.
linkage_method (str): The linkage method for hierarchical clustering.
top_k (int): Number of top categories to extract.
:param semantic_filter: A keyword filter for document filtering.
:param word_count_threshold: Minimum number of words per cluster.
:param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
:param linkage_method: The linkage method for hierarchical clustering.
:param top_k: Number of top categories to extract.
"""
super().__init__()
@@ -259,8 +257,8 @@ class CosineStrategy(ExtractionStrategy):
self.get_embedding_method = "direct"
self.device = get_device()
# import torch
# self.device = torch.device('cpu')
import torch
self.device = torch.device('cpu')
self.default_batch_size = calculate_batch_size(self.device)
@@ -273,7 +271,7 @@ class CosineStrategy(ExtractionStrategy):
# self.get_embedding_method = "direct"
# else:
self.tokenizer, self.model = load_HF_embedding_model(model_name)
self.tokenizer, self.model = load_bge_small_en_v1_5()
self.model.to(self.device)
self.model.eval()
@@ -740,6 +738,7 @@ class JsonCssExtractionStrategy(ExtractionStrategy):
combined_html = self.DEL.join(sections)
return self.extract(url, combined_html, **kwargs)
class JsonXPATHExtractionStrategy(ExtractionStrategy):
def __init__(self, schema: Dict[str, Any], **kwargs):
super().__init__(**kwargs)
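
The two variants of the `api_token` line in the first hunk above differ only in the default passed to `PROVIDER_MODELS.get`, but that default decides whether the environment fallback can ever be reached. This is a minimal sketch of the `or` chain, with a hypothetical `resolve_token` helper and plain dicts standing in for `PROVIDER_MODELS` and the process environment. The provider name and token strings are made up.

```python
from typing import Mapping, Optional


def resolve_token(
    api_token: Optional[str],
    provider: str,
    provider_models: Mapping[str, str],
    env: Mapping[str, str],
    default: Optional[str],
) -> Optional[str]:
    """Mirror the `a or b or c` chain: the first truthy value wins."""
    return api_token or provider_models.get(provider, default) or env.get("OPENAI_API_KEY")


models = {"example/provider": "token-from-config"}  # hypothetical mapping
env = {"OPENAI_API_KEY": "token-from-env"}          # stand-in for os.environ

# Unknown provider with a truthy default: the env fallback is never consulted.
with_string_default = resolve_token(None, "unknown/provider", models, env, "no-token")
# Unknown provider with a None default: the chain falls through to the env value.
with_none_default = resolve_token(None, "unknown/provider", models, env, None)
```

Because `"no-token"` is a non-empty string, it short-circuits the chain. With `None` as the default, an unknown provider falls through to `OPENAI_API_KEY`.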

File diff suppressed because it is too large

View File

@@ -1,3 +0,0 @@
from .cli import main
main()

View File

@@ -1,2 +0,0 @@
class OutCallback:
def __call__(self, s: str) -> None: ...

View File

@@ -1,330 +0,0 @@
import argparse
import sys
from . import HTML2Text, __version__, config
def main() -> None:
baseurl = ""
class bcolors:
HEADER = "\033[95m"
OKBLUE = "\033[94m"
OKGREEN = "\033[92m"
WARNING = "\033[93m"
FAIL = "\033[91m"
ENDC = "\033[0m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
p = argparse.ArgumentParser()
p.add_argument(
"--default-image-alt",
dest="default_image_alt",
default=config.DEFAULT_IMAGE_ALT,
help="The default alt string for images with missing ones",
)
p.add_argument(
"--pad-tables",
dest="pad_tables",
action="store_true",
default=config.PAD_TABLES,
help="pad the cells to equal column width in tables",
)
p.add_argument(
"--no-wrap-links",
dest="wrap_links",
action="store_false",
default=config.WRAP_LINKS,
help="don't wrap links during conversion",
)
p.add_argument(
"--wrap-list-items",
dest="wrap_list_items",
action="store_true",
default=config.WRAP_LIST_ITEMS,
help="wrap list items during conversion",
)
p.add_argument(
"--wrap-tables",
dest="wrap_tables",
action="store_true",
default=config.WRAP_TABLES,
help="wrap tables",
)
p.add_argument(
"--ignore-emphasis",
dest="ignore_emphasis",
action="store_true",
default=config.IGNORE_EMPHASIS,
help="don't include any formatting for emphasis",
)
p.add_argument(
"--reference-links",
dest="inline_links",
action="store_false",
default=config.INLINE_LINKS,
help="use reference style links instead of inline links",
)
p.add_argument(
"--ignore-links",
dest="ignore_links",
action="store_true",
default=config.IGNORE_ANCHORS,
help="don't include any formatting for links",
)
p.add_argument(
"--ignore-mailto-links",
action="store_true",
dest="ignore_mailto_links",
default=config.IGNORE_MAILTO_LINKS,
help="don't include mailto: links",
)
p.add_argument(
"--protect-links",
dest="protect_links",
action="store_true",
default=config.PROTECT_LINKS,
help="protect links from line breaks by surrounding them with angle brackets",
)
p.add_argument(
"--ignore-images",
dest="ignore_images",
action="store_true",
default=config.IGNORE_IMAGES,
help="don't include any formatting for images",
)
p.add_argument(
"--images-as-html",
dest="images_as_html",
action="store_true",
default=config.IMAGES_AS_HTML,
help=(
"Always write image tags as raw html; preserves `height`, `width` and "
"`alt` if possible."
),
)
p.add_argument(
"--images-to-alt",
dest="images_to_alt",
action="store_true",
default=config.IMAGES_TO_ALT,
help="Discard image data, only keep alt text",
)
p.add_argument(
"--images-with-size",
dest="images_with_size",
action="store_true",
default=config.IMAGES_WITH_SIZE,
help=(
"Write image tags with height and width attrs as raw html to retain "
"dimensions"
),
)
p.add_argument(
"-g",
"--google-doc",
action="store_true",
dest="google_doc",
default=False,
help="convert an html-exported Google Document",
)
p.add_argument(
"-d",
"--dash-unordered-list",
action="store_true",
dest="ul_style_dash",
default=False,
help="use a dash rather than a star for unordered list items",
)
p.add_argument(
"-e",
"--asterisk-emphasis",
action="store_true",
dest="em_style_asterisk",
default=False,
help="use an asterisk rather than an underscore for emphasized text",
)
p.add_argument(
"-b",
"--body-width",
dest="body_width",
type=int,
default=config.BODY_WIDTH,
help="number of characters per output line, 0 for no wrap",
)
p.add_argument(
"-i",
"--google-list-indent",
dest="list_indent",
type=int,
default=config.GOOGLE_LIST_INDENT,
help="number of pixels Google indents nested lists",
)
p.add_argument(
"-s",
"--hide-strikethrough",
action="store_true",
dest="hide_strikethrough",
default=False,
help="hide strike-through text. only relevant when -g is " "specified as well",
)
p.add_argument(
"--escape-all",
action="store_true",
dest="escape_snob",
default=False,
help=(
"Escape all special characters. Output is less readable, but avoids "
"corner case formatting issues."
),
)
p.add_argument(
"--bypass-tables",
action="store_true",
dest="bypass_tables",
default=config.BYPASS_TABLES,
help="Format tables in HTML rather than Markdown syntax.",
)
p.add_argument(
"--ignore-tables",
action="store_true",
dest="ignore_tables",
default=config.IGNORE_TABLES,
help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.",
)
p.add_argument(
"--single-line-break",
action="store_true",
dest="single_line_break",
default=config.SINGLE_LINE_BREAK,
help=(
"Use a single line break after a block element rather than two line "
"breaks. NOTE: Requires --body-width=0"
),
)
p.add_argument(
"--unicode-snob",
action="store_true",
dest="unicode_snob",
default=config.UNICODE_SNOB,
help="Use unicode throughout document",
)
p.add_argument(
"--no-automatic-links",
action="store_false",
dest="use_automatic_links",
default=config.USE_AUTOMATIC_LINKS,
help="Do not use automatic links wherever applicable",
)
p.add_argument(
"--no-skip-internal-links",
action="store_false",
dest="skip_internal_links",
default=config.SKIP_INTERNAL_LINKS,
help="Do not skip internal links",
)
p.add_argument(
"--links-after-para",
action="store_true",
dest="links_each_paragraph",
default=config.LINKS_EACH_PARAGRAPH,
help="Put links after each paragraph instead of at the end of the document",
)
p.add_argument(
"--mark-code",
action="store_true",
dest="mark_code",
default=config.MARK_CODE,
help="Mark program code blocks with [code]...[/code]",
)
p.add_argument(
"--decode-errors",
dest="decode_errors",
default=config.DECODE_ERRORS,
help=(
"What to do in case of decode errors. 'ignore', 'strict' and 'replace' are "
"acceptable values"
),
)
p.add_argument(
"--open-quote",
dest="open_quote",
default=config.OPEN_QUOTE,
help="The character used to open quotes",
)
p.add_argument(
"--close-quote",
dest="close_quote",
default=config.CLOSE_QUOTE,
help="The character used to close quotes",
)
p.add_argument(
"--version", action="version", version=".".join(map(str, __version__))
)
p.add_argument("filename", nargs="?")
p.add_argument("encoding", nargs="?", default="utf-8")
p.add_argument(
"--include-sup-sub",
dest="include_sup_sub",
action="store_true",
default=config.INCLUDE_SUP_SUB,
help="Include the sup and sub tags",
)
args = p.parse_args()
if args.filename and args.filename != "-":
with open(args.filename, "rb") as fp:
data = fp.read()
else:
data = sys.stdin.buffer.read()
try:
html = data.decode(args.encoding, args.decode_errors)
except UnicodeDecodeError as err:
warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
warning += " Use the " + bcolors.OKGREEN
warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
print(warning)
raise err
h = HTML2Text(baseurl=baseurl)
# handle options
if args.ul_style_dash:
h.ul_item_mark = "-"
if args.em_style_asterisk:
h.emphasis_mark = "*"
h.strong_mark = "__"
h.body_width = args.body_width
h.google_list_indent = args.list_indent
h.ignore_emphasis = args.ignore_emphasis
h.ignore_links = args.ignore_links
h.ignore_mailto_links = args.ignore_mailto_links
h.protect_links = args.protect_links
h.ignore_images = args.ignore_images
h.images_as_html = args.images_as_html
h.images_to_alt = args.images_to_alt
h.images_with_size = args.images_with_size
h.google_doc = args.google_doc
h.hide_strikethrough = args.hide_strikethrough
h.escape_snob = args.escape_snob
h.bypass_tables = args.bypass_tables
h.ignore_tables = args.ignore_tables
h.single_line_break = args.single_line_break
h.inline_links = args.inline_links
h.unicode_snob = args.unicode_snob
h.use_automatic_links = args.use_automatic_links
h.skip_internal_links = args.skip_internal_links
h.links_each_paragraph = args.links_each_paragraph
h.mark_code = args.mark_code
h.wrap_links = args.wrap_links
h.wrap_list_items = args.wrap_list_items
h.wrap_tables = args.wrap_tables
h.pad_tables = args.pad_tables
h.default_image_alt = args.default_image_alt
h.open_quote = args.open_quote
h.close_quote = args.close_quote
h.include_sup_sub = args.include_sup_sub
sys.stdout.write(h.handle(html))
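
Several flags in the removed CLI above are negative switches: `--no-wrap-links` uses `action="store_false"` with a positive `dest`, so the attribute defaults to true and the flag turns it off. The sketch below isolates that pattern with two of the options. The `build_parser` function is a hypothetical wrapper, and the literal defaults `True` and `78` are copied from the `WRAP_LINKS` and `BODY_WIDTH` values in the config file in this same diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one negative switch and one integer option."""
    p = argparse.ArgumentParser()
    p.add_argument(
        "--no-wrap-links",
        dest="wrap_links",        # stored under the positive name
        action="store_false",     # passing the flag sets wrap_links to False
        default=True,
        help="don't wrap links during conversion",
    )
    p.add_argument(
        "-b",
        "--body-width",
        dest="body_width",
        type=int,
        default=78,
        help="number of characters per output line, 0 for no wrap",
    )
    return p


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--no-wrap-links", "-b", "0"])
```

The parsed namespace can then be copied onto the converter object attribute by attribute, as the removed `main()` does.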

View File

@@ -1,172 +0,0 @@
import re
# Use Unicode characters instead of their ascii pseudo-replacements
UNICODE_SNOB = False
# Marker to use for marking tables for padding post processing
TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
# Escape all special characters. Output is less readable, but avoids
# corner case formatting issues.
ESCAPE_SNOB = False
ESCAPE_BACKSLASH = False
ESCAPE_DOT = False
ESCAPE_PLUS = False
ESCAPE_DASH = False
# Put the links after each paragraph instead of at the end.
LINKS_EACH_PARAGRAPH = False
# Wrap long lines at position. 0 for no wrapping.
BODY_WIDTH = 78
# Don't show internal links (href="#local-anchor") -- corresponding link
# targets won't be visible in the plain text file anyway.
SKIP_INTERNAL_LINKS = True
# Use inline, rather than reference, formatting for images and links
INLINE_LINKS = True
# Protect links from line breaks surrounding them with angle brackets (in
# addition to their square brackets)
PROTECT_LINKS = False
# WRAP_LINKS = True
WRAP_LINKS = True
# Wrap list items.
WRAP_LIST_ITEMS = False
# Wrap tables
WRAP_TABLES = False
# Number of pixels Google indents nested lists
GOOGLE_LIST_INDENT = 36
# Values Google and others may use to indicate bold text
BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
IGNORE_ANCHORS = False
IGNORE_MAILTO_LINKS = False
IGNORE_IMAGES = False
IMAGES_AS_HTML = False
IMAGES_TO_ALT = False
IMAGES_WITH_SIZE = False
IGNORE_EMPHASIS = False
MARK_CODE = False
DECODE_ERRORS = "strict"
DEFAULT_IMAGE_ALT = ""
PAD_TABLES = False
# Convert links with same href and text to <href> format
# if they are absolute links
USE_AUTOMATIC_LINKS = True
# For checking space-only lines on line 771
RE_SPACE = re.compile(r"\s\+")
RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
# to find links in the text
RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
# to find table separators
RE_TABLE = re.compile(r" \| ")
RE_MD_DOT_MATCHER = re.compile(
r"""
^ # start of line
(\s*\d+) # optional whitespace and a number
(\.) # dot
(?=\s) # lookahead assert whitespace
""",
re.MULTILINE | re.VERBOSE,
)
RE_MD_PLUS_MATCHER = re.compile(
r"""
^
(\s*)
(\+)
(?=\s)
""",
flags=re.MULTILINE | re.VERBOSE,
)
RE_MD_DASH_MATCHER = re.compile(
r"""
^
(\s*)
(-)
(?=\s|\-) # followed by whitespace (bullet list, or spaced out hr)
# or another dash (header or hr)
""",
flags=re.MULTILINE | re.VERBOSE,
)
RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
RE_MD_BACKSLASH_MATCHER = re.compile(
r"""
(\\) # match one slash
(?=[%s]) # followed by a char that requires escaping
"""
% re.escape(RE_SLASH_CHARS),
flags=re.VERBOSE,
)
UNIFIABLE = {
"rsquo": "'",
"lsquo": "'",
"rdquo": '"',
"ldquo": '"',
"copy": "(C)",
"mdash": "--",
"nbsp": " ",
"rarr": "->",
"larr": "<-",
"middot": "*",
"ndash": "-",
"oelig": "oe",
"aelig": "ae",
"agrave": "a",
"aacute": "a",
"acirc": "a",
"atilde": "a",
"auml": "a",
"aring": "a",
"egrave": "e",
"eacute": "e",
"ecirc": "e",
"euml": "e",
"igrave": "i",
"iacute": "i",
"icirc": "i",
"iuml": "i",
"ograve": "o",
"oacute": "o",
"ocirc": "o",
"otilde": "o",
"ouml": "o",
"ugrave": "u",
"uacute": "u",
"ucirc": "u",
"uuml": "u",
"lrm": "",
"rlm": "",
}
# Format tables in HTML rather than Markdown syntax
BYPASS_TABLES = False
# Ignore table-related tags (table, th, td, tr) while keeping rows
IGNORE_TABLES = False
# Use a single line break after a block element rather than two line breaks.
# NOTE: Requires body width setting to be 0.
SINGLE_LINE_BREAK = False
# Use double quotation marks when converting the <q> tag.
OPEN_QUOTE = '"'
CLOSE_QUOTE = '"'
# Include the <sup> and <sub> tags
INCLUDE_SUP_SUB = False
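
As an illustration of how the escape patterns above behave, the sketch below applies `RE_MD_DOT_MATCHER` with the same substitution string that `escape_md_section` uses elsewhere in this diff. The `escape_ordered_list_dots` wrapper and the sample text are illustrative only.

```python
import re

# Same pattern as in the config above: a number at the start of a line,
# followed by a dot and then whitespace.
RE_MD_DOT_MATCHER = re.compile(
    r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """,
    re.MULTILINE | re.VERBOSE,
)


def escape_ordered_list_dots(text: str) -> str:
    """Insert a backslash before the dot so Markdown does not start a list."""
    return RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)


sample = "1. first\nv1.2 released\n  10. tenth"
escaped = escape_ordered_list_dots(sample)
```

Only lines that begin with a number are affected, so the version string in the second sample line is left untouched.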

View File

@@ -1,18 +0,0 @@
from typing import Dict, Optional
class AnchorElement:
__slots__ = ["attrs", "count", "outcount"]
def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
self.attrs = attrs
self.count = count
self.outcount = outcount
class ListElement:
__slots__ = ["name", "num"]
def __init__(self, name: str, num: int):
self.name = name
self.num = num

View File

@@ -1,303 +0,0 @@
import html.entities
from typing import Dict, List, Optional
from . import config
unifiable_n = {
html.entities.name2codepoint[k]: v
for k, v in config.UNIFIABLE.items()
if k != "nbsp"
}
def hn(tag: str) -> int:
if tag[0] == "h" and len(tag) == 2:
n = tag[1]
if "0" < n <= "9":
return int(n)
return 0
def dumb_property_dict(style: str) -> Dict[str, str]:
"""
:returns: A hash of css attributes
"""
return {
x.strip().lower(): y.strip().lower()
for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
}
def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
"""
:type data: str
:returns: A hash of css selectors, each of which contains a hash of
css attributes.
:rtype: dict
"""
# remove @import statements
data += ";"
importIndex = data.find("@import")
while importIndex != -1:
data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
importIndex = data.find("@import")
# parse the css. reverted from dictionary comprehension in order to
# support older pythons
pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
try:
elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
except ValueError:
elements = {} # not that important
return elements
def element_style(
attrs: Dict[str, Optional[str]],
style_def: Dict[str, Dict[str, str]],
parent_style: Dict[str, str],
) -> Dict[str, str]:
"""
:type attrs: dict
:type style_def: dict
:type parent_style: dict
:returns: A hash of the 'final' style attributes of the element
:rtype: dict
"""
style = parent_style.copy()
if "class" in attrs:
assert attrs["class"] is not None
for css_class in attrs["class"].split():
css_style = style_def.get("." + css_class, {})
style.update(css_style)
if "style" in attrs:
assert attrs["style"] is not None
immediate_style = dumb_property_dict(attrs["style"])
style.update(immediate_style)
return style
def google_list_style(style: Dict[str, str]) -> str:
"""
Finds out whether this is an ordered or unordered list
:type style: dict
:rtype: str
"""
if "list-style-type" in style:
list_style = style["list-style-type"]
if list_style in ["disc", "circle", "square", "none"]:
return "ul"
return "ol"
def google_has_height(style: Dict[str, str]) -> bool:
"""
Check if the style of the element has the 'height' attribute
explicitly defined
:type style: dict
:rtype: bool
"""
return "height" in style
def google_text_emphasis(style: Dict[str, str]) -> List[str]:
"""
:type style: dict
:returns: A list of all emphasis modifiers of the element
:rtype: list
"""
emphasis = []
if "text-decoration" in style:
emphasis.append(style["text-decoration"])
if "font-style" in style:
emphasis.append(style["font-style"])
if "font-weight" in style:
emphasis.append(style["font-weight"])
return emphasis
def google_fixed_width_font(style: Dict[str, str]) -> bool:
"""
Check if the css of the current element defines a fixed width font
:type style: dict
:rtype: bool
"""
font_family = ""
if "font-family" in style:
font_family = style["font-family"]
return "courier new" == font_family or "consolas" == font_family
def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
"""
Extract numbering from list element attributes
:type attrs: dict
:rtype: int
"""
if "start" in attrs:
assert attrs["start"] is not None
try:
return int(attrs["start"]) - 1
except ValueError:
pass
return 0
def skipwrap(
para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
) -> bool:
# If it appears to contain a link
# don't wrap
if not wrap_links and config.RE_LINK.search(para):
return True
# If the text begins with four spaces or one tab, it's a code block;
# don't wrap
if para[0:4] == "    " or para[0] == "\t":
return True
# If the text begins with only two "--", possibly preceded by
# whitespace, that's an emdash; so wrap.
stripped = para.lstrip()
if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
return False
# I'm not sure what this is for; I thought it was to detect lists,
# but there's a <br>-inside-<span> case in one of the tests that
# also depends upon it.
if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
return not wrap_list_items
# If text contains a pipe character it is likely a table
if not wrap_tables and config.RE_TABLE.search(para):
return True
# If the text begins with a single -, *, or +, followed by a space,
# or an integer, followed by a ., followed by a space (in either
# case optionally preceded by whitespace), it's a list; don't wrap.
return bool(
config.RE_ORDERED_LIST_MATCHER.match(stripped)
or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
)
def escape_md(text: str) -> str:
"""
Escapes markdown-sensitive characters within other markdown
constructs.
"""
return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
def escape_md_section(
text: str,
escape_backslash: bool = True,
snob: bool = False,
escape_dot: bool = True,
escape_plus: bool = True,
escape_dash: bool = True
) -> str:
"""
Escapes markdown-sensitive characters across whole document sections.
Each escaping operation can be controlled individually.
"""
if escape_backslash:
text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
if snob:
text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
if escape_dot:
text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
if escape_plus:
text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
if escape_dash:
text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
return text
def reformat_table(lines: List[str], right_margin: int) -> List[str]:
"""
Given the lines of a table
pads the cells and returns the new lines
"""
# find the maximum width of the columns
max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
max_cols = len(max_width)
for line in lines:
cols = [x.rstrip() for x in line.split("|")]
num_cols = len(cols)
# don't drop any data if colspan attributes result in unequal lengths
if num_cols < max_cols:
cols += [""] * (max_cols - num_cols)
elif max_cols < num_cols:
max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
max_cols = num_cols
max_width = [
max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
]
# reformat
new_lines = []
for line in lines:
cols = [x.rstrip() for x in line.split("|")]
if set(line.strip()) == set("-|"):
filler = "-"
new_cols = [
x.rstrip() + (filler * (M - len(x.rstrip())))
for x, M in zip(cols, max_width)
]
new_lines.append("|-" + "|".join(new_cols) + "|")
else:
filler = " "
new_cols = [
x.rstrip() + (filler * (M - len(x.rstrip())))
for x, M in zip(cols, max_width)
]
new_lines.append("| " + "|".join(new_cols) + "|")
return new_lines
def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
"""
Provide padding for tables in the text
"""
lines = text.split("\n")
table_buffer = [] # type: List[str]
table_started = False
new_lines = []
for line in lines:
# Toggle table started
if config.TABLE_MARKER_FOR_PAD in line:
table_started = not table_started
if not table_started:
table = reformat_table(table_buffer, right_margin)
new_lines.extend(table)
table_buffer = []
new_lines.append("")
continue
# Process lines
if table_started:
table_buffer.append(line)
else:
new_lines.append(line)
return "\n".join(new_lines)
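
The padding idea behind `reformat_table` can be sketched in a few lines. This is a simplified, self-contained variant, not the code from the diff: the name `pad_table` is hypothetical, and it computes column widths in one pass over all rows instead of growing `max_width` row by row. It assumes a non-empty list of lines.

```python
from typing import List


def pad_table(lines: List[str], right_margin: int = 1) -> List[str]:
    """Pad every cell so that the pipes of a Markdown table line up."""
    rows = [[cell.rstrip() for cell in line.split("|")] for line in lines]
    n_cols = max(len(row) for row in rows)
    # Short rows (e.g. caused by colspan) get empty cells so no data is dropped.
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in rows) + right_margin for i in range(n_cols)]

    padded = []
    for line, row in zip(lines, rows):
        # A line made only of "-" and "|" is the header separator.
        is_separator = set(line.strip()) == set("-|")
        filler, prefix = ("-", "|-") if is_separator else (" ", "| ")
        cells = [cell + filler * (width - len(cell)) for cell, width in zip(row, widths)]
        padded.append(prefix + "|".join(cells) + "|")
    return padded


table = pad_table(["a|bb", "-|-", "ccc|d"])
print("\n".join(table))
```

Separator rows are filled with dashes and all other rows with spaces, which is the same distinction the original function makes.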

View File

@@ -72,18 +72,10 @@ def load_bert_base_uncased():
return tokenizer, model
@lru_cache()
def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
"""Load the Hugging Face model for embedding.
Args:
model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
Returns:
tuple: The tokenizer and model.
"""
def load_bge_small_en_v1_5():
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
model = AutoModel.from_pretrained(model_name, resume_download=None)
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
model.eval()
model, device = set_model_device(model)
return tokenizer, model

View File

@@ -14,8 +14,6 @@ class CrawlResult(BaseModel):
links: Dict[str, List[Dict]] = {}
screenshot: Optional[str] = None
markdown: Optional[str] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None

View File

@@ -0,0 +1,3 @@
from .async_web_scraper import AsyncWebScraper
from .bfs_scraper_strategy import BFSScraperStrategy
from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter

View File

@@ -0,0 +1,123 @@
from typing import Union, AsyncGenerator, Optional
from .scraper_strategy import ScraperStrategy
from .models import ScraperResult, CrawlResult
from ..async_webcrawler import AsyncWebCrawler
import logging
from dataclasses import dataclass
from contextlib import asynccontextmanager
@dataclass
class ScrapingProgress:
"""Tracks the progress of a scraping operation."""
processed_urls: int = 0
failed_urls: int = 0
current_url: Optional[str] = None
class AsyncWebScraper:
"""
A high-level web scraper that combines an async crawler with a scraping strategy.
Args:
crawler (AsyncWebCrawler): The async web crawler implementation
strategy (ScraperStrategy): The scraping strategy to use
logger (Optional[logging.Logger]): Custom logger for the scraper
"""
def __init__(
self,
crawler: AsyncWebCrawler,
strategy: ScraperStrategy,
logger: Optional[logging.Logger] = None
):
if not isinstance(crawler, AsyncWebCrawler):
raise TypeError("crawler must be an instance of AsyncWebCrawler")
if not isinstance(strategy, ScraperStrategy):
raise TypeError("strategy must be an instance of ScraperStrategy")
self.crawler = crawler
self.strategy = strategy
self.logger = logger or logging.getLogger(__name__)
self._progress = ScrapingProgress()
@property
def progress(self) -> ScrapingProgress:
"""Get current scraping progress."""
return self._progress
@asynccontextmanager
async def _error_handling_context(self, url: str):
"""Context manager for handling errors during scraping."""
try:
yield
except Exception as e:
self.logger.error(f"Error scraping {url}: {str(e)}")
self._progress.failed_urls += 1
raise
async def ascrape(
self,
url: str,
parallel_processing: bool = True,
stream: bool = False
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""
Scrape a website starting from the given URL.
Args:
url: Starting URL for scraping
parallel_processing: Whether to process URLs in parallel
stream: If True, yield results as they come; if False, collect all results
Returns:
Either an async generator yielding CrawlResults or a final ScraperResult
"""
self._progress = ScrapingProgress() # Reset progress
async with self._error_handling_context(url):
if stream:
return self._ascrape_yielding(url, parallel_processing)
return await self._ascrape_collecting(url, parallel_processing)
async def _ascrape_yielding(
self,
url: str,
parallel_processing: bool
) -> AsyncGenerator[CrawlResult, None]:
"""Stream scraping results as they become available."""
try:
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
async for res in result_generator:
self._progress.processed_urls += 1
self._progress.current_url = res.url
yield res
except Exception as e:
self.logger.error(f"Error in streaming scrape: {str(e)}")
raise
async def _ascrape_collecting(
self,
url: str,
parallel_processing: bool
) -> ScraperResult:
"""Collect all scraping results before returning."""
extracted_data = {}
try:
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
async for res in result_generator:
self._progress.processed_urls += 1
self._progress.current_url = res.url
extracted_data[res.url] = res
return ScraperResult(
url=url,
crawled_urls=list(extracted_data.keys()),
extracted_data=extracted_data,
stats={
'processed_urls': self._progress.processed_urls,
'failed_urls': self._progress.failed_urls
}
)
except Exception as e:
self.logger.error(f"Error in collecting scrape: {str(e)}")
raise
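
The two return shapes of `AsyncWebScraper.ascrape` can be illustrated without the crawler. The sketch below is self-contained and uses hypothetical stand-ins (`fake_strategy`, `scrape`) rather than the real classes. With `stream=True` the caller receives an async generator and iterates it; with `stream=False` the caller receives the collected results.

```python
import asyncio
from typing import AsyncGenerator, Dict, List


async def fake_strategy(start_url: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for strategy.ascrape(): yields one result per page."""
    for path in ("", "/a", "/b"):
        await asyncio.sleep(0)  # pretend to do network I/O
        yield start_url + path


async def scrape(start_url: str, stream: bool = False):
    """Return an async generator when streaming, otherwise a collected dict."""
    if stream:
        return fake_strategy(start_url)  # not iterated here: the caller does that
    collected: Dict[str, str] = {}
    async for url in fake_strategy(start_url):
        collected[url] = url
    return collected


async def demo() -> tuple:
    streamed: List[str] = []
    # Note the `await` before iterating: scrape() is itself a coroutine.
    async for url in await scrape("https://example.com", stream=True):
        streamed.append(url)
    collected = await scrape("https://example.com", stream=False)
    return streamed, collected


streamed, collected = asyncio.run(demo())
```

One consequence of this shape, which appears to apply to the diff as well: in streaming mode the generator is returned before any page is fetched, so an `async with` block wrapped around the `return` has already exited by the time iteration starts and cannot observe errors raised while iterating.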

View File

@@ -0,0 +1,327 @@
from abc import ABC, abstractmethod
from typing import Union, AsyncGenerator, Optional, Dict, Set
from dataclasses import dataclass
from datetime import datetime
import asyncio
import logging
from urllib.parse import urljoin, urlparse, urlunparse
from urllib.robotparser import RobotFileParser
import validators
import time
from aiolimiter import AsyncLimiter
from tenacity import retry, stop_after_attempt, wait_exponential
from collections import defaultdict
from .models import ScraperResult, CrawlResult
from .filters import FilterChain
from .scorers import URLScorer
from ..async_webcrawler import AsyncWebCrawler
@dataclass
class CrawlStats:
"""Statistics for the crawling process"""
start_time: datetime
urls_processed: int = 0
urls_failed: int = 0
urls_skipped: int = 0
total_depth_reached: int = 0
current_depth: int = 0
robots_blocked: int = 0
class ScraperStrategy(ABC):
"""Base class for scraping strategies"""
@abstractmethod
async def ascrape(
self,
url: str,
crawler: AsyncWebCrawler,
parallel_processing: bool = True,
stream: bool = False
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""Abstract method for scraping implementation"""
pass
@abstractmethod
async def can_process_url(self, url: str) -> bool:
"""Check if URL can be processed based on strategy rules"""
pass
@abstractmethod
async def shutdown(self):
"""Clean up resources used by the strategy"""
pass
class BFSScraperStrategy(ScraperStrategy):
"""Breadth-First Search scraping strategy with politeness controls"""
def __init__(
self,
max_depth: int,
filter_chain: FilterChain,
url_scorer: URLScorer,
max_concurrent: int = 5,
min_crawl_delay: int = 1,
timeout: int = 30,
logger: Optional[logging.Logger] = None
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.max_concurrent = max_concurrent
self.min_crawl_delay = min_crawl_delay
self.timeout = timeout
self.logger = logger or logging.getLogger(__name__)
# Crawl control
self.stats = CrawlStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self.process_external_links = False
# Rate limiting and politeness
self.rate_limiter = AsyncLimiter(1, 1)
self.last_crawl_time = defaultdict(float)
self.robot_parsers: Dict[str, RobotFileParser] = {}
self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
async def can_process_url(self, url: str) -> bool:
"""Check if URL can be processed based on robots.txt and filters
This is our gatekeeper method that determines if a URL should be processed. It:
- Validates URL format using the validators library
- Checks robots.txt permissions for the domain
- Applies custom filters from the filter chain
- Updates statistics for blocked URLs
- Returns False early if any check fails
"""
if not validators.url(url):
self.logger.warning(f"Invalid URL: {url}")
return False
robot_parser = await self._get_robot_parser(url)
if robot_parser and not robot_parser.can_fetch("*", url):
self.stats.robots_blocked += 1
self.logger.info(f"Blocked by robots.txt: {url}")
return False
return self.filter_chain.apply(url)
async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
"""Get or create robots.txt parser for domain.
This is our robots.txt manager that:
- Uses domain-level caching of robot parsers
- Creates and caches new parsers as needed
- Handles failed robots.txt fetches gracefully
- Returns None if robots.txt can't be fetched, allowing crawling to proceed
"""
domain = urlparse(url).netloc
if domain not in self.robot_parsers:
parser = RobotFileParser()
try:
robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
parser.set_url(robots_url)
parser.read()
self.robot_parsers[domain] = parser
except Exception as e:
self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
return None
return self.robot_parsers[domain]
@retry(stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10))
async def _crawl_with_retry(
self,
crawler: AsyncWebCrawler,
url: str
) -> CrawlResult:
"""Crawl URL with retry logic"""
try:
async with asyncio.timeout(self.timeout):
return await crawler.arun(url)
except asyncio.TimeoutError:
self.logger.error(f"Timeout crawling {url}")
raise
async def process_url(
self,
url: str,
depth: int,
crawler: AsyncWebCrawler,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int]
) -> Optional[CrawlResult]:
"""Process a single URL and extract links.
This is our main URL processing workhorse that:
- Checks for cancellation
- Validates URLs through can_process_url
- Implements politeness delays per domain
- Applies rate limiting
- Handles crawling with retries
- Updates various statistics
- Processes extracted links
- Returns the crawl result or None on failure
"""
if self._cancel_event.is_set():
return None
if not await self.can_process_url(url):
self.stats.urls_skipped += 1
return None
# Politeness delay
domain = urlparse(url).netloc
time_since_last = time.time() - self.last_crawl_time[domain]
if time_since_last < self.min_crawl_delay:
await asyncio.sleep(self.min_crawl_delay - time_since_last)
self.last_crawl_time[domain] = time.time()
# Crawl with rate limiting
try:
async with self.rate_limiter:
result = await self._crawl_with_retry(crawler, url)
self.stats.urls_processed += 1
except Exception as e:
self.logger.error(f"Error crawling {url}: {e}")
self.stats.urls_failed += 1
return None
# Process links
await self._process_links(result, url, depth, queue, visited, depths)
return result
async def _process_links(
self,
result: CrawlResult,
source_url: str,
depth: int,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int]
):
"""Process extracted links from crawl result.
This is our link processor that:
- Handles both internal and external links
- Normalizes URLs (removes fragments)
- Checks depth limits
- Scores URLs for priority
- Updates depth tracking
- Adds valid URLs to the queue
- Updates maximum depth statistics
"""
# Copy the list so that appending external links does not mutate the crawl result
links_to_process = list(result.links["internal"])
if self.process_external_links:
links_to_process += result.links["external"]
for link in links_to_process:
url = link['href']
# url = urljoin(source_url, link['href'])
# url = urlunparse(urlparse(url)._replace(fragment=""))
if url not in visited and await self.can_process_url(url):
new_depth = depths[source_url] + 1
if new_depth <= self.max_depth:
score = self.url_scorer.score(url)
await queue.put((score, new_depth, url))
depths[url] = new_depth
self.stats.total_depth_reached = max(
self.stats.total_depth_reached,
new_depth
)
async def ascrape(
self,
start_url: str,
crawler: AsyncWebCrawler,
parallel_processing: bool = True
) -> AsyncGenerator[CrawlResult, None]:
"""Implement BFS crawling strategy"""
# Initialize crawl state
"""
queue: A priority queue where items are tuples of (score, depth, url)
Score: Determines crawling priority (lower = higher priority)
Depth: Current distance from start_url
URL: The actual URL to crawl
visited: Keeps track of URLs we've already seen to avoid cycles
depths: Maps URLs to their depths from the start URL
pending_tasks: Tracks currently running crawl tasks
"""
queue = asyncio.PriorityQueue()
await queue.put((0, 0, start_url))
visited: Set[str] = set()
depths = {start_url: 0}
pending_tasks = set()
try:
while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
"""
This sets up our main control loop which:
- Continues while there are URLs to process (not queue.empty())
- Or while there are tasks still running (pending_tasks)
- Can be interrupted via cancellation (not self._cancel_event.is_set())
"""
# Start new tasks up to max_concurrent
while not queue.empty() and len(pending_tasks) < self.max_concurrent:
"""
This section manages task creation:
Checks if we can start more tasks (under max_concurrent limit)
Gets the next URL from the priority queue
Marks URLs as visited immediately to prevent duplicates
Updates current depth in stats
Either:
Creates a new async task (parallel mode)
Processes URL directly (sequential mode)
"""
_, depth, url = await queue.get()
if url not in visited:
visited.add(url)
self.stats.current_depth = depth
if parallel_processing:
task = asyncio.create_task(
self.process_url(url, depth, crawler, queue, visited, depths)
)
pending_tasks.add(task)
else:
result = await self.process_url(
url, depth, crawler, queue, visited, depths
)
if result:
yield result
# Process completed tasks
"""
This section manages completed tasks:
Waits for any task to complete using asyncio.wait
Uses FIRST_COMPLETED to handle results as soon as they're ready
Yields successful results to the caller
Updates pending_tasks to remove completed ones
"""
if pending_tasks:
done, pending_tasks = await asyncio.wait(
pending_tasks,
return_when=asyncio.FIRST_COMPLETED
)
for task in done:
result = await task
if result:
yield result
except Exception as e:
self.logger.error(f"Error in crawl process: {e}")
raise
finally:
# Clean up any remaining tasks
for task in pending_tasks:
task.cancel()
self.stats.end_time = datetime.now()
async def shutdown(self):
"""Clean up resources and stop crawling"""
self._cancel_event.set()
# Clear caches and close connections
self.robot_parsers.clear()
self.domain_queues.clear()
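
The queue discipline used by `BFSScraperStrategy.ascrape` can be tried in isolation. The sketch below is a minimal, sequential version under stated assumptions: `LINKS` is a made-up in-memory site, `score` is a toy scorer, and there is no rate limiting, robots.txt handling, or concurrency. It keeps the `(score, depth, url)` tuples and the `max_depth` check from the diff.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory "site": page -> outgoing links.
LINKS: Dict[str, List[str]] = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/post-1", "/"],
    "/about": [],
    "/blog/post-1": ["/blog/post-1/comments"],
    "/blog/post-1/comments": [],
}


def score(url: str) -> float:
    """Toy scorer: blog pages get the lowest value, so they are popped first."""
    return 0.0 if "blog" in url else 1.0


async def crawl(start: str, max_depth: int) -> List[str]:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    await queue.put((0.0, 0, start))
    visited = set()
    order = []
    while not queue.empty():
        _, depth, url = await queue.get()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in LINKS[url]:
            if link not in visited and depth + 1 <= max_depth:
                await queue.put((score(link), depth + 1, link))
    return order


order = asyncio.run(crawl("/", max_depth=2))
print(order)
```

Because the score is the first tuple element, it dominates depth: `/blog/post-1` (depth 2) is visited before `/about` (depth 1), so the traversal is best-first rather than strictly breadth-first. Since `asyncio.PriorityQueue` pops the lowest value first, scorers that return higher values for more relevant URLs may need their scores negated before queueing; the diff does not say which convention is intended.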

205
crawl4ai/scraper/filters.py Normal file
View File

@@ -0,0 +1,205 @@
# from .url_filter import URLFilter, FilterChain
# from .content_type_filter import ContentTypeFilter
# from .url_pattern_filter import URLPatternFilter
from abc import ABC, abstractmethod
from typing import List, Pattern, Set, Union
import re
from urllib.parse import urlparse
import mimetypes
import logging
from dataclasses import dataclass
import fnmatch
@dataclass
class FilterStats:
"""Statistics for filter applications"""
total_urls: int = 0
rejected_urls: int = 0
passed_urls: int = 0
class URLFilter(ABC):
"""Base class for URL filters"""
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FilterStats()
self.logger = logging.getLogger(f"urlfilter.{self.name}")
@abstractmethod
def apply(self, url: str) -> bool:
"""Apply the filter to a URL"""
pass
def _update_stats(self, passed: bool):
"""Update filter statistics"""
self.stats.total_urls += 1
if passed:
self.stats.passed_urls += 1
else:
self.stats.rejected_urls += 1
class FilterChain:
"""Chain of URL filters."""
def __init__(self, filters: List[URLFilter] = None):
self.filters = filters or []
self.stats = FilterStats()
self.logger = logging.getLogger("urlfilter.chain")
def add_filter(self, filter_: URLFilter) -> 'FilterChain':
"""Add a filter to the chain"""
self.filters.append(filter_)
return self # Enable method chaining
def apply(self, url: str) -> bool:
"""Apply all filters in the chain"""
self.stats.total_urls += 1
for filter_ in self.filters:
if not filter_.apply(url):
self.stats.rejected_urls += 1
self.logger.debug(f"URL {url} rejected by {filter_.name}")
return False
self.stats.passed_urls += 1
return True
class URLPatternFilter(URLFilter):
"""Filter URLs based on glob patterns or regex.
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
"""
def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]],
use_glob: bool = True):
super().__init__()
self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self.use_glob = use_glob
self._compiled_patterns = []
for pattern in self.patterns:
if isinstance(pattern, str) and use_glob:
self._compiled_patterns.append(self._glob_to_regex(pattern))
else:
self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
def _glob_to_regex(self, pattern: str) -> Pattern:
"""Convert glob pattern to regex"""
return re.compile(fnmatch.translate(pattern))
def apply(self, url: str) -> bool:
"""Check if URL matches any of the patterns"""
matches = any(pattern.search(url) for pattern in self._compiled_patterns)
self._update_stats(matches)
return matches
class ContentTypeFilter(URLFilter):
"""Filter URLs based on expected content type.
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
- Filter by MIME types
- Extension checking
- Support for multiple content types
"""
def __init__(self, allowed_types: Union[str, List[str]],
check_extension: bool = True):
super().__init__()
self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
self.check_extension = check_extension
self._normalize_types()
def _normalize_types(self):
"""Normalize content type strings"""
self.allowed_types = [t.lower() for t in self.allowed_types]
def _check_extension(self, url: str) -> bool:
"""Check URL's file extension"""
ext = urlparse(url).path.split('.')[-1].lower() if '.' in urlparse(url).path else ''
if not ext:
return True # No extension, might be dynamic content
guessed_type = mimetypes.guess_type(url)[0]
return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)
def apply(self, url: str) -> bool:
"""Check if URL's content type is allowed"""
result = True
if self.check_extension:
result = self._check_extension(url)
self._update_stats(result)
return result
class DomainFilter(URLFilter):
"""Filter URLs based on allowed/blocked domains.
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
"""
def __init__(self, allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None):
super().__init__()
self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
"""Normalize domain strings"""
if isinstance(domains, str):
domains = [domains]
return [d.lower().strip() for d in domains]
def _extract_domain(self, url: str) -> str:
"""Extract domain from URL"""
return urlparse(url).netloc.lower()
def apply(self, url: str) -> bool:
"""Check if URL's domain is allowed"""
domain = self._extract_domain(url)
if domain in self.blocked_domains:
self._update_stats(False)
return False
if self.allowed_domains is not None and domain not in self.allowed_domains:
self._update_stats(False)
return False
self._update_stats(True)
return True
# Example usage:
def create_common_filter_chain() -> FilterChain:
"""Create a commonly used filter chain"""
return FilterChain([
URLPatternFilter([
"*.html", "*.htm", # HTML files
"*/article/*", "*/blog/*" # Common content paths
]),
ContentTypeFilter([
"text/html",
"application/xhtml+xml"
]),
DomainFilter(
blocked_domains=["ads.*", "analytics.*"]
)
])
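
Two matching details in the filters above are easy to check by hand. The sketch below is self-contained. `glob_matches` mirrors what `URLPatternFilter._glob_to_regex` does with `fnmatch.translate`, while `domain_blocked` is a hypothetical glob-aware lookup that is not part of the diff.

```python
import fnmatch
import re
from typing import List
from urllib.parse import urlparse


def glob_matches(pattern: str, url: str) -> bool:
    """Translate a glob to a regex and search the URL with it."""
    return re.compile(fnmatch.translate(pattern)).search(url) is not None


def domain_blocked(url: str, blocked: List[str]) -> bool:
    """Hypothetical variant: treat blocked entries as globs, not exact names."""
    domain = urlparse(url).netloc.lower()
    return any(fnmatch.fnmatchcase(domain, pattern.lower()) for pattern in blocked)


# fnmatch.translate anchors the pattern at the end of the string,
# so a query string after ".html" prevents a match.
html_ok = glob_matches("*.html", "https://example.com/index.html")
query_ok = glob_matches("*.html", "https://example.com/index.html?page=2")

# Exact set membership, as DomainFilter.apply uses, never matches a wildcard entry.
exact_hit = "ads.example.com" in {"ads.*", "analytics.*"}
glob_hit = domain_blocked("https://ads.example.com/banner", ["ads.*", "analytics.*"])

print(html_ok, query_ok, exact_hit, glob_hit)
```

This suggests that the `blocked_domains=["ads.*", "analytics.*"]` entries in `create_common_filter_chain` only take effect if `DomainFilter` interprets them as patterns. As written in the diff it compares domains for exact equality.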

View File

@@ -0,0 +1,8 @@
from pydantic import BaseModel
from typing import List, Dict
from ..models import CrawlResult
class ScraperResult(BaseModel):
url: str
crawled_urls: List[str]
extracted_data: Dict[str, CrawlResult]
stats: Dict[str, int] = {}  # filled in by AsyncWebScraper._ascrape_collecting

268
crawl4ai/scraper/scorers.py Normal file
View File

@@ -0,0 +1,268 @@
# from .url_scorer import URLScorer
# from .keyword_relevance_scorer import KeywordRelevanceScorer
from abc import ABC, abstractmethod
from typing import List, Dict, Optional, Union
from dataclasses import dataclass
from urllib.parse import urlparse, unquote
import re
from collections import defaultdict
import math
import logging
@dataclass
class ScoringStats:
"""Statistics for URL scoring"""
urls_scored: int = 0
total_score: float = 0.0
min_score: float = float('inf')
max_score: float = float('-inf')
def update(self, score: float):
"""Update scoring statistics"""
self.urls_scored += 1
self.total_score += score
self.min_score = min(self.min_score, score)
self.max_score = max(self.max_score, score)
@property
def average_score(self) -> float:
"""Calculate average score"""
return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
class URLScorer(ABC):
"""Base class for URL scoring strategies"""
def __init__(self, weight: float = 1.0, name: str = None):
self.weight = weight
self.name = name or self.__class__.__name__
self.stats = ScoringStats()
self.logger = logging.getLogger(f"urlscorer.{self.name}")
@abstractmethod
def _calculate_score(self, url: str) -> float:
"""Calculate the raw score for a URL"""
pass
def score(self, url: str) -> float:
"""Calculate the weighted score for a URL"""
raw_score = self._calculate_score(url)
weighted_score = raw_score * self.weight
self.stats.update(weighted_score)
return weighted_score
class CompositeScorer(URLScorer):
"""Combines multiple scorers with weights"""
def __init__(self, scorers: List[URLScorer], normalize: bool = True):
super().__init__(name="CompositeScorer")
self.scorers = scorers
self.normalize = normalize
def _calculate_score(self, url: str) -> float:
scores = [scorer.score(url) for scorer in self.scorers]
total_score = sum(scores)
if self.normalize and scores:
total_score /= len(scores)
return total_score
class KeywordRelevanceScorer(URLScorer):
"""Score URLs based on keyword relevance.
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
"""
def __init__(self, keywords: List[str], weight: float = 1.0,
case_sensitive: bool = False):
super().__init__(weight=weight)
self.keywords = keywords
self.case_sensitive = case_sensitive
self._compile_keywords()
def _compile_keywords(self):
"""Prepare keywords for matching"""
flags = 0 if self.case_sensitive else re.IGNORECASE
self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
def _calculate_score(self, url: str) -> float:
"""Calculate score based on keyword matches"""
decoded_url = unquote(url)
total_matches = sum(
1 for pattern in self.patterns
if pattern.search(decoded_url)
)
# Normalize score between 0 and 1
return total_matches / len(self.patterns) if self.patterns else 0.0
class PathDepthScorer(URLScorer):
"""Score URLs based on their path depth.
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
"""
def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
super().__init__(weight=weight)
self.optimal_depth = optimal_depth
def _calculate_score(self, url: str) -> float:
"""Calculate score based on path depth"""
path = urlparse(url).path
depth = len([x for x in path.split('/') if x])
# Score decreases as we move away from optimal depth
distance_from_optimal = abs(depth - self.optimal_depth)
return 1.0 / (1.0 + distance_from_optimal)
class ContentTypeScorer(URLScorer):
"""Score URLs based on content type preferences.
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
- Score based on file types
- Configurable type weights
- Pattern matching support
"""
def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
super().__init__(weight=weight)
self.type_weights = type_weights
self._compile_patterns()
def _compile_patterns(self):
"""Prepare content type patterns"""
self.patterns = {
re.compile(pattern): weight
for pattern, weight in self.type_weights.items()
}
def _calculate_score(self, url: str) -> float:
"""Calculate score based on content type matching"""
for pattern, weight in self.patterns.items():
if pattern.search(url):
return weight
return 0.0
class FreshnessScorer(URLScorer):
"""Score URLs based on freshness indicators.
freshness_scorer = FreshnessScorer(weight=0.9)
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting"""
def __init__(self, weight: float = 1.0):
super().__init__(weight=weight)
self.date_patterns = [
r'/(\d{4})/(\d{2})/(\d{2})/', # yyyy/mm/dd
r'(\d{4})[-_](\d{2})[-_](\d{2})', # yyyy-mm-dd
r'/(\d{4})/', # year only
]
self._compile_patterns()
def _compile_patterns(self):
"""Prepare date patterns"""
self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
def _calculate_score(self, url: str) -> float:
"""Calculate score based on date indicators"""
for pattern in self.compiled_patterns:
if match := pattern.search(url):
year = int(match.group(1))
# Score higher for more recent years
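# Clamp to [0, 1] so that very old or future years cannot produce out-of-range scores
return max(0.0, min(1.0, 1.0 - (2024 - year) * 0.1))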
return 0.5 # Default score for URLs without dates
class DomainAuthorityScorer(URLScorer):
"""Score URLs based on domain authority.
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains"""
def __init__(self, domain_weights: Dict[str, float],
default_weight: float = 0.5, weight: float = 1.0):
super().__init__(weight=weight)
self.domain_weights = domain_weights
self.default_weight = default_weight
def _calculate_score(self, url: str) -> float:
"""Calculate score based on domain authority"""
domain = urlparse(url).netloc.lower()
return self.domain_weights.get(domain, self.default_weight)
def create_balanced_scorer() -> CompositeScorer:
"""Create a balanced composite scorer"""
return CompositeScorer([
KeywordRelevanceScorer(
keywords=["article", "blog", "news", "research"],
weight=1.0
),
PathDepthScorer(
optimal_depth=3,
weight=0.7
),
ContentTypeScorer(
type_weights={
r'\.html?$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
},
weight=0.8
),
FreshnessScorer(
weight=0.9
)
])
# Example Usage:
"""
# Create a composite scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(["python", "programming"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8),
DomainAuthorityScorer(
domain_weights={
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
},
weight=0.9
)
])
# Score a URL
score = scorer.score("https://python.org/article/2024/01/new-features")
# Access statistics
print(f"Average score: {scorer.stats.average_score}")
print(f"URLs scored: {scorer.stats.urls_scored}")
"""

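The composite score for the URL in the usage example at the end of `scorers.py` can be worked out by hand. The sketch below re-implements the four formulas as plain functions. These helper names are illustrative, not the classes from the diff, and `freshness_score` only checks the year-only pattern, which is the one that matches this URL.

```python
import re
from urllib.parse import urlparse

URL = "https://python.org/article/2024/01/new-features"


def keyword_score(url, keywords):
    """Fraction of keywords found in the URL (case-insensitive)."""
    hits = sum(1 for k in keywords if re.search(re.escape(k), url, re.IGNORECASE))
    return hits / len(keywords) if keywords else 0.0


def path_depth_score(url, optimal_depth):
    """1 / (1 + distance between the path depth and the optimal depth)."""
    depth = len([part for part in urlparse(url).path.split("/") if part])
    return 1.0 / (1.0 + abs(depth - optimal_depth))


def freshness_score(url, reference_year=2024):
    """Lose 0.1 per year of age; 0.5 when the URL carries no year."""
    match = re.search(r"/(\d{4})/", url)
    return 1.0 - (reference_year - int(match.group(1))) * 0.1 if match else 0.5


def authority_score(url, weights, default=0.5):
    """Look up the domain weight, falling back to the default."""
    return weights.get(urlparse(url).netloc.lower(), default)


# Each raw score is multiplied by the weight used in the usage example.
weighted = [
    1.0 * keyword_score(URL, ["python", "programming"]),   # 1 of 2 keywords -> 0.5
    0.7 * path_depth_score(URL, optimal_depth=2),          # depth 4 -> 0.7 * 1/3
    0.8 * freshness_score(URL),                            # year 2024 -> 0.8
    0.9 * authority_score(URL, {"python.org": 1.0, "github.com": 0.9, "medium.com": 0.7}),
]
# normalize=True divides the sum by the number of scorers.
composite = sum(weighted) / len(weighted)
print(f"{composite:.4f}")  # prints 0.6083
```

Since each sub-scorer is already weighted before the sum is divided by the scorer count, the normalized result is an average of weighted scores, not a weighted average that sums the weights in the denominator.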
View File

@@ -0,0 +1,26 @@
from abc import ABC, abstractmethod
from .models import ScraperResult, CrawlResult
from ..models import CrawlResult
from ..async_webcrawler import AsyncWebCrawler
from typing import Union, AsyncGenerator
class ScraperStrategy(ABC):
@abstractmethod
async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
"""Scrape the given URL using the specified crawler.
Args:
url (str): The starting URL for the scrape.
crawler (AsyncWebCrawler): The web crawler instance.
parallel_processing (bool): Whether to use parallel processing. Defaults to True.
stream (bool): If True, yields individual crawl results as they are ready;
if False, accumulates results and returns a final ScraperResult.
Yields:
CrawlResult: Individual crawl results if stream is True.
Returns:
ScraperResult: A summary of the scrape results containing the final extracted data
and the list of crawled URLs if stream is False.
"""
pass

View File

@@ -1,12 +1,13 @@
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup, Comment, element, Tag, NavigableString
import html2text
import json
import html
import re
import os
import platform
from .html2text import HTML2Text
from html2text import HTML2Text
from .prompts import PROMPT_EXTRACT_BLOCKS
from .config import *
from pathlib import Path
@@ -181,22 +182,9 @@ def escape_json_string(s):
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.ignore_links = True
self.inside_pre = False
self.inside_code = False
self.skip_internal_links = False
self.single_line_break = False
self.mark_code = False
self.include_sup_sub = False
self.body_width = 0
self.ignore_mailto_links = True
self.ignore_links = False
self.escape_backslash = False
self.escape_dot = False
self.escape_plus = False
self.escape_dash = False
self.escape_snob = False
def handle_tag(self, tag, attrs, start):
if tag == 'pre':
@@ -206,10 +194,6 @@ class CustomHTML2Text(HTML2Text):
else:
self.o('\n```')
self.inside_pre = False
elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
pass
# elif tag == 'code' and not self.inside_pre:
# if start:
# if not self.inside_pre:
@@ -708,8 +692,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
for img in imgs:
src = img.get('src', '')
if base64_pattern.match(src):
# Replace base64 data with empty string
img['src'] = base64_pattern.sub('', src)
cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
cleaned_html = sanitize_html(cleaned_html)
@@ -980,53 +964,4 @@ def format_html(html_string):
soup = BeautifulSoup(html_string, 'html.parser')
return soup.prettify()
def normalize_url(href, base_url):
"""Normalize URLs to ensure consistent format"""
# Extract protocol and domain from base URL
try:
base_parts = base_url.split('/')
protocol = base_parts[0]
domain = base_parts[2]
except IndexError:
raise ValueError(f"Invalid base URL format: {base_url}")
# Handle special protocols
special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
if any(href.lower().startswith(proto) for proto in special_protocols):
return href.strip()
# Handle anchor links
if href.startswith('#'):
return f"{base_url}{href}"
# Handle protocol-relative URLs
if href.startswith('//'):
return f"{protocol}{href}"
# Handle root-relative URLs
if href.startswith('/'):
return f"{protocol}//{domain}{href}"
# Handle relative URLs
if not href.startswith(('http://', 'https://')):
# Remove leading './' if present
href = href.lstrip('./')
return f"{protocol}//{domain}/{href}"
return href.strip()
def is_external_url(url, base_domain):
"""Determine if a URL is external"""
special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
if any(url.lower().startswith(proto) for proto in special_protocols):
return True
try:
# Handle URLs with protocol
if url.startswith(('http://', 'https://')):
url_domain = url.split('/')[2]
return base_domain.lower() not in url_domain.lower()
except IndexError:
return False
return False
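
For comparison with the two removed helpers above, here is a minimal sketch of the same idea built on the standard library's `urllib.parse`. The function names `normalize_url_sketch` and `is_external_url_sketch` are hypothetical and are not part of this repository. The sketch resolves `../` segments and protocol-relative links through `urljoin`, and compares hostnames exactly instead of by substring, so its results can differ from the removed code.

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(href: str, base_url: str) -> str:
    """Resolve href against base_url (hypothetical helper, not the repository's code)."""
    # urljoin returns hrefs with a different scheme (mailto:, tel:, data:, ...) unchanged.
    return urljoin(base_url, href.strip())


def is_external_url_sketch(url: str, base_domain: str) -> bool:
    """Treat a URL as external unless its host is base_domain or one of its subdomains."""
    parsed = urlparse(url)
    if parsed.scheme and parsed.scheme not in ("http", "https"):
        # Non-web schemes such as mailto: or tel: count as external, as in the removed code.
        return True
    if not parsed.netloc:
        # Relative links have no host and therefore stay internal.
        return False
    host = (parsed.hostname or "").lower()
    base = base_domain.lower()
    return not (host == base or host.endswith("." + base))
```

Exact host comparison avoids classifying `notexample.com` as internal to `example.com`, which a substring test would do.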


@@ -1,157 +0,0 @@
### Extraction Strategies
#### 1. LLMExtractionStrategy
```python
LLMExtractionStrategy(
# Core Parameters
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
api_token: Optional[str] = None, # API token for the provider
instruction: str = None, # Custom instruction for extraction
schema: Dict = None, # Pydantic model schema for structured extraction
extraction_type: str = "block", # Type of extraction: "block" or "schema"
# Chunking Parameters
chunk_token_threshold: int = CHUNK_TOKEN_THRESHOLD, # Maximum tokens per chunk
overlap_rate: float = OVERLAP_RATE, # Overlap between chunks
word_token_rate: float = WORD_TOKEN_RATE, # Conversion rate from words to tokens
apply_chunking: bool = True, # Whether to apply text chunking
# API Configuration
base_url: str = None, # Base URL for API calls
api_base: str = None, # Alternative base URL
extra_args: Dict = {}, # Additional provider-specific arguments
verbose: bool = False # Enable verbose logging
)
```
Usage Example:
```python
from pydantic import BaseModel

class NewsArticle(BaseModel):
    title: str
    content: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    api_token="your-token",
    schema=NewsArticle.schema(),
    instruction="Extract news article content with title and main text"
)
result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
#### 2. JsonCssExtractionStrategy
```python
JsonCssExtractionStrategy(
schema: Dict[str, Any], # Schema defining extraction rules
verbose: bool = False # Enable verbose logging
)
# Schema Structure
schema = {
"name": str, # Name of the extraction schema
"baseSelector": str, # CSS selector for base elements
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Field type: "text", "attribute", "html", "regex", "nested", "list", "nested_list"
"attribute": str, # For type="attribute"
"pattern": str, # For type="regex"
"transform": str, # Optional: "lowercase", "uppercase", "strip"
"default": Any, # Default value if extraction fails
"fields": List[Dict], # For nested/list types
}
]
}
```
Usage Example:
```python
schema = {
"name": "News Articles",
"baseSelector": "article.news-item",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text",
"transform": "strip"
},
{
"name": "date",
"selector": ".date",
"type": "attribute",
"attribute": "datetime"
}
]
}
strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
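
The `transform` and `default` keys in the schema structure above can be pictured with a small sketch. This is an illustration only: `apply_transform` is a hypothetical helper, and the assumption that `default` is used when nothing was extracted is mine, not confirmed library behaviour.

```python
from typing import Any, Optional


def apply_transform(value: Optional[str], transform: Optional[str] = None,
                    default: Any = None) -> Any:
    """Apply one of the documented transforms to an extracted string (hypothetical helper)."""
    if value is None:
        # Assumption: a missing value falls back to the field's "default".
        return default
    if transform == "lowercase":
        return value.lower()
    if transform == "uppercase":
        return value.upper()
    if transform == "strip":
        return value.strip()
    # Unknown or absent transform: return the value untouched.
    return value
```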
#### 3. CosineStrategy
```python
CosineStrategy(
# Content Filtering
semantic_filter: str = None, # Keyword filter for document filtering
word_count_threshold: int = 10, # Minimum words per cluster
sim_threshold: float = 0.3, # Similarity threshold for filtering
# Clustering Parameters
max_dist: float = 0.2, # Maximum distance for clustering
linkage_method: str = 'ward', # Clustering linkage method
top_k: int = 3, # Number of top categories to extract
# Model Configuration
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable verbose logging
)
```
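
To give a feel for what `sim_threshold` controls, here is a minimal sketch of cosine-similarity filtering on plain Python lists. It assumes the vectors are already computed; the real strategy derives them from the embedding model named in `model_name`. The helper names and the `>=` comparison are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(query: Sequence[float], docs: List[Sequence[float]],
                         sim_threshold: float = 0.3) -> List[int]:
    """Return indices of document vectors at least sim_threshold similar to the query."""
    return [i for i, doc in enumerate(docs)
            if cosine_similarity(query, doc) >= sim_threshold]
```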
### Chunking Strategies
#### 1. RegexChunking
```python
RegexChunking(
patterns: List[str] = None # List of regex patterns for splitting text
# Default pattern: [r'\n\n']
)
```
Usage Example:
```python
chunker = RegexChunking(patterns=[r'\n\n', r'\.\s+']) # Split on double newlines and sentences
chunks = chunker.chunk(text)
```
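
As a rough picture of regex-based chunking, the sketch below applies each pattern in turn to the chunks produced so far. `regex_chunk` is a hypothetical stand-in, not the library's `RegexChunking` class, and dropping empty chunks is a choice made here for readability.

```python
import re
from typing import List, Optional


def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> List[str]:
    """Split text on every pattern in sequence (hypothetical stand-in for RegexChunking)."""
    patterns = patterns or [r"\n\n"]
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Drop chunks that are empty or whitespace-only.
    return [chunk for chunk in chunks if chunk.strip()]
```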
#### 2. SlidingWindowChunking
```python
SlidingWindowChunking(
window_size: int = 100, # Size of the window in words
step: int = 50, # Number of words to slide the window
)
```
Usage Example:
```python
chunker = SlidingWindowChunking(window_size=200, step=100)
chunks = chunker.chunk(text) # Creates overlapping chunks of 200 words, moving 100 words at a time
```
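
The windowing idea can be sketched in a few lines of plain Python. `sliding_window_chunk` is a hypothetical function written for illustration; how the library treats the final partial window is not stated above, so this sketch simply stops once a window reaches the end of the text.

```python
from typing import List


def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Return word windows of window_size words, advancing step words each time."""
    if step <= 0:
        raise ValueError("step must be positive")
    words = text.split()
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            # This window already covers the end of the text.
            break
    return chunks
```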
#### 3. OverlappingWindowChunking
```python
OverlappingWindowChunking(
window_size: int = 1000, # Size of each chunk in words
overlap: int = 100 # Number of words to overlap between chunks
)
```
Usage Example:
```python
chunker = OverlappingWindowChunking(window_size=500, overlap=50)
chunks = chunker.chunk(text) # Creates 500-word chunks with 50-word overlap
```
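
An overlap-based window differs from a step-based one only in how the next start position is computed: each chunk begins `overlap` words before the previous one ended. The sketch below is a hypothetical illustration of that relationship, not the library's implementation.

```python
from typing import List


def overlapping_window_chunk(text: str, window_size: int = 1000,
                             overlap: int = 100) -> List[str]:
    """Return chunks of window_size words that share overlap words with their neighbour."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # Step back by `overlap` words so consecutive chunks share context.
        start = end - overlap
    return chunks
```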


@@ -1,175 +0,0 @@
# Features
## Current Features
1. Async-first architecture for high-performance web crawling
2. Built-in anti-bot detection bypass ("magic mode")
3. Multiple browser engine support (Chromium, Firefox, WebKit)
4. Smart session management with automatic cleanup
5. Automatic content cleaning and relevance scoring
6. Built-in markdown generation with formatting preservation
7. Intelligent image scoring and filtering
8. Automatic popup and overlay removal
9. Smart wait conditions (CSS/JavaScript based)
10. Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
11. Schema-based structured data extraction
12. Automated iframe content processing
13. Intelligent link categorization (internal/external)
14. Multiple chunking strategies for large content
15. Real-time HTML cleaning and sanitization
16. Automatic screenshot capabilities
17. Social media link filtering
18. Semantic similarity-based content clustering
19. Human behavior simulation for anti-bot bypass
20. Proxy support with authentication
21. Automatic resource cleanup
22. Custom CSS selector-based extraction
23. Automatic content relevance scoring ("fit" content)
24. Recursive website crawling capabilities
25. Flexible hook system for customization
26. Built-in caching system
27. Domain-based content filtering
28. Dynamic content handling with JavaScript execution
29. Automatic media content extraction and classification
30. Metadata extraction and processing
31. Customizable HTML to Markdown conversion
32. Token-aware content chunking for LLM processing
33. Automatic response header and status code handling
34. Browser fingerprint customization
35. Multiple extraction strategies (LLM, CSS, Cosine, XPath)
36. Automatic error image generation for failed screenshots
37. Smart content overlap handling for large texts
38. Built-in rate limiting for batch processing
39. Automatic cookie handling
40. Browser Console logging and debugging capabilities
## Feature Techs
• Browser Management
- Asynchronous browser control
- Multi-browser support (Chromium, Firefox, WebKit)
- Headless mode support
- Browser cleanup and resource management
- Custom browser arguments and configuration
- Context management with `__aenter__` and `__aexit__`
• Session Handling
- Session management with TTL (Time To Live)
- Session reuse capabilities
- Session cleanup for expired sessions
- Session-based context preservation
• Stealth Features
- Playwright stealth configuration
- Navigator properties override
- WebDriver detection evasion
- Chrome app simulation
- Plugin simulation
- Language preferences simulation
- Hardware concurrency simulation
- Media codecs simulation
• Network Features
- Proxy support with authentication
- Custom headers management
- Cookie handling
- Response header capture
- Status code tracking
- Network idle detection
• Page Interaction
- Smart wait functionality for multiple conditions
- CSS selector-based waiting
- JavaScript condition waiting
- Custom JavaScript execution
- User interaction simulation (mouse/keyboard)
- Page scrolling
- Timeout management
- Load state monitoring
• Content Processing
- HTML content extraction
- Iframe processing and content extraction
- Delayed content retrieval
- Content caching
- Cache file management
- HTML cleaning and processing
• Image Handling
- Screenshot capabilities (full page)
- Base64 encoding of screenshots
- Image dimension updating
- Image filtering (size/visibility)
- Error image generation
- Natural width/height preservation
• Overlay Management
- Popup removal
- Cookie notice removal
- Newsletter dialog removal
- Modal removal
- Fixed position element removal
- Z-index based overlay detection
- Visibility checking
• Hook System
- Browser creation hooks
- User agent update hooks
- Execution start hooks
- Navigation hooks (before/after goto)
- HTML retrieval hooks
- HTML return hooks
• Error Handling
- Browser error catching
- Network error handling
- Timeout handling
- Screenshot error recovery
- Invalid selector handling
- General exception management
• Performance Features
- Concurrent URL processing
- Semaphore-based rate limiting
- Async gathering of results
- Resource cleanup
- Memory management
• Debug Features
- Console logging
- Page error logging
- Verbose mode
- Error message generation
- Warning system
• Security Features
- Certificate error handling
- Sandbox configuration
- GPU handling
- CSP (Content Security Policy) compliant waiting
• Configuration
- User agent customization
- Viewport configuration
- Timeout configuration
- Browser type selection
- Proxy configuration
- Header configuration
• Data Models
- Pydantic model for responses
- Type hints throughout code
- Structured response format
- Optional response fields
• File System Integration
- Cache directory management
- File path handling
- Cache metadata storage
- File read/write operations
• Metadata Handling
- Response headers capture
- Status code tracking
- Cache metadata
- Session tracking
- Timestamp management
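
The "Semaphore-based rate limiting" and "Async gathering of results" items under Performance Features can be sketched with the standard `asyncio` module. `run_limited` and `fake_fetch` are hypothetical names used for illustration; the crawler's own batching code may be organised differently.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_limited(urls: List[str],
                      worker: Callable[[str], Awaitable[str]],
                      max_concurrent: int = 5) -> List[str]:
    """Run worker(url) for every URL with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> str:
        # Only max_concurrent coroutines can be inside this block at once.
        async with semaphore:
            return await worker(url)

    # gather preserves the order of the input URLs in its result list.
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl call; it sleeps briefly instead of touching the network."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"
```

Calling `asyncio.run(run_limited(["a", "b", "c"], fake_fetch, max_concurrent=2))` processes three URLs while never running more than two at a time.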


@@ -1,150 +0,0 @@
### 1. Basic Web Crawling
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    print(result.markdown)      # Get clean markdown content
    print(result.html)          # Get raw HTML
    print(result.cleaned_html)  # Get cleaned HTML
```
### 2. Browser Control Options
- Multiple Browser Support
```python
# Choose between different browser engines
crawler = AsyncWebCrawler(browser_type="firefox") # or "chromium", "webkit"
crawler = AsyncWebCrawler(headless=False) # For visible browser
```
- Proxy Configuration
```python
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
# Or with authentication
crawler = AsyncWebCrawler(proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
```
### 3. Content Selection & Filtering
- CSS Selector Support
```python
result = await crawler.arun(
url="https://example.com",
css_selector=".main-content" # Extract specific content
)
```
- Content Filtering Options
```python
result = await crawler.arun(
url="https://example.com",
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header'], # Tags to exclude
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_external_images=True # Remove external images
)
```
### 4. Dynamic Content Handling
- JavaScript Execution
```python
result = await crawler.arun(
url="https://example.com",
js_code="window.scrollTo(0, document.body.scrollHeight)" # Execute custom JS
)
```
- Wait Conditions
```python
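# Wait for an element to appear (CSS selector)
result = await crawler.arun(
    url="https://example.com",
    wait_for="css:.my-element"
)
# Or wait for a JavaScript condition to become true
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.readyState === 'complete'"
)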
```
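
The `css:` and `js:` prefixes used in the wait conditions above suggest a simple dispatch step. The sketch below shows one way such a string could be classified. `parse_wait_condition` is a hypothetical helper, and the fallback for unprefixed strings is an assumption of this sketch rather than documented behaviour.

```python
from typing import Tuple


def parse_wait_condition(wait_for: str) -> Tuple[str, str]:
    """Return (kind, expression), where kind is "css" or "js" (hypothetical helper)."""
    condition = wait_for.strip()
    if condition.startswith("css:"):
        return "css", condition[len("css:"):].strip()
    if condition.startswith("js:"):
        return "js", condition[len("js:"):].strip()
    # No prefix: guess from the shape of the string (assumption of this sketch).
    if condition.startswith(("()", "function")):
        return "js", condition
    return "css", condition
```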
### 5. Anti-Bot Protection Handling
```python
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True, # Mask automation signals
magic=True # Enable all anti-detection features
)
```
### 6. Session Management
```python
session_id = "my_session"
result1 = await crawler.arun(url="https://example.com/page1", session_id=session_id)
result2 = await crawler.arun(url="https://example.com/page2", session_id=session_id)
await crawler.crawler_strategy.kill_session(session_id)
```
### 7. Media Handling
- Screenshot Capture
```python
result = await crawler.arun(
url="https://example.com",
screenshot=True
)
base64_screenshot = result.screenshot
```
- Media Extraction
```python
result = await crawler.arun(url="https://example.com")
print(result.media['images']) # List of images
print(result.media['videos']) # List of videos
print(result.media['audios']) # List of audio files
```
### 8. Structured Data Extraction
- CSS-based Extraction
```python
schema = {
"name": "News Articles",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=extraction_strategy
)
structured_data = json.loads(result.extracted_content)
```
- LLM-based Extraction (Multiple Providers)
```python
from pydantic import BaseModel

class NewsArticle(BaseModel):
    title: str
    summary: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",  # or "openai/gpt-4", "huggingface/..."
    api_token="your-token",
    schema=NewsArticle.schema(),
    instruction="Extract news article details..."
)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
```
### 9. Content Cleaning & Processing
```python
result = await crawler.arun(
url="https://example.com",
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
)
print(result.fit_markdown) # Get most relevant content
print(result.fit_html) # Get cleaned HTML
```


@@ -1,457 +0,0 @@
### 1. Basic Web Crawling
Basic web crawling provides the foundation for extracting content from websites. The library supports both simple single-page crawling and recursive website crawling.
```python
# Simple page crawling
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    print(result.html)          # Raw HTML
    print(result.markdown)      # Cleaned markdown
    print(result.cleaned_html)  # Cleaned HTML

# Recursive website crawling
class SimpleWebsiteScraper:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler

    async def scrape(self, start_url: str, max_depth: int):
        # scrape_recursive is assumed to be implemented elsewhere; it is not shown here
        results = await self.scrape_recursive(start_url, max_depth)
        return results

# Usage
async with AsyncWebCrawler() as crawler:
    scraper = SimpleWebsiteScraper(crawler)
    results = await scraper.scrape("https://example.com", max_depth=2)
```
### 2. Browser Control Options
The library provides extensive control over browser behavior, allowing customization of browser type, headless mode, and proxy settings.
```python
# Browser Type Selection
async with AsyncWebCrawler(
    browser_type="firefox",  # Options: "chromium", "firefox", "webkit"
    headless=False,          # For visible browser
    verbose=True             # Enable logging
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Proxy Configuration
async with AsyncWebCrawler(
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass"
    },
    headers={
        "User-Agent": "Custom User Agent",
        "Accept-Language": "en-US,en;q=0.9"
    }
) as crawler:
    result = await crawler.arun(url="https://example.com")
```
### 3. Content Selection & Filtering
The library offers multiple ways to select and filter content, from CSS selectors to word count thresholds.
```python
# CSS Selector and Content Filtering
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector="article.main-content",               # Extract specific content
        word_count_threshold=10,                           # Minimum words per block
        excluded_tags=['form', 'header'],                  # Tags to exclude
        exclude_external_links=True,                       # Remove external links
        exclude_social_media_links=True,                   # Remove social media links
        exclude_domains=["pinterest.com", "facebook.com"]  # Exclude specific domains
    )

# Custom HTML to Text Options
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        html2text={
            "escape_dot": False,
            "links_each_paragraph": True,
            "protect_links": True
        }
    )
```
### 4. Dynamic Content Handling
The library provides sophisticated handling of dynamic content with JavaScript execution and wait conditions.
```python
# JavaScript Execution and Wait Conditions
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        js_code=[
            "window.scrollTo(0, document.body.scrollHeight);",
            "document.querySelector('.load-more').click();"
        ],
        wait_for="css:.dynamic-content",  # Wait for element
        delay_before_return_html=2.0      # Wait after JS execution
    )

# Smart Wait Conditions
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        wait_for="""() => {
            return document.querySelectorAll('.item').length > 10;
        }""",
        page_timeout=60000  # 60 seconds timeout
    )
```
### 5. Advanced Link Analysis
The library provides comprehensive link analysis capabilities, distinguishing between internal and external links, with options for filtering and processing.
```python
# Basic Link Analysis
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    # Access internal and external links
    for internal_link in result.links['internal']:
        print(f"Internal: {internal_link['href']} - {internal_link['text']}")
    for external_link in result.links['external']:
        print(f"External: {external_link['href']} - {external_link['text']}")

# Advanced Link Filtering
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        exclude_external_links=True,        # Remove all external links
        exclude_social_media_links=True,    # Remove social media links
        exclude_social_media_domains=[      # Custom social media domains
            "facebook.com", "twitter.com", "instagram.com"
        ],
        exclude_domains=["pinterest.com"]   # Specific domains to exclude
    )
```
### 6. Anti-Bot Protection Handling
The library includes sophisticated anti-detection mechanisms to handle websites with bot protection.
```python
# Basic Anti-Detection
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        simulate_user=True,       # Simulate human behavior
        override_navigator=True   # Override navigator properties
    )

# Advanced Anti-Detection with Magic Mode
async with AsyncWebCrawler(headless=False) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True,                    # Enable all anti-detection features
        remove_overlay_elements=True,  # Remove popups/modals automatically
        # Custom navigator properties
        js_code="""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
        """
    )
```
### 7. Session Management
Session management allows maintaining state across multiple requests and handling cookies.
```python
# Basic Session Management
async with AsyncWebCrawler() as crawler:
    session_id = "my_session"
    # Login
    login_result = await crawler.arun(
        url="https://example.com/login",
        session_id=session_id,
        js_code="document.querySelector('form').submit();"
    )
    # Use same session for subsequent requests
    protected_result = await crawler.arun(
        url="https://example.com/protected",
        session_id=session_id
    )
    # Clean up session
    await crawler.crawler_strategy.kill_session(session_id)

# Advanced Session with Custom Cookies
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        session_id="custom_session",
        cookies=[{
            "name": "sessionId",
            "value": "abc123",
            "domain": "example.com"
        }]
    )
```
### 8. Screenshot and Media Handling
The library provides comprehensive media handling capabilities, including screenshots and media content extraction.
```python
import base64

# Screenshot Capture
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        screenshot=True,
        screenshot_wait_for=2.0  # Wait before taking screenshot
    )
    # Save screenshot (result.screenshot is a base64-encoded string)
    if result.screenshot:
        with open("screenshot.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))

# Media Extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    # Process images with metadata
    for image in result.media['images']:
        print(f"Image: {image['src']}")
        print(f"Alt text: {image['alt']}")
        print(f"Context: {image['desc']}")
        print(f"Relevance score: {image['score']}")
    # Process videos and audio
    for video in result.media['videos']:
        print(f"Video: {video['src']}")
    for audio in result.media['audios']:
        print(f"Audio: {audio['src']}")
```
### 9. Structured Data Extraction & Chunking
The library supports multiple strategies for structured data extraction and content chunking.
```python
# LLM-based Extraction
from pydantic import BaseModel

class NewsArticle(BaseModel):
    title: str
    content: str
    author: str

extraction_strategy = LLMExtractionStrategy(
provider='openai/gpt-4',
api_token="your-token",
schema=NewsArticle.schema(),
instruction="Extract news article details",
chunk_token_threshold=1000,
overlap_rate=0.1
)
# CSS-based Extraction
schema = {
"name": "Product Listing",
"baseSelector": ".product-card",
"fields": [
{
"name": "title",
"selector": "h2",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text",
"transform": "strip"
}
]
}
css_strategy = JsonCssExtractionStrategy(schema)
# Text Chunking
from crawl4ai.chunking_strategy import OverlappingWindowChunking
chunking_strategy = OverlappingWindowChunking(
window_size=1000,
overlap=100
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=extraction_strategy,
        chunking_strategy=chunking_strategy
    )
```
### 10. Content Cleaning & Processing
The library provides extensive content cleaning and processing capabilities, ensuring high-quality output in various formats.
```python
# Basic Content Cleaning
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        remove_overlay_elements=True,  # Remove popups/modals
        process_iframes=True,          # Process iframe content
        word_count_threshold=10        # Minimum words per block
    )
    print(result.cleaned_html)  # Clean HTML
    print(result.fit_html)      # Most relevant HTML content
    print(result.fit_markdown)  # Most relevant markdown content

# Advanced Content Processing
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        excluded_tags=['form', 'header', 'footer', 'nav'],
        html2text={
            "escape_dot": False,
            "body_width": 0,
            "protect_links": True,
            "unicode_snob": True,
            "ignore_links": False,
            "ignore_images": False,
            "ignore_emphasis": False,
            "bypass_tables": False,
            "ignore_tables": False
        }
    )
```
### Advanced Usage Patterns
#### 1. Combining Multiple Features
```python
async with AsyncWebCrawler(
    browser_type="chromium",
    headless=False,
    verbose=True
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        # Anti-bot measures
        magic=True,
        simulate_user=True,
        # Content selection
        css_selector="article.main",
        word_count_threshold=10,
        # Dynamic content handling
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.dynamic-content",
        # Content filtering
        exclude_external_links=True,
        exclude_social_media_links=True,
        # Media handling
        screenshot=True,
        process_iframes=True,
        # Content cleaning
        remove_overlay_elements=True
    )
```
#### 2. Custom Extraction Pipeline
```python
# Define custom schemas and strategies
class Article(BaseModel):
    title: str
    content: str
    date: str

# CSS extraction for initial content
css_schema = {
"name": "Article Extraction",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
# LLM processing for semantic analysis
llm_strategy = LLMExtractionStrategy(
provider="ollama/nemotron",
api_token="your-token",
schema=Article.schema(),
instruction="Extract and clean article content"
)
# Chunking strategy for large content
chunking = OverlappingWindowChunking(window_size=1000, overlap=100)
async with AsyncWebCrawler() as crawler:
    # First pass: Extract structure
    css_result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=JsonCssExtractionStrategy(css_schema)
    )
    # Second pass: Semantic processing
    llm_result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=llm_strategy,
        chunking_strategy=chunking
    )
```
#### 3. Website Crawling with Custom Processing
```python
from typing import Dict

class CustomWebsiteCrawler:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler
        self.results = {}

    def _is_valid_link(self, href: str) -> bool:
        # Placeholder filter (assumption): the original does not show this helper
        return href.startswith(("http://", "https://"))

    async def process_page(self, url: str) -> Dict:
        result = await self.crawler.arun(
            url=url,
            magic=True,
            word_count_threshold=10,
            exclude_external_links=True,
            process_iframes=True,
            remove_overlay_elements=True
        )
        # Process internal links
        internal_links = [
            link['href'] for link in result.links['internal']
            if self._is_valid_link(link['href'])
        ]
        # Extract media
        media_urls = [img['src'] for img in result.media['images']]
        return {
            'content': result.markdown,
            'links': internal_links,
            'media': media_urls,
            'metadata': result.metadata
        }

    async def crawl_website(self, start_url: str, max_depth: int = 2):
        visited = set()
        queue = [(start_url, 0)]
        while queue:
            url, depth = queue.pop(0)
            if depth > max_depth or url in visited:
                continue
            visited.add(url)
            page = await self.process_page(url)
            self.results[url] = page
            # Enqueue discovered internal links one level deeper
            queue.extend((link, depth + 1) for link in page['links'])
        return self.results
```
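
The traversal in `crawl_website` is a depth-limited breadth-first search. The sketch below isolates that logic and runs it on an in-memory link graph, so it can be tried without a browser. `crawl_order` and the `link_graph` mapping are hypothetical and stand in for real page fetches.

```python
from collections import deque
from typing import Dict, List


def crawl_order(link_graph: Dict[str, List[str]], start_url: str,
                max_depth: int = 2) -> List[str]:
    """Return URLs in the order a depth-limited breadth-first crawl would visit them."""
    visited = set()
    order: List[str] = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth > max_depth or url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Each outgoing link is scheduled one level deeper than the current page.
        for link in link_graph.get(url, []):
            queue.append((link, depth + 1))
    return order
```

A `deque` is used here because `popleft()` runs in constant time, whereas `list.pop(0)` shifts every remaining element.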


@@ -1,282 +0,0 @@
### AsyncWebCrawler Constructor Parameters
```python
AsyncWebCrawler(
# Core Browser Settings
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
headless: bool = True, # Whether to run browser in headless mode
verbose: bool = False, # Enable verbose logging
# Cache Settings
always_by_pass_cache: bool = False, # Always bypass cache regardless of run settings
base_directory: str = str(Path.home()), # Base directory for cache storage
# Network Settings
proxy: str = None, # Simple proxy URL (e.g., "http://proxy.example.com:8080")
proxy_config: Dict = None, # Advanced proxy settings with auth: {"server": str, "username": str, "password": str}
# Browser Behavior
sleep_on_close: bool = False, # Wait before closing browser
# Other Settings passed to AsyncPlaywrightCrawlerStrategy
user_agent: str = None, # Custom user agent string
headers: Dict[str, str] = {}, # Custom HTTP headers
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
)
```
### arun() Method Parameters
```python
arun(
# Core Parameters
url: str, # Required: URL to crawl
# Content Selection
css_selector: str = None, # CSS selector to extract specific content
word_count_threshold: int = MIN_WORD_THRESHOLD, # Minimum words for content blocks
# Cache Control
bypass_cache: bool = False, # Bypass cache for this request
# Session Management
session_id: str = None, # Session identifier for persistent browsing
# Screenshot Options
screenshot: bool = False, # Take page screenshot
screenshot_wait_for: float = None, # Wait time before screenshot
# Content Processing
process_iframes: bool = False, # Process iframe content
remove_overlay_elements: bool = False, # Remove popups/modals
# Anti-Bot/Detection
simulate_user: bool = False, # Simulate human-like behavior
override_navigator: bool = False, # Override navigator properties
magic: bool = False, # Enable all anti-detection features
# Content Filtering
excluded_tags: List[str] = None, # HTML tags to exclude
exclude_external_links: bool = False, # Remove external links
exclude_social_media_links: bool = False, # Remove social media links
exclude_external_images: bool = False, # Remove external images
exclude_social_media_domains: List[str] = None, # Additional social media domains to exclude
remove_forms: bool = False, # Remove all form elements
# JavaScript Handling
js_code: Union[str, List[str]] = None, # JavaScript to execute
js_only: bool = False, # Only execute JavaScript without reloading page
wait_for: str = None, # Wait condition (CSS selector or JS function)
# Page Loading
page_timeout: int = 60000, # Page load timeout in milliseconds
delay_before_return_html: float = None, # Wait before returning HTML
# Debug Options
log_console: bool = False, # Log browser console messages
# Content Format Control
only_text: bool = False, # Extract only text content
keep_data_attributes: bool = False, # Keep data-* attributes in HTML
# Markdown Options
include_links_on_markdown: bool = False, # Include links in markdown output
html2text: Dict = {}, # HTML to text conversion options
# Extraction Strategy
extraction_strategy: ExtractionStrategy = None, # Strategy for structured data extraction
# Advanced Browser Control
user_agent: str = None, # Override user agent for this request
)
```
### Extraction Strategy Parameters
```python
# JsonCssExtractionStrategy
{
"name": str, # Name of extraction schema
"baseSelector": str, # Base CSS selector
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Data type ("text", etc.)
"transform": str = None # Optional transformation
}
]
}
# LLMExtractionStrategy
{
"provider": str, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
"api_token": str, # API token
"schema": dict, # Pydantic model schema
"extraction_type": str, # Type of extraction ("schema", etc.)
"instruction": str, # Extraction instruction
"extra_args": dict = None, # Additional provider-specific arguments
"extra_headers": dict = None # Additional HTTP headers
}
```
### HTML to Text Conversion Options (html2text parameter)
```python
{
"escape_dot": bool = True, # Escape dots in text
# Other html2text library options
}
```
### CrawlResult Fields
```python
class CrawlResult(BaseModel):
    # Basic Information
    url: str  # The crawled URL
    # Example: "https://example.com"
    success: bool  # Whether the crawl was successful
    # Example: True/False
    status_code: Optional[int]  # HTTP status code
    # Example: 200, 404, 500

    # Content Fields
    html: str  # Raw HTML content
    # Example: "<html><body>...</body></html>"
    cleaned_html: Optional[str]  # HTML after cleaning and processing
    # Example: "<article><p>Clean content...</p></article>"
    fit_html: Optional[str]  # Most relevant HTML content after content cleaning strategy
    # Example: "<div><p>Most relevant content...</p></div>"
    markdown: Optional[str]  # HTML converted to markdown
    # Example: "# Title\n\nContent paragraph..."
    fit_markdown: Optional[str]  # Most relevant content in markdown
    # Example: "# Main Article\n\nKey content..."

    # Media Content
    media: Dict[str, List[Dict]] = {}  # Extracted media information
    # Example: {
    #     "images": [
    #         {
    #             "src": "https://example.com/image.jpg",
    #             "alt": "Image description",
    #             "desc": "Contextual description",
    #             "score": 5,  # Relevance score
    #             "type": "image"
    #         }
    #     ],
    #     "videos": [
    #         {
    #             "src": "https://example.com/video.mp4",
    #             "alt": "Video title",
    #             "type": "video",
    #             "description": "Video context"
    #         }
    #     ],
    #     "audios": [
    #         {
    #             "src": "https://example.com/audio.mp3",
    #             "alt": "Audio title",
    #             "type": "audio",
    #             "description": "Audio context"
    #         }
    #     ]
    # }

    # Link Information
    links: Dict[str, List[Dict]] = {}  # Extracted links
    # Example: {
    #     "internal": [
    #         {
    #             "href": "https://example.com/page",
    #             "text": "Link text",
    #             "title": "Link title"
    #         }
    #     ],
    #     "external": [
    #         {
    #             "href": "https://external.com",
    #             "text": "External link text",
    #             "title": "External link title"
    #         }
    #     ]
    # }

    # Extraction Results
    extracted_content: Optional[str]  # Content from extraction strategy
    # Example for JsonCssExtractionStrategy:
    # '[{"title": "Article 1", "date": "2024-03-20"}, ...]'
    # Example for LLMExtractionStrategy:
    # '{"entities": [...], "relationships": [...]}'

    # Additional Information
    metadata: Optional[dict] = None  # Page metadata
    # Example: {
    #     "title": "Page Title",
    #     "description": "Meta description",
    #     "keywords": ["keyword1", "keyword2"],
    #     "author": "Author Name",
    #     "published_date": "2024-03-20"
    # }
    screenshot: Optional[str] = None  # Base64 encoded screenshot
    # Example: "iVBORw0KGgoAAAANSUhEUgAA..."
    error_message: Optional[str] = None  # Error message if crawl failed
    # Example: "Failed to load page: timeout"
    session_id: Optional[str] = None  # Session identifier
    # Example: "session_123456"
    response_headers: Optional[dict] = None  # HTTP response headers
    # Example: {
    #     "content-type": "text/html",
    #     "server": "nginx/1.18.0",
    #     "date": "Wed, 20 Mar 2024 12:00:00 GMT"
    # }
```
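The `screenshot` field above is documented as a Base64-encoded string. A minimal sketch of turning it back into image bytes follows, assuming standard Base64 and a PNG payload (the example prefix `iVBORw0KGgo` is what a PNG signature looks like when encoded). `save_screenshot` is a hypothetical helper written for illustration, not part of crawl4ai:

```python
import base64
from pathlib import Path
from typing import Optional


def save_screenshot(screenshot: Optional[str], path: str) -> bool:
    """Decode a Base64 screenshot string and write the bytes to `path`.

    Returns False when no screenshot was captured (field is None or empty).
    """
    if not screenshot:
        return False
    Path(path).write_bytes(base64.b64decode(screenshot))
    return True
```

With a real result this would be called as `save_screenshot(result.screenshot, "page.png")`; whether the field is populated depends on the crawl having been run with screenshots enabled.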
### Common Usage Patterns
1. Basic Content Extraction:
```python
result = await crawler.arun(url="https://example.com")
print(result.markdown) # Clean, readable content
print(result.cleaned_html) # Cleaned HTML
```
2. Media Analysis:
```python
result = await crawler.arun(url="https://example.com")
for image in result.media["images"]:
    if image["score"] > 3:  # High-relevance images
        print(f"High-quality image: {image['src']}")
```
3. Link Analysis:
```python
result = await crawler.arun(url="https://example.com")
internal_links = [link["href"] for link in result.links["internal"]]
external_links = [link["href"] for link in result.links["external"]]
```
4. Structured Data Extraction:
```python
import json

result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=my_strategy
)
# extracted_content is a JSON string; parse it before use
structured_data = json.loads(result.extracted_content)
```
5. Error Handling:
```python
result = await crawler.arun(url="https://example.com")
if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
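The patterns above index straight into `media`, `links`, and `extracted_content`, which can raise when a key is absent or a field is `None` (for example after a failed crawl). The sketch below is a more defensive variant that works on plain values shaped like the documented fields. The helper names and the default threshold of 3 are illustrative assumptions, not crawl4ai APIs:

```python
import json
from typing import Any, Dict, List, Optional


def relevant_images(media: Dict[str, List[Dict]], min_score: float = 3) -> List[str]:
    """Return the `src` of images scoring above `min_score`.

    Images without a `score` key are treated as scoring 0.
    """
    return [
        img["src"]
        for img in media.get("images", [])
        if "src" in img and img.get("score", 0) > min_score
    ]


def link_hrefs(links: Dict[str, List[Dict]], kind: str) -> List[str]:
    """Return hrefs for one link group ("internal" or "external")."""
    return [link["href"] for link in links.get(kind, []) if "href" in link]


def parse_extracted(extracted_content: Optional[str]) -> Any:
    """Parse the JSON string from an extraction strategy.

    Returns None when the field is empty or is not valid JSON.
    """
    if not extracted_content:
        return None
    try:
        return json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
```

With a real result these would be called as `relevant_images(result.media)`, `link_hrefs(result.links, "internal")`, and `parse_extracted(result.extracted_content)`, ideally after checking `result.success`.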


@@ -1,67 +0,0 @@
1. **E-commerce Product Monitor**
- Scraping product details from multiple e-commerce sites
- Price tracking with structured data extraction
- Handling dynamic content and anti-bot measures
- Features: JsonCssExtraction, session management, anti-bot
2. **News Aggregator & Summarizer**
- Crawling news websites
- Content extraction and summarization
- Topic classification
- Features: LLMExtraction, CosineStrategy, content cleaning
3. **Academic Paper Research Assistant**
- Crawling research papers from academic sites
- Extracting citations and references
- Building knowledge graphs
- Features: structured extraction, link analysis, chunking
4. **Social Media Content Analyzer**
- Handling JavaScript-heavy sites
- Dynamic content loading
- Sentiment analysis integration
- Features: dynamic content handling, session management
5. **Real Estate Market Analyzer**
- Scraping property listings
- Processing image galleries
- Geolocation data extraction
- Features: media handling, structured data extraction
6. **Documentation Site Generator**
- Recursive website crawling
- Markdown generation
- Link validation
- Features: website crawling, content cleaning
7. **Job Board Aggregator**
- Handling pagination
- Structured job data extraction
- Filtering and categorization
- Features: session management, JsonCssExtraction
8. **Recipe Database Builder**
- Schema-based extraction
- Image processing
- Ingredient parsing
- Features: structured extraction, media handling
9. **Travel Blog Content Analyzer**
- Location extraction
- Image and map processing
- Content categorization
- Features: CosineStrategy, media handling
10. **Technical Documentation Scraper**
- API documentation extraction
- Code snippet processing
- Version tracking
- Features: content cleaning, structured extraction
Each example will include:
- Problem description
- Technical requirements
- Complete implementation
- Error handling
- Output processing
- Performance considerations


@@ -47,7 +47,8 @@
},
"outputs": [],
"source": [
"!pip install crawl4ai\n",
"# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
"!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
"!pip install nest-asyncio\n",
"!playwright install"
]
@@ -713,7 +714,7 @@
"provenance": []
},
"kernelspec": {
"display_name": "venv",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},


@@ -10,7 +10,7 @@ import time
import json
import os
import re
from typing import Dict, List
from typing import Dict
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler
@@ -379,18 +379,6 @@ async def crawl_custom_browser_type():
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
async def crawl_with_user_simultion():
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
url = "YOUR-URL-HERE"
result = await crawler.arun(
url=url,
bypass_cache=True,
simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
override_navigator = True # Overrides the navigator object to make it look like a real user
)
print(result.markdown)
async def speed_comparison():
# print("\n--- Speed Comparison ---")
# print("Firecrawl (simulated):")
@@ -456,57 +444,6 @@ async def speed_comparison():
print("If you run these tests in an environment with better network conditions,")
print("you may observe an even more significant speed advantage for Crawl4AI.")
async def generate_knowledge_graph():
class Entity(BaseModel):
name: str
description: str
class Relationship(BaseModel):
entity1: Entity
entity2: Entity
description: str
relation_type: str
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
extraction_strategy = LLMExtractionStrategy(
provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""Extract entities and relationships from the given text."""
)
async with AsyncWebCrawler() as crawler:
url = "https://paulgraham.com/love.html"
result = await crawler.arun(
url=url,
bypass_cache=True,
extraction_strategy=extraction_strategy,
# magic=True
)
# print(result.extracted_content)
with open(os.path.join(__location__, "kb.json"), "w") as f:
f.write(result.extracted_content)
async def fit_markdown_remove_overlay():
async with AsyncWebCrawler(headless = False) as crawler:
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
result = await crawler.arun(
url=url,
bypass_cache=True,
word_count_threshold = 10,
remove_overlay_elements=True,
screenshot = True
)
# Save markdown to file
with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
f.write(result.fit_markdown)
print("Done")
async def main():
await simple_crawl()
await simple_example_with_running_js_code()
@@ -518,7 +455,7 @@ async def main():
# LLM extraction examples
await extract_structured_data_using_llm()
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
await extract_structured_data_using_llm("ollama/llama3.2")
# You always can pass custom headers to the extraction strategy
