Compare commits

10 Commits

| Author | SHA1 | Date |
|---|---|---|
| | aadbcb3481 | |
| | c51e901f68 | |
| | 8c611dcb4b | |
| | 486db3a771 | |
| | b02544bc0b | |
| | e9639ad189 | |
| | 95a4f74d2a | |
| | 293f299c08 | |
| | 80d58ad24c | |
| | 3e83893b3f | |
**.gitignore** (vendored, 5 changes)

```diff
@@ -214,4 +214,7 @@ git_issues.md
 todo_executor.md
 protect-all-except-feature.sh
 manage-collab.sh
 publish.sh
+
+combine.sh
+combined_output.txt
```
**CHANGELOG.md** (136 changes)

@@ -1,5 +1,141 @@

# Changelog

## [0.4.1] December 8, 2024

### **File: `crawl4ai/async_crawler_strategy.py`**

#### **New Parameters and Attributes Added**
- **`text_only` (boolean)**: Enables text-only mode; disables images, JavaScript, and GPU-related features for faster, minimal rendering (see the construction sketch below).
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
- **`viewport_width` and `viewport_height`**: Adjusted dynamically based on `text_only` mode (defaults: 800x600 when `text_only` is set, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
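As a rough illustration, the new flags are plain keyword arguments on the Playwright strategy. The sketch below only uses names that appear in the code diff on this page (`text_only`, `light_mode`, `viewport_width`, `viewport_height`); how the strategy is wired into `AsyncWebCrawler` is not shown in this diff and is left out.

```python
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

# Minimal sketch: build the strategy in text-only, light mode.
# viewport_width/viewport_height already default to 800x600 when
# text_only=True, so passing them explicitly is optional.
strategy = AsyncPlaywrightCrawlerStrategy(
    text_only=True,    # drop images, JavaScript, and GPU features
    light_mode=True,   # disable background browser services
    viewport_width=800,
    viewport_height=600,
)
```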
#### **Browser Context Adjustments**
- Added **`viewport` adjustments**: computed dynamically from `text_only` mode or custom configuration.
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.

#### **Dynamic Content Handling**
- **Full Page Scan Feature** (see the sketch below):
  - Scrolls through the entire page while dynamically detecting content changes.
  - Stops scrolling once no new dynamic content is loaded.
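For illustration only: full-page scanning is driven by per-crawl keyword arguments (`scan_full_page` and `scroll_delay` both appear in the strategy code further down this page). The surrounding `AsyncWebCrawler` call is a sketch and the URL is a placeholder.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # scan_full_page scrolls viewport by viewport until the page height
        # stops growing; scroll_delay is the pause between steps, in seconds.
        result = await crawler.arun(
            url="https://example.com/infinite-scroll",  # placeholder URL
            scan_full_page=True,
            scroll_delay=0.2,
        )
        print(len(result.markdown))

asyncio.run(main())
```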
#### **Session Management**
- Added a **`create_session`** method (see the sketch below):
  - Creates a new browser session and assigns it a unique ID.
  - Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
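A minimal sketch of the new session flow, based on the `create_session` and `crawl` signatures in the code diff below. Reusing the returned ID on later calls relies on the `session_id` keyword handled inside `crawl()`; the URLs are placeholders and default logging/configuration is assumed.

```python
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def reuse_session():
    strategy = AsyncPlaywrightCrawlerStrategy()

    # create_session() starts the browser if needed and returns a session ID;
    # you can also supply your own via create_session(session_id="my-session").
    session_id = await strategy.create_session()

    # Passing the same session_id reuses the stored (context, page) pair
    # instead of opening a fresh browser context for every request.
    first = await strategy.crawl("https://example.com/login", session_id=session_id)
    second = await strategy.crawl("https://example.com/dashboard", session_id=session_id)
    return first, second
```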
#### **Improved Content Loading and Adjustment**
- **`adjust_viewport_to_content`** (illustrated below):
  - Automatically adjusts the viewport to match content dimensions.
  - Includes scaling via the Chrome DevTools Protocol (CDP).
- Enhanced content loading:
  - Waits for images to load and for network activity to go idle before proceeding.
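Illustrative only: `wait_for_images` and `adjust_viewport_to_content` are read from the per-crawl keyword arguments in the strategy code shown later on this page; the call below is a sketch and the URL is a placeholder.

```python
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

async def capture_whole_page():
    strategy = AsyncPlaywrightCrawlerStrategy()

    # wait_for_images: wait for network idle and for every <img> to finish loading.
    # adjust_viewport_to_content: measure scrollWidth/scrollHeight, resize the
    # viewport, and scale the rendered page via CDP
    # (Emulation.setDeviceMetricsOverride) so everything fits.
    response = await strategy.crawl(
        "https://example.com/long-article",  # placeholder URL
        wait_for_images=True,
        adjust_viewport_to_content=True,
    )
    return response.html
```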
#### **Error Handling and Logging**
- Improved error handling and detailed logging for:
  - Viewport adjustment (`adjust_viewport_to_content`).
  - Full-page scanning (`scan_full_page`).
  - Dynamic content loading.

#### **Refactoring and Cleanup**
- Replaced hardcoded viewport dimensions with dynamic values (`self.viewport_width`, `self.viewport_height`).
- Removed commented-out and unused code for better readability.
- Added a default value for the `delay_before_return_html` parameter.

#### **Optimizations**
- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
- Improved compatibility across browser types (`chrome`, `firefox`, `webkit`).

---

### **File: `docs/examples/quickstart_async.py`**

#### **Schema Adjustment**
- Changed the schema reference for `LLMExtractionStrategy` (see the snippet below):
  - **Old**: `OpenAIModelFee.schema()`
  - **New**: `OpenAIModelFee.model_json_schema()`
  - This aligns the example with Pydantic v2, where `.schema()` is deprecated in favor of `.model_json_schema()`.
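For context, a minimal sketch of that Pydantic v2 change. `OpenAIModelFee` is the model used in `quickstart_async.py`, but its fields are not visible in this compare view, so the ones below are placeholders.

```python
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    # Placeholder fields; the real model lives in docs/examples/quickstart_async.py.
    model_name: str = Field(..., description="Name of the OpenAI model")
    input_fee: str = Field(..., description="Fee for input tokens")
    output_fee: str = Field(..., description="Fee for output tokens")

# Pydantic v1 style (deprecated in v2):
# schema = OpenAIModelFee.schema()

# Pydantic v2 style, as used in the updated example:
schema = OpenAIModelFee.model_json_schema()
```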
#### **Documentation Comments Updated**
- Improved the extraction instructions for schema-based LLM strategies.

---

### **New Features Added**
1. **Text-Only Mode**:
   - Focuses on minimal resource usage by disabling non-essential browser features.
2. **Light Mode**:
   - Optimizes the browser for performance by disabling background tasks and unnecessary services.
3. **Full Page Scanning**:
   - Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
4. **Dynamic Viewport Adjustment**:
   - Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
5. **Session Management**:
   - Simplifies session handling with better support for persistent and non-persistent contexts.

---

### **Bug Fixes**
- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.

## [0.3.75] December 1, 2024

### PruningContentFilter

#### 1. Introduced PruningContentFilter (Dec 01, 2024)
A new content filtering strategy that removes less relevant nodes based on metrics like text density and link density; a usage sketch follows below.
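A minimal usage sketch, assembled from the constructor and `filter_content` signatures visible in the `content_filter_strategy.py` diff later on this page; the HTML input is a placeholder.

```python
from crawl4ai.content_filter_strategy import PruningContentFilter

# threshold_type="fixed" prunes any node whose composite score (text density,
# link density, tag weight, class/id weight, text length) falls below
# `threshold`; "dynamic" adjusts the threshold per node using tag importance,
# text ratio, and link ratio. min_word_threshold drops very short nodes outright.
prune_filter = PruningContentFilter(
    threshold=0.48,
    threshold_type="fixed",
    min_word_threshold=5,
)

html = "<html><body><article><p>Example content...</p></article></body></html>"
content_blocks = prune_filter.filter_content(html)  # surviving HTML fragments
print(len(content_blocks))
```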
**Affected Files:**
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
```diff
Implemented effective pruning algorithm with comprehensive scoring.
```
- `README.md`: Improved documentation regarding new features.
```diff
Updated to include usage and explanation for the PruningContentFilter.
```
- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
```diff
Added detailed section explaining the PruningContentFilter.
```

#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024)
Comprehensive tests added to ensure correct functionality of the PruningContentFilter.

**Affected Files:**
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
```diff
Created test cases for various scenarios using the PruningContentFilter.
```

### Development Updates

#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024)
Extended testing to cover additional edge cases and performance metrics.

**Affected Files:**
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
```diff
Added tests for new extraction scenarios including malformed HTML.
```

### Infrastructure & Documentation

#### 4. Updated Examples (Dec 01, 2024)
Altered documentation examples to promote the use of PruningContentFilter alongside existing strategies.

**Affected Files:**
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
  - Revised example to illustrate usage of PruningContentFilter.

## [0.3.746] November 29, 2024

### Major Features
**README.md** (34 changes)

```diff
@@ -11,7 +11,9 @@
 
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
-[✨ Check out latest update v0.3.745](#-recent-updates)
+[✨ Check out latest update v0.4.1](#-recent-updates)
 
+🎉 **Version 0.4.x is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
+
 ## 🧐 Why Crawl4AI?

@@ -77,6 +79,7 @@ if __name__ == "__main__":
 - 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
 - ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
 - 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
 
 </details>

@@ -92,6 +95,8 @@ if __name__ == "__main__":
 - 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
 - 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
 - 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
 
 </details>

@@ -118,8 +123,6 @@ if __name__ == "__main__":
 
 </details>
 
 ## Try it Now!
 
 ✨ Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
```
````diff
@@ -422,7 +425,7 @@ You can check the project structure in the directory [https://github.com/uncleco
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.content_filter_strategy import BM25ContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 
 async def main():

@@ -434,8 +437,11 @@ async def main():
 url="https://docs.micronaut.io/4.7.6/guide/",
 cache_mode=CacheMode.ENABLED,
 markdown_generator=DefaultMarkdownGenerator(
-content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
+content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
 ),
+# markdown_generator=DefaultMarkdownGenerator(
+#     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
+# ),
 )
 print(len(result.markdown))
 print(len(result.fit_markdown))
````
```diff
@@ -620,18 +626,22 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates
 
 - 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.
 - 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.
 - ⚡ **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.
 - 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.
 - 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.
+- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
+- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
+- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
+- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
+- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
 
+Read the full details of this release in our [0.4.1 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.1.md).
 
 ## 📖 Documentation & Roadmap
 
-For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
 
-Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
+For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
 
+To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
 
 <details>
 <summary>📈 <strong>Development TODOs</strong></summary>
```
**crawl4ai/_version.py**

```diff
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.3.746"
+__version__ = "0.4.1"
```
@@ -6,6 +6,8 @@ from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os, sys, shutil
|
||||
import tempfile, subprocess
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
|
||||
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from pathlib import Path
|
||||
@@ -16,6 +18,7 @@ import json
|
||||
import uuid
|
||||
from .models import AsyncCrawlResponse
|
||||
from .utils import create_box_message
|
||||
from .user_agent_generator import UserAgentGenerator
|
||||
from playwright_stealth import StealthConfig, stealth_async
|
||||
|
||||
stealth_config = StealthConfig(
|
||||
@@ -218,18 +221,39 @@ class AsyncCrawlerStrategy(ABC):
|
||||
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, logger = None, **kwargs):
|
||||
self.text_only = kwargs.get("text_only", False)
|
||||
self.light_mode = kwargs.get("light_mode", False)
|
||||
self.logger = logger
|
||||
self.use_cached_html = use_cached_html
|
||||
self.viewport_width = kwargs.get("viewport_width", 800 if self.text_only else 1920)
|
||||
self.viewport_height = kwargs.get("viewport_height", 600 if self.text_only else 1080)
|
||||
|
||||
if self.text_only:
|
||||
self.extra_args = kwargs.get("extra_args", []) + [
|
||||
'--disable-images',
|
||||
'--disable-javascript',
|
||||
'--disable-gpu',
|
||||
'--disable-software-rasterizer',
|
||||
'--disable-dev-shm-usage'
|
||||
]
|
||||
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
|
||||
"Mozilla/5.0 (Linux; Android 11; SM-G973F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36"
|
||||
)
|
||||
user_agenr_generator = UserAgentGenerator()
|
||||
if kwargs.get("user_agent_mode") == "random":
|
||||
self.user_agent = user_agenr_generator.generate(
|
||||
**kwargs.get("user_agent_generator_config", {})
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.proxy_config = kwargs.get("proxy_config")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
|
||||
self.headers.setdefault("sec-ch-ua", self.browser_hint)
|
||||
self.cookies = kwargs.get("cookies", [])
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
@@ -291,7 +315,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
# If no default context exists, create one
|
||||
self.default_context = await self.browser.new_context(
|
||||
viewport={"width": 1920, "height": 1080}
|
||||
# viewport={"width": 1920, "height": 1080}
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height}
|
||||
)
|
||||
|
||||
# Set up the default context
|
||||
@@ -307,7 +332,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
if self.user_agent:
|
||||
await self.default_context.set_extra_http_headers({
|
||||
"User-Agent": self.user_agent
|
||||
"User-Agent": self.user_agent,
|
||||
"sec-ch-ua": self.browser_hint,
|
||||
# **self.headers
|
||||
})
|
||||
else:
|
||||
# Base browser arguments
|
||||
@@ -321,10 +348,42 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list"
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--window-position=400,0",
|
||||
f"--window-size={self.viewport_width},{self.viewport_height}",
|
||||
]
|
||||
}
|
||||
|
||||
if self.light_mode:
|
||||
browser_args["args"].extend([
|
||||
# "--disable-background-networking",
|
||||
"--disable-background-timer-throttling",
|
||||
"--disable-backgrounding-occluded-windows",
|
||||
"--disable-breakpad",
|
||||
"--disable-client-side-phishing-detection",
|
||||
"--disable-component-extensions-with-background-pages",
|
||||
"--disable-default-apps",
|
||||
"--disable-extensions",
|
||||
"--disable-features=TranslateUI",
|
||||
"--disable-hang-monitor",
|
||||
"--disable-ipc-flooding-protection",
|
||||
"--disable-popup-blocking",
|
||||
"--disable-prompt-on-repost",
|
||||
"--disable-sync",
|
||||
"--force-color-profile=srgb",
|
||||
"--metrics-recording-only",
|
||||
"--no-first-run",
|
||||
"--password-store=basic",
|
||||
"--use-mock-keychain"
|
||||
])
|
||||
|
||||
if self.text_only:
|
||||
browser_args["args"].extend([
|
||||
'--blink-settings=imagesEnabled=false',
|
||||
'--disable-remote-fonts'
|
||||
])
|
||||
|
||||
# Add channel if specified (try Chrome first)
|
||||
if self.chrome_channel:
|
||||
browser_args["channel"] = self.chrome_channel
|
||||
@@ -354,6 +413,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
if "viewport" not in browser_args:
|
||||
browser_args["viewport"] = {"width": self.viewport_width, "height": self.viewport_height}
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
if self.use_persistent_context and self.user_data_dir:
|
||||
@@ -563,6 +624,38 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
async def create_session(self, **kwargs) -> str:
|
||||
"""Creates a new browser session and returns its ID."""
|
||||
if not self.browser:
|
||||
await self.start()
|
||||
|
||||
session_id = kwargs.get('session_id') or str(uuid.uuid4())
|
||||
|
||||
if self.use_managed_browser:
|
||||
page = await self.default_context.new_page()
|
||||
self.sessions[session_id] = (self.default_context, page, time.time())
|
||||
else:
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
context = self.browser
|
||||
page = await context.new_page()
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=kwargs.get("user_agent", self.user_agent),
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=self.accept_downloads,
|
||||
ignore_https_errors=True
|
||||
)
|
||||
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
|
||||
return session_id
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
"""
|
||||
Crawls a given URL or processes raw HTML/local file content based on the URL prefix.
|
||||
@@ -642,6 +735,15 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
self._cleanup_expired_sessions()
|
||||
session_id = kwargs.get("session_id")
|
||||
|
||||
# Check if in kwargs we have user_agent that will override the default user_agent
|
||||
user_agent = kwargs.get("user_agent", self.user_agent)
|
||||
|
||||
# Generate random user agent if magic mode is enabled and user_agent_mode is not random
|
||||
if kwargs.get("user_agent_mode") != "random" and kwargs.get("magic", False):
|
||||
user_agent = UserAgentGenerator().generate(
|
||||
**kwargs.get("user_agent_generator_config", {})
|
||||
)
|
||||
|
||||
# Handle page creation differently for managed browser
|
||||
context = None
|
||||
if self.use_managed_browser:
|
||||
@@ -662,12 +764,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
# In persistent context, browser is the context
|
||||
context = self.browser
|
||||
page = await context.new_page()
|
||||
else:
|
||||
# Normal context creation for non-persistent or non-Chrome browsers
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1200, "height": 800},
|
||||
user_agent=user_agent,
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
java_script_enabled=True,
|
||||
accept_downloads=self.accept_downloads,
|
||||
@@ -677,7 +778,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
@@ -686,10 +788,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
# Normal context creation
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
user_agent=user_agent,
|
||||
# viewport={"width": 1920, "height": 1080},
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=self.accept_downloads,
|
||||
ignore_https_errors=True # Add this line
|
||||
)
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
@@ -740,9 +844,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.accept_downloads:
|
||||
page.on("download", lambda download: asyncio.create_task(self._handle_download(download)))
|
||||
|
||||
# if self.verbose:
|
||||
# print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
@@ -763,7 +864,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page, context = context)
|
||||
|
||||
|
||||
try:
|
||||
response = await page.goto(
|
||||
@@ -775,9 +875,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error as e:
|
||||
raise RuntimeError(f"Failed on navigating ACS-GOTO :\n{str(e)}")
|
||||
|
||||
# response = await page.goto("about:blank")
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page, context = context)
|
||||
|
||||
# Get status code and headers
|
||||
@@ -830,7 +927,87 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
raise Error(f"Body element is hidden: {visibility_info}")
|
||||
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
# CONTENT LOADING ASSURANCE
|
||||
if not self.text_only and (kwargs.get("wait_for_images", True) or kwargs.get("adjust_viewport_to_content", False)):
|
||||
# Wait for network idle after initial load and images to load
|
||||
await page.wait_for_load_state("networkidle")
|
||||
await asyncio.sleep(0.1)
|
||||
try:
|
||||
await page.wait_for_function("Array.from(document.images).every(img => img.complete)", timeout=1000)
|
||||
# Check for TimeoutError and ignore it
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
|
||||
# After initial load, adjust viewport to content size
|
||||
if not self.text_only and kwargs.get("adjust_viewport_to_content", False):
|
||||
try:
|
||||
# Get actual page dimensions
|
||||
page_width = await page.evaluate("document.documentElement.scrollWidth")
|
||||
page_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
|
||||
target_width = self.viewport_width
|
||||
target_height = int(target_width * page_width / page_height * 0.95)
|
||||
await page.set_viewport_size({"width": target_width, "height": target_height})
|
||||
|
||||
# Compute scale factor
|
||||
# We want the entire page visible: the scale should make both width and height fit
|
||||
scale = min(target_width / page_width, target_height / page_height)
|
||||
|
||||
# Now we call CDP to set metrics.
|
||||
# We tell Chrome that the "device" is page_width x page_height in size,
|
||||
# but we scale it down so everything fits within the real viewport.
|
||||
cdp = await page.context.new_cdp_session(page)
|
||||
await cdp.send('Emulation.setDeviceMetricsOverride', {
|
||||
'width': page_width, # full page width
|
||||
'height': page_height, # full page height
|
||||
'deviceScaleFactor': 1, # keep normal DPR
|
||||
'mobile': False,
|
||||
'scale': scale # scale the entire rendered content
|
||||
})
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(
|
||||
message="Failed to adjust viewport to content: {error}",
|
||||
tag="VIEWPORT",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
|
||||
# After viewport adjustment, handle page scanning if requested
|
||||
if kwargs.get("scan_full_page", False):
|
||||
try:
|
||||
viewport_height = page.viewport_size.get("height", self.viewport_height)
|
||||
current_position = viewport_height # Start with one viewport height
|
||||
scroll_delay = kwargs.get("scroll_delay", 0.2)
|
||||
|
||||
# Initial scroll
|
||||
await page.evaluate(f"window.scrollTo(0, {current_position})")
|
||||
await asyncio.sleep(scroll_delay)
|
||||
|
||||
# Get height after first scroll to account for any dynamic content
|
||||
total_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
|
||||
while current_position < total_height:
|
||||
current_position = min(current_position + viewport_height, total_height)
|
||||
await page.evaluate(f"window.scrollTo(0, {current_position})")
|
||||
await asyncio.sleep(scroll_delay)
|
||||
|
||||
# Check for dynamic content
|
||||
new_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
if new_height > total_height:
|
||||
total_height = new_height
|
||||
|
||||
# Scroll back to top
|
||||
await page.evaluate("window.scrollTo(0, 0)")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(
|
||||
message="Failed to perform full page scan: {error}",
|
||||
tag="PAGE_SCAN",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
else:
|
||||
# Scroll to the bottom of the page
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
@@ -864,7 +1041,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# await page.wait_for_load_state('networkidle', timeout=5000)
|
||||
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
if not self.text_only:
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
@@ -920,14 +1098,27 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
});
|
||||
}
|
||||
"""
|
||||
try:
|
||||
await page.wait_for_load_state()
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
|
||||
|
||||
try:
|
||||
try:
|
||||
await page.wait_for_load_state(
|
||||
# state="load",
|
||||
state="domcontentloaded",
|
||||
timeout=5
|
||||
)
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
# await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
@@ -935,7 +1126,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page, context = context)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html", 0.1)
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
|
||||
@@ -7,6 +7,7 @@ from pathlib import Path
|
||||
from typing import Optional, List, Union
|
||||
import json
|
||||
import asyncio
|
||||
from contextlib import nullcontext
|
||||
from .models import CrawlResult, MarkdownGenerationResult
|
||||
from .async_database import async_db_manager
|
||||
from .chunking_strategy import *
|
||||
@@ -67,6 +68,7 @@ class AsyncWebCrawler:
|
||||
always_bypass_cache: bool = False,
|
||||
always_by_pass_cache: Optional[bool] = None, # Deprecated parameter
|
||||
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
|
||||
thread_safe: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
@@ -104,6 +106,8 @@ class AsyncWebCrawler:
|
||||
else:
|
||||
self.always_bypass_cache = always_bypass_cache
|
||||
|
||||
self._lock = asyncio.Lock() if thread_safe else None
|
||||
|
||||
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
@@ -178,169 +182,170 @@ class AsyncWebCrawler:
|
||||
Returns:
|
||||
CrawlResult: The result of crawling and processing
|
||||
"""
|
||||
try:
|
||||
# Handle deprecated parameters
|
||||
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
|
||||
if kwargs.get("warning", True):
|
||||
warnings.warn(
|
||||
"Cache control boolean flags are deprecated and will be removed in version X.X.X. "
|
||||
"Use 'cache_mode' parameter instead. Examples:\n"
|
||||
"- For bypass_cache=True, use cache_mode=CacheMode.BYPASS\n"
|
||||
"- For disable_cache=True, use cache_mode=CacheMode.DISABLED\n"
|
||||
"- For no_cache_read=True, use cache_mode=CacheMode.WRITE_ONLY\n"
|
||||
"- For no_cache_write=True, use cache_mode=CacheMode.READ_ONLY\n"
|
||||
"Pass warning=False to suppress this warning.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2
|
||||
)
|
||||
async with self._lock or nullcontext():
|
||||
try:
|
||||
# Handle deprecated parameters
|
||||
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
|
||||
if kwargs.get("warning", True):
|
||||
warnings.warn(
|
||||
"Cache control boolean flags are deprecated and will be removed in version X.X.X. "
|
||||
"Use 'cache_mode' parameter instead. Examples:\n"
|
||||
"- For bypass_cache=True, use cache_mode=CacheMode.BYPASS\n"
|
||||
"- For disable_cache=True, use cache_mode=CacheMode.DISABLED\n"
|
||||
"- For no_cache_read=True, use cache_mode=CacheMode.WRITE_ONLY\n"
|
||||
"- For no_cache_write=True, use cache_mode=CacheMode.READ_ONLY\n"
|
||||
"Pass warning=False to suppress this warning.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2
|
||||
)
|
||||
|
||||
# Convert legacy parameters if cache_mode not provided
|
||||
if cache_mode is None:
|
||||
cache_mode = _legacy_to_cache_mode(
|
||||
disable_cache=disable_cache,
|
||||
bypass_cache=bypass_cache,
|
||||
no_cache_read=no_cache_read,
|
||||
no_cache_write=no_cache_write
|
||||
)
|
||||
|
||||
# Convert legacy parameters if cache_mode not provided
|
||||
# Default to ENABLED if no cache mode specified
|
||||
if cache_mode is None:
|
||||
cache_mode = _legacy_to_cache_mode(
|
||||
disable_cache=disable_cache,
|
||||
bypass_cache=bypass_cache,
|
||||
no_cache_read=no_cache_read,
|
||||
no_cache_write=no_cache_write
|
||||
cache_mode = CacheMode.ENABLED
|
||||
|
||||
# Create cache context
|
||||
cache_context = CacheContext(url, cache_mode, self.always_bypass_cache)
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
|
||||
if not isinstance(chunking_strategy, ChunkingStrategy):
|
||||
raise ValueError("Unsupported chunking strategy")
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached_result = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
|
||||
start_time = time.perf_counter()
|
||||
|
||||
# Try to get cached result if appropriate
|
||||
if cache_context.should_read():
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
if cached_result:
|
||||
html = sanitize_input_encode(cached_result.html)
|
||||
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
|
||||
if screenshot:
|
||||
screenshot_data = cached_result.screenshot
|
||||
if not screenshot_data:
|
||||
cached_result = None
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Cache hit for {cache_context.display_url} | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {time.perf_counter() - start_time:.2f}s")
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=time.perf_counter() - start_time,
|
||||
tag="FETCH"
|
||||
)
|
||||
|
||||
|
||||
# Fetch fresh content if needed
|
||||
if not cached_result or not html:
|
||||
t1 = time.perf_counter()
|
||||
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(
|
||||
url,
|
||||
screenshot=screenshot,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
# Default to ENABLED if no cache mode specified
|
||||
if cache_mode is None:
|
||||
cache_mode = CacheMode.ENABLED
|
||||
|
||||
# Create cache context
|
||||
cache_context = CacheContext(url, cache_mode, self.always_bypass_cache)
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
|
||||
if not isinstance(chunking_strategy, ChunkingStrategy):
|
||||
raise ValueError("Unsupported chunking strategy")
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached_result = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
|
||||
start_time = time.perf_counter()
|
||||
|
||||
# Try to get cached result if appropriate
|
||||
if cache_context.should_read():
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
if cached_result:
|
||||
html = sanitize_input_encode(cached_result.html)
|
||||
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
|
||||
if screenshot:
|
||||
screenshot_data = cached_result.screenshot
|
||||
if not screenshot_data:
|
||||
cached_result = None
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Cache hit for {cache_context.display_url} | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {time.perf_counter() - start_time:.2f}s")
|
||||
self.logger.url_status(
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.perf_counter()
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=time.perf_counter() - start_time,
|
||||
timing=t2 - t1,
|
||||
tag="FETCH"
|
||||
)
|
||||
)
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Live fetch for {cache_context.display_url}... | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {t2 - t1:.2f}s")
|
||||
|
||||
|
||||
# Fetch fresh content if needed
|
||||
if not cached_result or not html:
|
||||
t1 = time.perf_counter()
|
||||
# Process the HTML content
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
content_filter=content_filter,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot_data,
|
||||
verbose=verbose,
|
||||
is_cached=bool(cached_result),
|
||||
async_response=async_response,
|
||||
is_web_url=cache_context.is_web_url,
|
||||
is_local_file=cache_context.is_local_file,
|
||||
is_raw_html=cache_context.is_raw_html,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(
|
||||
url,
|
||||
screenshot=screenshot,
|
||||
**kwargs
|
||||
)
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.perf_counter()
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=t2 - t1,
|
||||
tag="FETCH"
|
||||
)
|
||||
# Set response data
|
||||
if async_response:
|
||||
crawl_result.status_code = async_response.status_code
|
||||
crawl_result.response_headers = async_response.response_headers
|
||||
crawl_result.downloaded_files = async_response.downloaded_files
|
||||
else:
|
||||
crawl_result.status_code = 200
|
||||
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Live fetch for {cache_context.display_url}... | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {t2 - t1:.2f}s")
|
||||
# print(f"{Fore.GREEN}{self.tag_format('COMPLETE')} {self.log_icons['COMPLETE']} {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | Status: {Fore.GREEN if crawl_result.success else Fore.RED}{crawl_result.success} | {Fore.YELLOW}Total: {time.perf_counter() - start_time:.2f}s{Style.RESET_ALL}")
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": crawl_result.success,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s"
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if crawl_result.success else Fore.RED,
|
||||
"timing": Fore.YELLOW
|
||||
}
|
||||
)
|
||||
|
||||
# Process the HTML content
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
content_filter=content_filter,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot_data,
|
||||
verbose=verbose,
|
||||
is_cached=bool(cached_result),
|
||||
async_response=async_response,
|
||||
is_web_url=cache_context.is_web_url,
|
||||
is_local_file=cache_context.is_local_file,
|
||||
is_raw_html=cache_context.is_raw_html,
|
||||
**kwargs,
|
||||
)
|
||||
# Update cache if appropriate
|
||||
if cache_context.should_write() and not bool(cached_result):
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
return crawl_result
|
||||
|
||||
# Set response data
|
||||
if async_response:
|
||||
crawl_result.status_code = async_response.status_code
|
||||
crawl_result.response_headers = async_response.response_headers
|
||||
crawl_result.downloaded_files = async_response.downloaded_files
|
||||
else:
|
||||
crawl_result.status_code = 200
|
||||
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
|
||||
# if verbose:
|
||||
# print(f"{Fore.GREEN}{self.tag_format('COMPLETE')} {self.log_icons['COMPLETE']} {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | Status: {Fore.GREEN if crawl_result.success else Fore.RED}{crawl_result.success} | {Fore.YELLOW}Total: {time.perf_counter() - start_time:.2f}s{Style.RESET_ALL}")
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": crawl_result.success,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s"
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if crawl_result.success else Fore.RED,
|
||||
"timing": Fore.YELLOW
|
||||
}
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
# print(f"{Fore.RED}{self.tag_format('ERROR')} {self.log_icons['ERROR']} Failed to crawl {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | {e.msg}{Style.RESET_ALL}")
|
||||
|
||||
self.logger.error_status(
|
||||
url=cache_context.display_url,
|
||||
error=create_box_message(e.msg, type = "error"),
|
||||
tag="ERROR"
|
||||
)
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html="",
|
||||
success=False,
|
||||
error_message=e.msg
|
||||
)
|
||||
|
||||
# Update cache if appropriate
|
||||
if cache_context.should_write() and not bool(cached_result):
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
return crawl_result
|
||||
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
# print(f"{Fore.RED}{self.tag_format('ERROR')} {self.log_icons['ERROR']} Failed to crawl {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | {e.msg}{Style.RESET_ALL}")
|
||||
|
||||
self.logger.error_status(
|
||||
url=cache_context.display_url,
|
||||
error=create_box_message(e.msg, type = "error"),
|
||||
tag="ERROR"
|
||||
)
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html="",
|
||||
success=False,
|
||||
error_message=e.msg
|
||||
)
|
||||
|
||||
async def arun_many(
|
||||
self,
|
||||
urls: List[str],
|
||||
@@ -472,7 +477,9 @@ class AsyncWebCrawler:
|
||||
try:
|
||||
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
|
||||
t1 = time.perf_counter()
|
||||
scrapping_strategy = WebScrapingStrategy()
|
||||
scrapping_strategy = WebScrapingStrategy(
|
||||
logger=self.logger,
|
||||
)
|
||||
# result = await scrapping_strategy.ascrap(
|
||||
result = scrapping_strategy.scrap(
|
||||
url,
|
||||
|
||||
@@ -4,10 +4,10 @@ from typing import List, Tuple, Dict
|
||||
from rank_bm25 import BM25Okapi
|
||||
from time import perf_counter
|
||||
from collections import deque
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag, Comment
|
||||
from .utils import clean_tokens
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
import math
|
||||
from snowballstemmer import stemmer
|
||||
|
||||
|
||||
@@ -358,145 +358,186 @@ class BM25ContentFilter(RelevantContentFilter):
|
||||
return [self.clean_element(tag) for _, _, tag in selected_candidates]
|
||||
|
||||
|
||||
class HeuristicContentFilter(RelevantContentFilter):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
# Weights for different heuristics
|
||||
self.tag_weights = {
|
||||
'article': 10,
|
||||
'main': 8,
|
||||
'section': 5,
|
||||
'div': 3,
|
||||
'p': 2,
|
||||
'pre': 2,
|
||||
'code': 2,
|
||||
'blockquote': 2,
|
||||
'li': 1,
|
||||
'span': 1,
|
||||
}
|
||||
self.max_depth = 5 # Maximum depth from body to consider
|
||||
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
"""Implements heuristic content filtering without relying on a query."""
|
||||
|
||||
|
||||
|
||||
class PruningContentFilter(RelevantContentFilter):
|
||||
def __init__(self, user_query: str = None, min_word_threshold: int = None,
|
||||
threshold_type: str = 'fixed', threshold: float = 0.48):
|
||||
super().__init__(user_query)
|
||||
self.min_word_threshold = min_word_threshold
|
||||
self.threshold_type = threshold_type
|
||||
self.threshold = threshold
|
||||
|
||||
# Add tag importance for dynamic threshold
|
||||
self.tag_importance = {
|
||||
'article': 1.5,
|
||||
'main': 1.4,
|
||||
'section': 1.3,
|
||||
'p': 1.2,
|
||||
'h1': 1.4,
|
||||
'h2': 1.3,
|
||||
'h3': 1.2,
|
||||
'div': 0.7,
|
||||
'span': 0.6
|
||||
}
|
||||
|
||||
# Metric configuration
|
||||
self.metric_config = {
|
||||
'text_density': True,
|
||||
'link_density': True,
|
||||
'tag_weight': True,
|
||||
'class_id_weight': True,
|
||||
'text_length': True,
|
||||
}
|
||||
|
||||
self.metric_weights = {
|
||||
'text_density': 0.4,
|
||||
'link_density': 0.2,
|
||||
'tag_weight': 0.2,
|
||||
'class_id_weight': 0.1,
|
||||
'text_length': 0.1,
|
||||
}
|
||||
|
||||
self.tag_weights = {
|
||||
'div': 0.5,
|
||||
'p': 1.0,
|
||||
'article': 1.5,
|
||||
'section': 1.0,
|
||||
'span': 0.3,
|
||||
'li': 0.5,
|
||||
'ul': 0.5,
|
||||
'ol': 0.5,
|
||||
'h1': 1.2,
|
||||
'h2': 1.1,
|
||||
'h3': 1.0,
|
||||
'h4': 0.9,
|
||||
'h5': 0.8,
|
||||
'h6': 0.7,
|
||||
}
|
||||
|
||||
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
|
||||
if not html or not isinstance(html, str):
|
||||
return []
|
||||
|
||||
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
|
||||
# Ensure there is a body tag
|
||||
if not soup.body:
|
||||
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
|
||||
body = soup.body
|
||||
|
||||
# Remove comments and unwanted tags
|
||||
self._remove_comments(soup)
|
||||
self._remove_unwanted_tags(soup)
|
||||
|
||||
# Prune tree starting from body
|
||||
body = soup.find('body')
|
||||
self._prune_tree(body)
|
||||
|
||||
# Extract remaining content as list of HTML strings
|
||||
content_blocks = []
|
||||
for element in body.children:
|
||||
if isinstance(element, str) or not hasattr(element, 'name'):
|
||||
continue
|
||||
if len(element.get_text(strip=True)) > 0:
|
||||
content_blocks.append(str(element))
|
||||
|
||||
return content_blocks
|
||||
|
||||
# Extract candidate text chunks
|
||||
candidates = self.extract_text_chunks(body)
|
||||
def _remove_comments(self, soup):
|
||||
for element in soup(text=lambda text: isinstance(text, Comment)):
|
||||
element.extract()
|
||||
|
||||
if not candidates:
|
||||
return []
|
||||
def _remove_unwanted_tags(self, soup):
|
||||
for tag in self.excluded_tags:
|
||||
for element in soup.find_all(tag):
|
||||
element.decompose()
|
||||
|
||||
# Score each candidate
|
||||
scored_candidates = []
|
||||
for index, text, tag_type, tag in candidates:
|
||||
score = self.score_element(tag, text)
|
||||
if score > 0:
|
||||
scored_candidates.append((score, index, text, tag))
|
||||
def _prune_tree(self, node):
|
||||
if not node or not hasattr(node, 'name') or node.name is None:
|
||||
return
|
||||
|
||||
# Sort candidates by score and then by document order
|
||||
scored_candidates.sort(key=lambda x: (-x[0], x[1]))
|
||||
text_len = len(node.get_text(strip=True))
|
||||
tag_len = len(node.encode_contents().decode('utf-8'))
|
||||
link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s)
|
||||
|
||||
# Extract the top candidates (e.g., top 5)
|
||||
top_candidates = scored_candidates[:5] # Adjust the number as needed
|
||||
metrics = {
|
||||
'node': node,
|
||||
'tag_name': node.name,
|
||||
'text_len': text_len,
|
||||
'tag_len': tag_len,
|
||||
'link_text_len': link_text_len
|
||||
}
|
||||
|
||||
# Sort the top candidates back to their original document order
|
||||
top_candidates.sort(key=lambda x: x[1])
|
||||
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
|
||||
|
||||
# Clean and return the content
|
||||
return [self.clean_element(tag) for _, _, _, tag in top_candidates]
|
||||
if self.threshold_type == 'fixed':
|
||||
should_remove = score < self.threshold
|
||||
else: # dynamic
|
||||
tag_importance = self.tag_importance.get(node.name, 0.7)
|
||||
text_ratio = text_len / tag_len if tag_len > 0 else 0
|
||||
link_ratio = link_text_len / text_len if text_len > 0 else 1
|
||||
|
||||
threshold = self.threshold # base threshold
|
||||
if tag_importance > 1:
|
||||
threshold *= 0.8
|
||||
if text_ratio > 0.4:
|
||||
threshold *= 0.9
|
||||
if link_ratio > 0.6:
|
||||
threshold *= 1.2
|
||||
|
||||
should_remove = score < threshold
|
||||
|
||||
def score_element(self, tag: Tag, text: str) -> float:
|
||||
"""Compute a score for an element based on heuristics."""
|
||||
if not text or not tag:
|
||||
return 0
|
||||
if should_remove:
|
||||
node.decompose()
|
||||
else:
|
||||
children = [child for child in node.children if hasattr(child, 'name')]
|
||||
for child in children:
|
||||
self._prune_tree(child)
|
||||
|
||||
# Exclude unwanted tags
|
||||
if self.is_excluded(tag):
|
||||
return 0
|
||||
def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
|
||||
if self.min_word_threshold:
|
||||
# Get raw text from metrics node - avoid extra processing
|
||||
text = metrics['node'].get_text(strip=True)
|
||||
word_count = text.count(' ') + 1
|
||||
if word_count < self.min_word_threshold:
|
||||
return -1.0 # Guaranteed removal
|
||||
score = 0.0
|
||||
total_weight = 0.0
|
||||
|
||||
# Text density
|
||||
text_length = len(text.strip())
|
||||
html_length = len(str(tag))
|
||||
text_density = text_length / html_length if html_length > 0 else 0
|
||||
if self.metric_config['text_density']:
|
||||
density = text_len / tag_len if tag_len > 0 else 0
|
||||
score += self.metric_weights['text_density'] * density
|
||||
total_weight += self.metric_weights['text_density']
|
||||
|
||||
# Link density
|
||||
link_text_length = sum(len(a.get_text().strip()) for a in tag.find_all('a'))
|
||||
link_density = link_text_length / text_length if text_length > 0 else 0
|
||||
if self.metric_config['link_density']:
|
||||
density = 1 - (link_text_len / text_len if text_len > 0 else 0)
|
||||
score += self.metric_weights['link_density'] * density
|
||||
total_weight += self.metric_weights['link_density']
|
||||
|
||||
# Tag weight
|
||||
tag_weight = self.tag_weights.get(tag.name, 1)
|
||||
if self.metric_config['tag_weight']:
|
||||
tag_score = self.tag_weights.get(metrics['tag_name'], 0.5)
|
||||
score += self.metric_weights['tag_weight'] * tag_score
|
||||
total_weight += self.metric_weights['tag_weight']
|
||||
|
||||
# Depth factor (prefer elements closer to the body tag)
|
||||
depth = self.get_depth(tag)
|
||||
depth_weight = max(self.max_depth - depth, 1) / self.max_depth
|
||||
if self.metric_config['class_id_weight']:
|
||||
class_score = self._compute_class_id_weight(metrics['node'])
|
||||
score += self.metric_weights['class_id_weight'] * max(0, class_score)
|
||||
total_weight += self.metric_weights['class_id_weight']
|
||||
|
||||
# Compute the final score
|
||||
score = (text_density * tag_weight * depth_weight) / (1 + link_density)
|
||||
if self.metric_config['text_length']:
|
||||
score += self.metric_weights['text_length'] * math.log(text_len + 1)
|
||||
total_weight += self.metric_weights['text_length']
|
||||
|
||||
return score
|
||||
return score / total_weight if total_weight > 0 else 0
|
||||
|
||||
def get_depth(self, tag: Tag) -> int:
|
||||
"""Compute the depth of the tag from the body tag."""
|
||||
depth = 0
|
||||
current = tag
|
||||
while current and current != current.parent and current.name != 'body':
|
||||
current = current.parent
|
||||
depth += 1
|
||||
return depth
|
||||
|
||||
def extract_text_chunks(self, body: Tag) -> List[Tuple[int, str, str, Tag]]:
|
||||
"""
|
||||
Extracts text chunks from the body element while preserving order.
|
||||
Returns list of tuples (index, text, tag_type, tag) for scoring.
|
||||
"""
|
||||
chunks = []
|
||||
index = 0
|
||||
|
||||
def traverse(element):
|
||||
nonlocal index
|
||||
if isinstance(element, NavigableString):
|
||||
return
|
||||
if not isinstance(element, Tag):
|
||||
return
|
||||
if self.is_excluded(element):
|
||||
return
|
||||
# Only consider included tags
|
||||
if element.name in self.included_tags:
|
||||
text = element.get_text(separator=' ', strip=True)
|
||||
if len(text.split()) >= self.min_word_count:
|
||||
tag_type = 'header' if element.name in self.header_tags else 'content'
|
||||
chunks.append((index, text, tag_type, element))
|
||||
index += 1
|
||||
# Do not traverse children of this element to prevent duplication
|
||||
return
|
||||
for child in element.children:
|
||||
traverse(child)
|
||||
|
||||
traverse(body)
|
||||
return chunks
|
||||
|
||||
def is_excluded(self, tag: Tag) -> bool:
|
||||
"""Determine if a tag should be excluded based on heuristics."""
|
||||
if tag.name in self.excluded_tags:
|
||||
return True
|
||||
class_id = ' '.join(filter(None, [
|
||||
' '.join(tag.get('class', [])),
|
||||
tag.get('id', '')
|
||||
]))
|
||||
if self.negative_patterns.search(class_id):
|
||||
return True
|
||||
# Exclude tags with high link density (e.g., navigation menus)
|
||||
text = tag.get_text(separator=' ', strip=True)
|
||||
link_text_length = sum(len(a.get_text(strip=True)) for a in tag.find_all('a'))
|
||||
text_length = len(text)
|
||||
if text_length > 0 and (link_text_length / text_length) > 0.5:
|
||||
return True
|
||||
return False
|
||||
def _compute_class_id_weight(self, node):
|
||||
class_id_score = 0
|
||||
if 'class' in node.attrs:
|
||||
classes = ' '.join(node['class'])
|
||||
if self.negative_patterns.match(classes):
|
||||
class_id_score -= 0.5
|
||||
if 'id' in node.attrs:
|
||||
element_id = node['id']
|
||||
if self.negative_patterns.match(element_id):
|
||||
class_id_score -= 0.5
|
||||
return class_id_score
|
||||
@@ -6,6 +6,7 @@ from concurrent.futures import ThreadPoolExecutor
|
||||
import asyncio, requests, re, os
|
||||
from .config import *
|
||||
from bs4 import element, NavigableString, Comment
|
||||
from bs4 import PageElement, Tag
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
# from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
@@ -80,31 +81,12 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
|
||||
|
||||
|
||||
def _generate_markdown_content(self,
|
||||
cleaned_html: str,
|
||||
html: str,
|
||||
url: str,
|
||||
success: bool,
|
||||
**kwargs) -> Dict[str, Any]:
|
||||
"""Generate markdown content using either new strategy or legacy method.
|
||||
|
||||
Args:
|
||||
cleaned_html: Sanitized HTML content
|
||||
html: Original HTML content
|
||||
url: Base URL of the page
|
||||
success: Whether scraping was successful
|
||||
**kwargs: Additional options including:
|
||||
- markdown_generator: Optional[MarkdownGenerationStrategy]
|
||||
- html2text: Dict[str, Any] options for HTML2Text
|
||||
- content_filter: Optional[RelevantContentFilter]
|
||||
- fit_markdown: bool
|
||||
- fit_markdown_user_query: Optional[str]
|
||||
- fit_markdown_bm25_threshold: float
|
||||
|
||||
Returns:
|
||||
Dict containing markdown content in various formats
|
||||
"""
|
||||
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
|
||||
|
||||
if markdown_generator:
|
||||
@@ -177,13 +159,335 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
'markdown_v2' : markdown_v2
|
||||
}
|
||||
|
||||
def flatten_nested_elements(self, node):
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], Tag) and node.contents[0].name == node.name:
|
||||
return self.flatten_nested_elements(node.contents[0])
|
||||
node.contents = [self.flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
def find_closest_parent_with_useful_text(self, tag, **kwargs):
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
|
||||
def process_image(self, img, url, index, total_images, **kwargs):
|
||||
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
|
||||
if ' ' in u else None}
|
||||
for u in [f"http{p}" for p in s.split("http") if p]]
|
||||
|
||||
# Constants for checks
|
||||
classes_to_check = frozenset(['button', 'icon', 'logo'])
|
||||
tags_to_check = frozenset(['button', 'input'])
|
||||
|
||||
# Pre-fetch commonly used attributes
|
||||
style = img.get('style', '')
|
||||
alt = img.get('alt', '')
|
||||
src = img.get('src', '')
|
||||
data_src = img.get('data-src', '')
|
||||
width = img.get('width')
|
||||
height = img.get('height')
|
||||
parent = img.parent
|
||||
parent_classes = parent.get('class', [])
|
||||
|
||||
# Quick validation checks
|
||||
if ('display:none' in style or
|
||||
parent.name in tags_to_check or
|
||||
any(c in cls for c in parent_classes for cls in classes_to_check) or
|
||||
any(c in src for c in classes_to_check) or
|
||||
any(c in alt for c in classes_to_check)):
|
||||
return None
|
||||
|
||||
# Quick score calculation
|
||||
score = 0
|
||||
if width and width.isdigit():
|
||||
width_val = int(width)
|
||||
score += 1 if width_val > 150 else 0
|
||||
if height and height.isdigit():
|
||||
height_val = int(height)
|
||||
score += 1 if height_val > 150 else 0
|
||||
if alt:
|
||||
score += 1
|
||||
score += index/total_images < 0.5
|
||||
|
||||
image_format = ''
|
||||
if "data:image/" in src:
|
||||
image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
|
||||
else:
|
||||
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
|
||||
|
||||
if image_format in ('jpg', 'png', 'webp', 'avif'):
|
||||
score += 1
|
||||
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
# Use set for deduplication
|
||||
unique_urls = set()
|
||||
image_variants = []
|
||||
|
||||
# Generate a unique group ID for this set of variants
|
||||
group_id = index
|
||||
|
||||
# Base image info template
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
base_info = {
|
||||
'alt': alt,
|
||||
'desc': self.find_closest_parent_with_useful_text(img, **kwargs),
|
||||
'score': score,
|
||||
'type': 'image',
|
||||
'group_id': group_id # Group ID for this set of variants
|
||||
}
|
||||
|
||||
# Inline function for adding variants
|
||||
def add_variant(src, width=None):
|
||||
if src and not src.startswith('data:') and src not in unique_urls:
|
||||
unique_urls.add(src)
|
||||
image_variants.append({**base_info, 'src': src, 'width': width})
|
||||
|
||||
# Process all sources
|
||||
add_variant(src)
|
||||
add_variant(data_src)
|
||||
|
||||
# Handle srcset and data-srcset in one pass
|
||||
for attr in ('srcset', 'data-srcset'):
|
||||
if value := img.get(attr):
|
||||
for source in parse_srcset(value):
|
||||
add_variant(source['url'], source['width'])
|
||||
|
||||
# Quick picture element check
|
||||
if picture := img.find_parent('picture'):
|
||||
for source in picture.find_all('source'):
|
||||
if srcset := source.get('srcset'):
|
||||
for src in parse_srcset(srcset):
|
||||
add_variant(src['url'], src['width'])
|
||||
|
||||
# Framework-specific attributes in one pass
|
||||
for attr, value in img.attrs.items():
|
||||
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
|
||||
add_variant(value)
|
||||
|
||||
return image_variants if image_variants else None
|
||||
|
||||
|
||||
def process_element(self, url, element: PageElement, **kwargs) -> Dict[str, Any]:
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
self._process_element(
|
||||
url,
|
||||
element,
|
||||
media,
|
||||
internal_links_dict,
|
||||
external_links_dict,
|
||||
**kwargs
|
||||
)
|
||||
return {
|
||||
'media': media,
|
||||
'internal_links_dict': internal_links_dict,
|
||||
'external_links_dict': external_links_dict
|
||||
}
|
||||
|
||||
def _process_element(self, url, element: PageElement, media: Dict[str, Any], internal_links_dict: Dict[str, Any], external_links_dict: Dict[str, Any], **kwargs) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
self.remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if self._process_element(url, child, media, internal_links_dict, external_links_dict, **kwargs):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
word_count_threshold = kwargs.get('word_count_threshold', MIN_WORD_THRESHOLD)
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
# soup = BeautifulSoup(html, 'html.parser')
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
body = soup.body
|
||||
|
||||
@@ -195,15 +499,24 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
# Handle tag-based removal first - faster than CSS selection
|
||||
excluded_tags = set(kwargs.get('excluded_tags', []) or [])
|
||||
if excluded_tags:
|
||||
for element in body.find_all(lambda tag: tag.name in excluded_tags):
|
||||
element.extract()
|
||||
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
|
||||
for tag in kwargs.get('excluded_tags', []) or []:
|
||||
for el in body.select(tag):
|
||||
el.decompose()
|
||||
# Handle CSS selector-based removal
|
||||
excluded_selector = kwargs.get('excluded_selector', '')
|
||||
if excluded_selector:
|
||||
is_single_selector = ',' not in excluded_selector and ' ' not in excluded_selector
|
||||
if is_single_selector:
|
||||
while element := body.select_one(excluded_selector):
|
||||
element.extract()
|
||||
else:
|
||||
for element in body.select(excluded_selector):
|
||||
element.extract()
|
||||
|
||||
if css_selector:
|
||||
selected_elements = body.select(css_selector)
|
||||
@@ -222,384 +535,17 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
for el in selected_elements:
|
||||
body.append(el)
|
||||
|
||||
links = {'internal': [], 'external': []}
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
|
||||
# Extract meaningful text for media files from closest parent
|
||||
def find_closest_parent_with_useful_text(tag):
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def process_image_old(img, url, index, total_images):
|
||||
|
||||
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_src = img.get('src','')
|
||||
if "data:image/" in image_src:
|
||||
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
|
||||
else:
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
base_result = {
|
||||
'src': img.get('src', ''),
|
||||
'data-src': img.get('data-src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
sources = []
|
||||
srcset = img.get('srcset', '')
|
||||
if srcset:
|
||||
sources = parse_srcset(srcset)
|
||||
if sources:
|
||||
return [dict(base_result, src=source['url'], width=source['width'])
|
||||
for source in sources]
|
||||
|
||||
return [base_result] # Always return a list
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
|
||||
if ' ' in u else None}
|
||||
for u in [f"http{p}" for p in s.split("http") if p]]
|
||||
|
||||
# Constants for checks
|
||||
classes_to_check = frozenset(['button', 'icon', 'logo'])
|
||||
tags_to_check = frozenset(['button', 'input'])
|
||||
|
||||
# Pre-fetch commonly used attributes
|
||||
style = img.get('style', '')
|
||||
alt = img.get('alt', '')
|
||||
src = img.get('src', '')
|
||||
data_src = img.get('data-src', '')
|
||||
width = img.get('width')
|
||||
height = img.get('height')
|
||||
parent = img.parent
|
||||
parent_classes = parent.get('class', [])
|
||||
|
||||
# Quick validation checks
|
||||
if ('display:none' in style or
|
||||
parent.name in tags_to_check or
|
||||
any(c in cls for c in parent_classes for cls in classes_to_check) or
|
||||
any(c in src for c in classes_to_check) or
|
||||
any(c in alt for c in classes_to_check)):
|
||||
return None
|
||||
|
||||
# Quick score calculation
|
||||
score = 0
|
||||
if width and width.isdigit():
|
||||
width_val = int(width)
|
||||
score += 1 if width_val > 150 else 0
|
||||
if height and height.isdigit():
|
||||
height_val = int(height)
|
||||
score += 1 if height_val > 150 else 0
|
||||
if alt:
|
||||
score += 1
|
||||
score += index/total_images < 0.5
|
||||
|
||||
image_format = ''
|
||||
if "data:image/" in src:
|
||||
image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
|
||||
else:
|
||||
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
|
||||
|
||||
if image_format in ('jpg', 'png', 'webp', 'avif'):
|
||||
score += 1
|
||||
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
# Use set for deduplication
|
||||
unique_urls = set()
|
||||
image_variants = []
|
||||
|
||||
# Generate a unique group ID for this set of variants
|
||||
group_id = index
|
||||
|
||||
# Base image info template
|
||||
base_info = {
|
||||
'alt': alt,
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image',
|
||||
'group_id': group_id # Group ID for this set of variants
|
||||
}
|
||||
|
||||
# Inline function for adding variants
|
||||
def add_variant(src, width=None):
|
||||
if src and not src.startswith('data:') and src not in unique_urls:
|
||||
unique_urls.add(src)
|
||||
image_variants.append({**base_info, 'src': src, 'width': width})
|
||||
|
||||
# Process all sources
|
||||
add_variant(src)
|
||||
add_variant(data_src)
|
||||
|
||||
# Handle srcset and data-srcset in one pass
|
||||
for attr in ('srcset', 'data-srcset'):
|
||||
if value := img.get(attr):
|
||||
for source in parse_srcset(value):
|
||||
add_variant(source['url'], source['width'])
|
||||
|
||||
# Quick picture element check
|
||||
if picture := img.find_parent('picture'):
|
||||
for source in picture.find_all('source'):
|
||||
if srcset := source.get('srcset'):
|
||||
for src in parse_srcset(srcset):
|
||||
add_variant(src['url'], src['width'])
|
||||
|
||||
# Framework-specific attributes in one pass
|
||||
for attr, value in img.attrs.items():
|
||||
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
|
||||
add_variant(value)
|
||||
|
||||
return image_variants if image_variants else None
|
||||
|
||||
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
result_obj = self.process_element(
|
||||
url,
|
||||
body,
|
||||
word_count_threshold = word_count_threshold,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if process_element(child):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
process_element(body)
|
||||
links = {'internal': [], 'external': []}
|
||||
media = result_obj['media']
|
||||
internal_links_dict = result_obj['internal_links_dict']
|
||||
external_links_dict = result_obj['external_links_dict']
|
||||
|
||||
# Update the links dictionary with unique links
|
||||
links['internal'] = list(internal_links_dict.values())
|
||||
@@ -608,23 +554,14 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
# # Process images using ThreadPoolExecutor
|
||||
imgs = body.find_all('img')
|
||||
|
||||
# For test we use for loop instead of thread
|
||||
media['images'] = [
|
||||
img for result in (process_image(img, url, i, len(imgs))
|
||||
img for result in (self.process_image(img, url, i, len(imgs))
|
||||
for i, img in enumerate(imgs))
|
||||
if result is not None
|
||||
for img in result
|
||||
]
|
||||
|
||||
def flatten_nested_elements(node):
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
|
||||
return flatten_nested_elements(node.contents[0])
|
||||
node.contents = [flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
body = flatten_nested_elements(body)
|
||||
body = self.flatten_nested_elements(body)
|
||||
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
|
||||
@@ -11,8 +11,9 @@ LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
|
||||
|
||||
class MarkdownGenerationStrategy(ABC):
|
||||
"""Abstract base class for markdown generation strategies."""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
self.content_filter = content_filter
|
||||
self.options = options or {}
|
||||
|
||||
@abstractmethod
|
||||
def generate_markdown(self,
|
||||
@@ -27,8 +28,8 @@ class MarkdownGenerationStrategy(ABC):
|
||||
|
||||
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
"""Default implementation of markdown generation strategy."""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
|
||||
super().__init__(content_filter)
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
super().__init__(content_filter, options)
|
||||
|
||||
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
|
||||
link_map = {}
|
||||
@@ -74,6 +75,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
cleaned_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
citations: bool = True,
|
||||
**kwargs) -> MarkdownGenerationResult:
|
||||
@@ -82,6 +84,10 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
h = CustomHTML2Text()
|
||||
if html2text_options:
|
||||
h.update_params(**html2text_options)
|
||||
elif options:
|
||||
h.update_params(**options)
|
||||
elif self.options:
|
||||
h.update_params(**self.options)
|
||||
|
||||
# Generate raw markdown
|
||||
raw_markdown = h.handle(cleaned_html)
|
||||
|
||||
crawl4ai/user_agent_generator.py (new file, 263 lines)
@@ -0,0 +1,263 @@
|
||||
import random
|
||||
from typing import Optional, Literal, List, Dict, Tuple
|
||||
import re
|
||||
|
||||
|
||||
class UserAgentGenerator:
|
||||
def __init__(self):
|
||||
# Previous platform definitions remain the same...
|
||||
self.desktop_platforms = {
|
||||
"windows": {
|
||||
"10_64": "(Windows NT 10.0; Win64; x64)",
|
||||
"10_32": "(Windows NT 10.0; WOW64)",
|
||||
},
|
||||
"macos": {
|
||||
"intel": "(Macintosh; Intel Mac OS X 10_15_7)",
|
||||
"newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
|
||||
},
|
||||
"linux": {
|
||||
"generic": "(X11; Linux x86_64)",
|
||||
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
|
||||
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
|
||||
}
|
||||
}
|
||||
|
||||
self.mobile_platforms = {
|
||||
"android": {
|
||||
"samsung": "(Linux; Android 13; SM-S901B)",
|
||||
"pixel": "(Linux; Android 12; Pixel 6)",
|
||||
"oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
|
||||
"xiaomi": "(Linux; Android 12; M2102J20SG)",
|
||||
},
|
||||
"ios": {
|
||||
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
|
||||
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
|
||||
}
|
||||
}
|
||||
|
||||
# Browser Combinations
|
||||
self.browser_combinations = {
|
||||
1: [
|
||||
["chrome"],
|
||||
["firefox"],
|
||||
["safari"],
|
||||
["edge"]
|
||||
],
|
||||
2: [
|
||||
["gecko", "firefox"],
|
||||
["chrome", "safari"],
|
||||
["webkit", "safari"]
|
||||
],
|
||||
3: [
|
||||
["chrome", "safari", "edge"],
|
||||
["webkit", "chrome", "safari"]
|
||||
]
|
||||
}
|
||||
|
||||
# Rendering Engines with versions
|
||||
self.rendering_engines = {
|
||||
"chrome_webkit": "AppleWebKit/537.36",
|
||||
"safari_webkit": "AppleWebKit/605.1.15",
|
||||
"gecko": [ # Added Gecko versions
|
||||
"Gecko/20100101",
|
||||
"Gecko/20100101", # Firefox usually uses this constant version
|
||||
"Gecko/2010010",
|
||||
]
|
||||
}
|
||||
|
||||
# Browser Versions
|
||||
self.chrome_versions = [
|
||||
"Chrome/119.0.6045.199",
|
||||
"Chrome/118.0.5993.117",
|
||||
"Chrome/117.0.5938.149",
|
||||
"Chrome/116.0.5845.187",
|
||||
"Chrome/115.0.5790.171",
|
||||
]
|
||||
|
||||
self.edge_versions = [
|
||||
"Edg/119.0.2151.97",
|
||||
"Edg/118.0.2088.76",
|
||||
"Edg/117.0.2045.47",
|
||||
"Edg/116.0.1938.81",
|
||||
"Edg/115.0.1901.203",
|
||||
]
|
||||
|
||||
self.safari_versions = [
|
||||
"Safari/537.36", # For Chrome-based
|
||||
"Safari/605.1.15",
|
||||
"Safari/604.1",
|
||||
"Safari/602.1",
|
||||
"Safari/601.5.17",
|
||||
]
|
||||
|
||||
# Added Firefox versions
|
||||
self.firefox_versions = [
|
||||
"Firefox/119.0",
|
||||
"Firefox/118.0.2",
|
||||
"Firefox/117.0.1",
|
||||
"Firefox/116.0",
|
||||
"Firefox/115.0.3",
|
||||
"Firefox/114.0.2",
|
||||
"Firefox/113.0.1",
|
||||
"Firefox/112.0",
|
||||
"Firefox/111.0.1",
|
||||
"Firefox/110.0",
|
||||
]
|
||||
|
||||
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
|
||||
"""Get a valid combination of browser versions"""
|
||||
if num_browsers not in self.browser_combinations:
|
||||
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
|
||||
|
||||
combination = random.choice(self.browser_combinations[num_browsers])
|
||||
browser_stack = []
|
||||
|
||||
for browser in combination:
|
||||
if browser == "chrome":
|
||||
browser_stack.append(random.choice(self.chrome_versions))
|
||||
elif browser == "firefox":
|
||||
browser_stack.append(random.choice(self.firefox_versions))
|
||||
elif browser == "safari":
|
||||
browser_stack.append(random.choice(self.safari_versions))
|
||||
elif browser == "edge":
|
||||
browser_stack.append(random.choice(self.edge_versions))
|
||||
elif browser == "gecko":
|
||||
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif browser == "webkit":
|
||||
browser_stack.append(self.rendering_engines["chrome_webkit"])
|
||||
|
||||
return browser_stack
|
||||
|
||||
def generate(self,
|
||||
device_type: Optional[Literal['desktop', 'mobile']] = None,
|
||||
os_type: Optional[str] = None,
|
||||
device_brand: Optional[str] = None,
|
||||
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
|
||||
num_browsers: int = 3) -> str:
|
||||
"""
|
||||
Generate a random user agent with specified constraints.
|
||||
|
||||
Args:
|
||||
device_type: 'desktop' or 'mobile'
|
||||
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
|
||||
device_brand: Specific device brand
|
||||
browser_type: 'chrome', 'edge', 'safari', or 'firefox'
|
||||
num_browsers: Number of browser specifications (1-3)
|
||||
"""
|
||||
# Get platform string
|
||||
platform = self.get_random_platform(device_type, os_type, device_brand)
|
||||
|
||||
# Start with Mozilla
|
||||
components = ["Mozilla/5.0", platform]
|
||||
|
||||
# Add browser stack
|
||||
browser_stack = self.get_browser_stack(num_browsers)
|
||||
|
||||
# Add appropriate legacy token based on browser stack
|
||||
if "Firefox" in str(browser_stack):
|
||||
components.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
|
||||
components.append(self.rendering_engines["chrome_webkit"])
|
||||
components.append("(KHTML, like Gecko)")
|
||||
|
||||
# Add browser versions
|
||||
components.extend(browser_stack)
|
||||
|
||||
return " ".join(components)
|
||||
|
||||
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
|
||||
"""Generate both user agent and matching client hints"""
|
||||
user_agent = self.generate(**kwargs)
|
||||
client_hints = self.generate_client_hints(user_agent)
|
||||
return user_agent, client_hints
|
||||
|
||||
def get_random_platform(self, device_type, os_type, device_brand):
|
||||
"""Helper method to get random platform based on constraints"""
|
||||
platforms = self.desktop_platforms if device_type == 'desktop' else \
|
||||
self.mobile_platforms if device_type == 'mobile' else \
|
||||
{**self.desktop_platforms, **self.mobile_platforms}
|
||||
|
||||
if os_type:
|
||||
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
|
||||
if os_type in platform_group:
|
||||
platforms = {os_type: platform_group[os_type]}
|
||||
break
|
||||
|
||||
os_key = random.choice(list(platforms.keys()))
|
||||
if device_brand and device_brand in platforms[os_key]:
|
||||
return platforms[os_key][device_brand]
|
||||
return random.choice(list(platforms[os_key].values()))
|
||||
|
||||
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
|
||||
"""Parse a user agent string to extract browser and version information"""
|
||||
browsers = {
|
||||
'chrome': r'Chrome/(\d+)',
|
||||
'edge': r'Edg/(\d+)',
|
||||
'safari': r'Version/(\d+)',
|
||||
'firefox': r'Firefox/(\d+)'
|
||||
}
|
||||
|
||||
result = {}
|
||||
for browser, pattern in browsers.items():
|
||||
match = re.search(pattern, user_agent)
|
||||
if match:
|
||||
result[browser] = match.group(1)
|
||||
|
||||
return result
|
||||
|
||||
def generate_client_hints(self, user_agent: str) -> str:
|
||||
"""Generate Sec-CH-UA header value based on user agent string"""
|
||||
browsers = self.parse_user_agent(user_agent)
|
||||
|
||||
# Client hints components
|
||||
hints = []
|
||||
|
||||
# Handle different browser combinations
|
||||
if 'chrome' in browsers:
|
||||
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
if 'edge' in browsers:
|
||||
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
|
||||
else:
|
||||
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
|
||||
|
||||
elif 'firefox' in browsers:
|
||||
# Firefox doesn't typically send Sec-CH-UA
|
||||
return '""'
|
||||
|
||||
elif 'safari' in browsers:
|
||||
# Safari's format for client hints
|
||||
hints.append(f'"Safari";v="{browsers["safari"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
return ', '.join(hints)
|
||||
|
||||
# Example usage:
|
||||
if __name__ == "__main__":
|
||||
generator = UserAgentGenerator()
|
||||
print(generator.generate())
|
||||
|
||||
print("\nSingle browser (Chrome):")
|
||||
print(generator.generate(num_browsers=1, browser_type='chrome'))
|
||||
|
||||
print("\nTwo browsers (Gecko/Firefox):")
|
||||
print(generator.generate(num_browsers=2))
|
||||
|
||||
print("\nThree browsers (Chrome/Safari/Edge):")
|
||||
print(generator.generate(num_browsers=3))
|
||||
|
||||
print("\nFirefox on Linux:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='linux',
|
||||
browser_type='firefox',
|
||||
num_browsers=2
|
||||
))
|
||||
|
||||
print("\nChrome/Safari/Edge on Windows:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='windows',
|
||||
num_browsers=3
|
||||
))
|
||||
@@ -22,7 +22,7 @@ import textwrap
|
||||
|
||||
from .html2text import HTML2Text
|
||||
class CustomHTML2Text(HTML2Text):
|
||||
def __init__(self, *args, **kwargs):
|
||||
def __init__(self, *args, handle_code_in_pre=False, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
@@ -30,6 +30,7 @@ class CustomHTML2Text(HTML2Text):
|
||||
self.current_preserved_tag = None
|
||||
self.preserved_content = []
|
||||
self.preserve_depth = 0
|
||||
self.handle_code_in_pre = handle_code_in_pre
|
||||
|
||||
# Configuration options
|
||||
self.skip_internal_links = False
|
||||
@@ -50,6 +51,8 @@ class CustomHTML2Text(HTML2Text):
|
||||
for key, value in kwargs.items():
|
||||
if key == 'preserve_tags':
|
||||
self.preserve_tags = set(value)
|
||||
elif key == 'handle_code_in_pre':
|
||||
self.handle_code_in_pre = value
|
||||
else:
|
||||
setattr(self, key, value)
|
||||
|
||||
@@ -88,13 +91,21 @@ class CustomHTML2Text(HTML2Text):
|
||||
# Handle pre tags
|
||||
if tag == 'pre':
|
||||
if start:
|
||||
self.o('```\n')
|
||||
self.o('```\n') # Markdown code block start
|
||||
self.inside_pre = True
|
||||
else:
|
||||
self.o('\n```')
|
||||
self.o('\n```\n') # Markdown code block end
|
||||
self.inside_pre = False
|
||||
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# pass
|
||||
elif tag == 'code':
|
||||
if self.inside_pre and not self.handle_code_in_pre:
|
||||
# Ignore code tags inside pre blocks if handle_code_in_pre is False
|
||||
return
|
||||
if start:
|
||||
self.o('`') # Markdown inline code start
|
||||
self.inside_code = True
|
||||
else:
|
||||
self.o('`') # Markdown inline code end
|
||||
self.inside_code = False
|
||||
else:
|
||||
super().handle_tag(tag, attrs, start)
|
||||
|
||||
@@ -103,7 +114,39 @@ class CustomHTML2Text(HTML2Text):
|
||||
if self.preserve_depth > 0:
|
||||
self.preserved_content.append(data)
|
||||
return
|
||||
|
||||
if self.inside_pre:
|
||||
# Output the raw content for pre blocks, including content inside code tags
|
||||
self.o(data) # Directly output the data as-is (preserve newlines)
|
||||
return
|
||||
if self.inside_code:
|
||||
# Inline code: no newlines allowed
|
||||
self.o(data.replace('\n', ' '))
|
||||
return
|
||||
|
||||
# Default behavior for other tags
|
||||
super().handle_data(data, entity_char)
|
||||
|
||||
|
||||
# # Handle pre tags
|
||||
# if tag == 'pre':
|
||||
# if start:
|
||||
# self.o('```\n')
|
||||
# self.inside_pre = True
|
||||
# else:
|
||||
# self.o('\n```')
|
||||
# self.inside_pre = False
|
||||
# # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# # pass
|
||||
# else:
|
||||
# super().handle_tag(tag, attrs, start)
|
||||
|
||||
# def handle_data(self, data, entity_char=False):
|
||||
# """Override handle_data to capture content within preserved tags."""
|
||||
# if self.preserve_depth > 0:
|
||||
# self.preserved_content.append(data)
|
||||
# return
|
||||
# super().handle_data(data, entity_char)
|
||||
class InvalidCSSSelectorError(Exception):
|
||||
pass
|
||||
|
||||
|
||||
crawl4ai/utils.scraping.py (new file, 0 lines)
@@ -15,7 +15,7 @@ from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
|
||||
from crawl4ai.extraction_strategy import (
|
||||
JsonCssExtractionStrategy,
|
||||
LLMExtractionStrategy,
|
||||
@@ -128,7 +128,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.schema(),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
@@ -466,7 +466,8 @@ async def speed_comparison():
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=0,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
verbose=False,
|
||||
@@ -489,7 +490,8 @@ async def speed_comparison():
|
||||
word_count_threshold=0,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
verbose=False,
|
||||
)
|
||||
@@ -545,19 +547,53 @@ async def generate_knowledge_graph():
|
||||
f.write(result.extracted_content)
|
||||
|
||||
async def fit_markdown_remove_overlay():
|
||||
async with AsyncWebCrawler(headless = False) as crawler:
|
||||
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
headless=True, # Set to False to see what is happening
|
||||
verbose=True,
|
||||
user_agent_mode="random",
|
||||
user_agent_generator_config={
|
||||
"device_type": "mobile",
|
||||
"os_type": "android"
|
||||
},
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
url='https://www.kidocode.com/degrees/technology',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
word_count_threshold = 10,
|
||||
remove_overlay_elements=True,
|
||||
screenshot = True
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
),
|
||||
options={
|
||||
"ignore_links": True
|
||||
}
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
|
||||
# options={
|
||||
# "ignore_links": True
|
||||
# }
|
||||
# ),
|
||||
)
|
||||
# Save markdown to file
|
||||
with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
|
||||
f.write(result.fit_markdown)
|
||||
|
||||
|
||||
if result.success:
|
||||
print(len(result.markdown_v2.raw_markdown))
|
||||
print(len(result.markdown_v2.markdown_with_citations))
|
||||
print(len(result.markdown_v2.fit_markdown))
|
||||
|
||||
# Save clean html
|
||||
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
|
||||
f.write(result.cleaned_html)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.raw_markdown)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
|
||||
f.write(result.markdown_v2.markdown_with_citations)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.fit_markdown)
|
||||
|
||||
print("Done")
|
||||
|
||||
|
||||
|
||||
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra
|
||||
|
||||
## Relevance Content Filter
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
|
||||
|
||||
## Pruning Content Filter
|
||||
|
||||
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
async def filter_content(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes
|
||||
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
|
||||
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
|
||||
- For fixed threshold: Nodes scoring below this value are removed
|
||||
- For dynamic threshold: This value is adjusted based on node properties
|
||||
|
||||
### How It Works
|
||||
|
||||
The pruning algorithm evaluates each node using multiple metrics:
|
||||
- Text density: Ratio of actual text to overall node content
|
||||
- Link density: Proportion of text within links
|
||||
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
|
||||
- Content quality: Metrics like text length and structural importance
|
||||
|
||||
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
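
A minimal sketch of how such a composite score can be combined, modeled on the scoring code shown earlier in this changeset (the exact weights and terms used by `PruningContentFilter` may differ):

```python
import math

def node_score(text_density: float, link_density: float,
               tag_weight: float, depth_weight: float, text_len: int) -> float:
    # Dense text in important, shallow tags scores high; link-heavy nodes
    # (menus, footers) are penalized by the link-density denominator.
    score = (text_density * tag_weight * depth_weight) / (1 + link_density)
    score += math.log(text_len + 1)  # mild boost for longer text blocks
    return score

# A text-rich paragraph scores well; a link-heavy nav block is pruned.
print(node_score(0.8, 0.05, 1.0, 0.9, 400))
print(node_score(0.3, 0.9, 0.5, 0.7, 40))
```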
|
||||
|
||||
The algorithm is particularly effective for:
|
||||
- Removing boilerplate content
|
||||
- Eliminating navigation menus and sidebars
|
||||
- Preserving main article content
|
||||
- Maintaining document structure while removing noise
|
||||
|
||||
|
||||
## BM25 Algorithm
|
||||
|
||||
|
||||
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra
|
||||
|
||||
## Relevance Content Filter
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
|
||||
|
||||
## Pruning Content Filter
|
||||
|
||||
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
async def filter_content(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes
|
||||
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
|
||||
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
|
||||
- For fixed threshold: Nodes scoring below this value are removed
|
||||
- For dynamic threshold: This value is adjusted based on node properties
|
||||
|
||||
### How It Works
|
||||
|
||||
The pruning algorithm evaluates each node using multiple metrics:
|
||||
- Text density: Ratio of actual text to overall node content
|
||||
- Link density: Proportion of text within links
|
||||
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
|
||||
- Content quality: Metrics like text length and structural importance
|
||||
|
||||
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
|
||||
|
||||
The algorithm is particularly effective for:
|
||||
- Removing boilerplate content
|
||||
- Eliminating navigation menus and sidebars
|
||||
- Preserving main article content
|
||||
- Maintaining document structure while removing noise
|
||||
|
||||
|
||||
## BM25 Algorithm
|
||||
|
||||
@@ -21,7 +73,7 @@ from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
async def filter_content(url, query=None):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = BM25ContentFilter(user_query=query)
|
||||
result = await crawler.arun(url=url, content_filter=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
if result.success:
|
||||
print(f"Filtered Content (JSON):\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
|
||||
@@ -71,7 +123,7 @@ class MyCustomFilter(RelevantContentFilter):
|
||||
async def custom_filter_demo(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
custom_filter = MyCustomFilter()
|
||||
result = await crawler.arun(url, content_filter=custom_filter)
|
||||
result = await crawler.arun(url, extraction_strategy=custom_filter)
|
||||
if result.success:
|
||||
print(result.extracted_content)
|
||||
|
||||
|
||||
docs/md_v2/blog/index.md (new file, 37 lines)
@@ -0,0 +1,37 @@
|
||||
# Crawl4AI Blog

Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.

## Latest Release

### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
*December 8, 2024*

This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.

[Read full release notes →](releases/0.4.1.md)

---

### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
*December 1, 2024*

Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.

[Read full release notes →](releases/0.4.0.md)

## Project History

Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.

## Categories

- [Technical Deep Dives](/blog/technical) - Coming soon
- [Tutorials & Guides](/blog/tutorials) - Coming soon
- [Community Updates](/blog/community) - Coming soon

## Stay Updated

- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
- Join our community discussions on GitHub

docs/md_v2/blog/releases/0.4.0.md (new file, 62 lines)
@@ -0,0 +1,62 @@

# Release Summary for Version 0.4.0 (December 1, 2024)

## Overview
The 0.4.0 release introduces significant improvements to content filtering, multi-threaded environment handling, user-agent generation, and test coverage. Key highlights include the introduction of the PruningContentFilter, designed to automatically identify and extract the most valuable parts of an HTML document, as well as enhancements to the BM25ContentFilter to extend its versatility and effectiveness.

## Major Features and Enhancements

### 1. PruningContentFilter
- Introduced a new unsupervised content filtering strategy that scores and prunes less relevant nodes in an HTML document based on metrics like text and link density.
- Focuses on retaining the most valuable parts of the content, making it highly effective for extracting relevant information from complex web pages.
- Fully documented with updated README and expanded user guides.

### 2. User-Agent Generator
- Added a user-agent generator utility that resolves compatibility issues and supports customizable user-agent strings.
- By default, the generator randomizes user agents for each request, adding diversity, but users can customize it for tailored scenarios.

### 3. Enhanced Thread Safety
- Improved handling of multi-threaded environments by adding better thread locks for parallel processing, ensuring consistency and stability when running multiple threads.

### 4. Extended Content Filtering Strategies
- Users now have access to both the PruningContentFilter for unsupervised extraction and the BM25ContentFilter for supervised filtering based on user queries (a short sketch follows below).
- Enhanced BM25ContentFilter with improved capabilities to process page titles, meta tags, and descriptions, allowing for more effective classification and clustering of text chunks.

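As a quick illustration of choosing between the two strategies (argument values are examples; the call pattern follows the updated content-filtering docs above):

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter

async def fit_markdown_demo(url: str, query: str = None):
    # Supervised filtering when a query is given, unsupervised structural pruning otherwise.
    content_filter = BM25ContentFilter(user_query=query) if query else PruningContentFilter(threshold=0.48)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(result.fit_markdown)
```
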
### 5. Documentation Updates
- Updated examples and tutorials to promote the use of the PruningContentFilter alongside the BM25ContentFilter, providing clear instructions for selecting the appropriate filter for each use case.

### 6. Unit Test Enhancements
- Added unit tests for PruningContentFilter to ensure accuracy and reliability.
- Enhanced BM25ContentFilter tests to cover additional edge cases and performance metrics, particularly for malformed HTML inputs.

## Revised Change Logs for Version 0.4.0

### PruningContentFilter (Dec 01, 2024)
- Introduced the PruningContentFilter to optimize content extraction by pruning less relevant HTML nodes.
- **Affected Files:**
  - **crawl4ai/content_filter_strategy.py**: Added a scoring-based pruning algorithm.
  - **README.md**: Updated to include PruningContentFilter usage.
  - **docs/md_v2/basic/content_filtering.md**: Expanded user documentation, detailing the use and benefits of PruningContentFilter.

### Unit Tests for PruningContentFilter (Dec 01, 2024)
- Added comprehensive unit tests for PruningContentFilter to ensure correctness and efficiency.
- **Affected Files:**
  - **tests/async/test_content_filter_prune.py**: Created tests covering different pruning scenarios to ensure stability and correctness.

### Enhanced BM25ContentFilter Tests (Dec 01, 2024)
- Expanded tests to cover additional extraction scenarios and performance metrics, improving robustness.
- **Affected Files:**
  - **tests/async/test_content_filter_bm25.py**: Added tests for edge cases, including malformed HTML inputs.

### Documentation and Example Updates (Dec 01, 2024)
- Revised examples to illustrate the use of PruningContentFilter alongside existing content filtering methods.
- **Affected Files:**
  - **docs/examples/quickstart_async.py**: Enhanced example clarity and usability for new users.

## Experimental Features
- The PruningContentFilter is still under experimental development, and we continue to gather feedback for further refinements.

## Conclusion
This release significantly enhances the content extraction capabilities of Crawl4AI with the introduction of the PruningContentFilter, improved supervised filtering with BM25ContentFilter, and robust multi-threaded handling. Additionally, the user-agent generator provides much-needed versatility, resolving compatibility issues faced by many users.

Users are encouraged to experiment with the new content filtering methods to determine which best suits their needs.

docs/md_v2/blog/releases/0.4.1.md (new file, 145 lines)
@@ -0,0 +1,145 @@

# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!

_This post was generated with the help of ChatGPT; take everything with a grain of salt. 🧂_

Hi everyone,

I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.

---

### Handling Lazy Loading Better (Images Included)

One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.

Here’s how to enable it:

```python
await crawler.crawl(
    url="https://example.com",
    wait_for_images=True  # Add this argument to ensure images are fully loaded
)
```

What this does is:
1. Waits for the page to reach a "network idle" state.
2. Ensures all images on the page have been completely loaded.

This single change handles the majority of lazy-loading cases you’re likely to encounter.

---

### Text-Only Mode (Fast, Lightweight Crawling)

Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.

Here’s how to turn it on:

```python
crawler = AsyncPlaywrightCrawlerStrategy(
    text_only=True  # Set this to True to enable text-only crawling
)
```

When `text_only=True`, the crawler automatically:
- Disables GPU processing.
- Blocks image and JavaScript resources.
- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`; see the snippet below).

If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.

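If the smaller default viewport gets in the way, you can keep text-only mode and override the dimensions (the values here are just examples):

```python
crawler = AsyncPlaywrightCrawlerStrategy(
    text_only=True,
    viewport_width=1024,   # override the 800x600 text-only default
    viewport_height=768,
)
```
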
---

### Adjusting the Viewport Dynamically

Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.

Here’s how it works:
1. The crawler calculates the page’s width and height after it loads.
2. It adjusts the viewport to fit the content dimensions.
3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.

To enable this, use:

```python
await crawler.crawl(
    url="https://example.com",
    adjust_viewport_to_content=True  # Dynamically adjusts the viewport
)
```

This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.

---

### Simulating Full-Page Scrolling

Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.

Here’s an example:

```python
await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,  # Enables scrolling
    scroll_delay=0.2      # Waits 200ms between scrolls (optional)
)
```

What happens here:
1. The crawler scrolls down in increments, waiting for content to load after each scroll.
2. It stops when no new content appears (i.e., dynamic elements stop loading).
3. It scrolls back to the top before finishing (if necessary).

If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.

---

### Reusing Browser Sessions (Save Time on Setup)

By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.

I added a method called `create_session` for this:

```python
session_id = await crawler.create_session()

# Use the same session for multiple crawls
await crawler.crawl(
    url="https://example.com/page1",
    session_id=session_id  # Reuse the session
)
await crawler.crawl(
    url="https://example.com/page2",
    session_id=session_id
)
```

This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.

---

### Other Updates

Here are a few smaller updates I’ve made (a short snippet combining a couple of them follows the list):
- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
- **Logging**: Improved logs to make debugging easier.
- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).

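As a rough sketch of these two options together (where exactly `delay_before_return_html` is passed is an assumption based on the changelog, so treat the parameter placement as illustrative):

```python
crawler = AsyncPlaywrightCrawlerStrategy(light_mode=True)  # trims background features

await crawler.crawl(
    url="https://example.com",
    delay_before_return_html=0.1,  # wait ~100ms before grabbing the final HTML
)
```
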
---

### How to Get the Update

You can install or upgrade to version `0.4.1` like this:

```bash
pip install crawl4ai --upgrade
```

As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!

Enjoy the new features, and happy crawling! 🕷️

---

mkdocs.yml (14 changed lines)
@@ -10,7 +10,11 @@ nav:
     - 'Installation': 'basic/installation.md'
     - 'Docker Deplotment': 'basic/docker-deploymeny.md'
     - 'Quick Start': 'basic/quickstart.md'
 
+  - Changelog & Blog:
+    - 'Blog Home': 'blog/index.md'
+    - 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
+    - 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
 
   - Basic:
     - 'Simple Crawling': 'basic/simple-crawling.md'
     - 'Output Formats': 'basic/output-formats.md'
@@ -50,12 +54,12 @@ nav:
     - '5. Dynamic Content': 'tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md'
     - '6. Magic Mode': 'tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md'
     - '7. Content Cleaning': 'tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md'
-    - '8. Media Handling': 'tutorial/episode_08_Media_Handling:_Images,_Videos,_and_Audio.md'
+    - '8. Media Handling': 'tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md'
     - '9. Link Analysis': 'tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md'
     - '10. User Simulation': 'tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md'
-    - '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies:_JSON_CSS.md'
-    - '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies:_LLM.md'
-    - '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies:_Cosine.md'
+    - '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md'
+    - '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies_LLM.md'
+    - '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies_Cosine.md'
     - '12. Session Crawling': 'tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md'
     - '13. Text Chunking': 'tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md'
     - '14. Custom Workflows': 'tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md'

tests/async/test_content_filter_prune.py (new file, 159 lines)
@@ -0,0 +1,159 @@
import os, sys
import pytest
from bs4 import BeautifulSoup

parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from crawl4ai.content_filter_strategy import PruningContentFilter

@pytest.fixture
def basic_html():
    return """
    <html>
        <body>
            <article>
                <h1>Main Article</h1>
                <p>This is a high-quality paragraph with substantial text content. It contains enough words to pass the threshold and has good text density without too many links. This kind of content should survive the pruning process.</p>
                <div class="sidebar">Low quality sidebar content</div>
                <div class="social-share">Share buttons</div>
            </article>
        </body>
    </html>
    """

@pytest.fixture
def link_heavy_html():
    return """
    <html>
        <body>
            <div class="content">
                <p>Good content paragraph that should remain.</p>
                <div class="links">
                    <a href="#">Link 1</a>
                    <a href="#">Link 2</a>
                    <a href="#">Link 3</a>
                    <a href="#">Link 4</a>
                </div>
            </div>
        </body>
    </html>
    """

@pytest.fixture
def mixed_content_html():
    return """
    <html>
        <body>
            <article>
                <h1>Article Title</h1>
                <p class="summary">Short summary.</p>
                <div class="content">
                    <p>Long high-quality paragraph with substantial content that should definitely survive the pruning process. This content has good text density and proper formatting which makes it valuable for retention.</p>
                </div>
                <div class="comments">
                    <p>Short comment 1</p>
                    <p>Short comment 2</p>
                </div>
            </article>
        </body>
    </html>
    """

class TestPruningContentFilter:
    def test_basic_pruning(self, basic_html):
        """Test basic content pruning functionality"""
        filter = PruningContentFilter(min_word_threshold=5)
        contents = filter.filter_content(basic_html)

        combined_content = ' '.join(contents).lower()
        assert "high-quality paragraph" in combined_content
        assert "sidebar content" not in combined_content
        assert "share buttons" not in combined_content

    def test_min_word_threshold(self, mixed_content_html):
        """Test minimum word threshold filtering"""
        filter = PruningContentFilter(min_word_threshold=10)
        contents = filter.filter_content(mixed_content_html)

        combined_content = ' '.join(contents).lower()
        assert "short summary" not in combined_content
        assert "long high-quality paragraph" in combined_content
        assert "short comment" not in combined_content

    def test_threshold_types(self, basic_html):
        """Test fixed vs dynamic thresholds"""
        fixed_filter = PruningContentFilter(threshold_type='fixed', threshold=0.48)
        dynamic_filter = PruningContentFilter(threshold_type='dynamic', threshold=0.45)

        fixed_contents = fixed_filter.filter_content(basic_html)
        dynamic_contents = dynamic_filter.filter_content(basic_html)

        assert len(fixed_contents) != len(dynamic_contents), \
            "Fixed and dynamic thresholds should yield different results"

    def test_link_density_impact(self, link_heavy_html):
        """Test handling of link-heavy content"""
        filter = PruningContentFilter(threshold_type='dynamic')
        contents = filter.filter_content(link_heavy_html)

        combined_content = ' '.join(contents).lower()
        assert "good content paragraph" in combined_content
        assert len([c for c in contents if 'href' in c]) < 2, \
            "Should prune link-heavy sections"

    def test_tag_importance(self, mixed_content_html):
        """Test tag importance in scoring"""
        filter = PruningContentFilter(threshold_type='dynamic')
        contents = filter.filter_content(mixed_content_html)

        has_article = any('article' in c.lower() for c in contents)
        has_h1 = any('h1' in c.lower() for c in contents)
        assert has_article or has_h1, "Should retain important tags"

    def test_empty_input(self):
        """Test handling of empty input"""
        filter = PruningContentFilter()
        assert filter.filter_content("") == []
        assert filter.filter_content(None) == []

    def test_malformed_html(self):
        """Test handling of malformed HTML"""
        malformed_html = "<div>Unclosed div<p>Nested<span>content</div>"
        filter = PruningContentFilter()
        contents = filter.filter_content(malformed_html)
        assert isinstance(contents, list)

    def test_performance(self, basic_html):
        """Test performance with timer"""
        filter = PruningContentFilter()

        import time
        start = time.perf_counter()
        filter.filter_content(basic_html)
        duration = time.perf_counter() - start

        # Extra strict on performance since you mentioned milliseconds matter
        assert duration < 0.1, f"Processing took too long: {duration:.3f} seconds"

    @pytest.mark.parametrize("threshold,expected_count", [
        (0.3, 4),   # Very lenient
        (0.48, 2),  # Default
        (0.7, 1),   # Very strict
    ])
    def test_threshold_levels(self, mixed_content_html, threshold, expected_count):
        """Test different threshold levels"""
        filter = PruningContentFilter(threshold_type='fixed', threshold=threshold)
        contents = filter.filter_content(mixed_content_html)
        assert len(contents) <= expected_count, \
            f"Expected {expected_count} or fewer elements with threshold {threshold}"

    def test_consistent_output(self, basic_html):
        """Test output consistency across multiple runs"""
        filter = PruningContentFilter()
        first_run = filter.filter_content(basic_html)
        second_run = filter.filter_content(basic_html)
        assert first_run == second_run, "Output should be consistent"


if __name__ == "__main__":
    pytest.main([__file__])