Merge branch 'main' of https://github.com/unclecode/crawl4ai
This commit is contained in:
86
CHANGELOG.md
86
CHANGELOG.md
@@ -1,5 +1,91 @@
|
||||
# Changelog
|
||||
|
||||
## [0.4.1] December 8, 2024
|
||||
|
||||
### **File: `crawl4ai/async_crawler_strategy.py`**
|
||||
|
||||
#### **New Parameters and Attributes Added**
|
||||
- **`text_only` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
|
||||
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
|
||||
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
|
||||
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
|
||||
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
|
||||
|
||||
#### **Browser Context Adjustments**
|
||||
- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
|
||||
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
|
||||
|
||||
#### **Dynamic Content Handling**
|
||||
- **Full Page Scan Feature**:
|
||||
- Scrolls through the entire page while dynamically detecting content changes.
|
||||
- Ensures scrolling stops when no new dynamic content is loaded.
|
||||
|
||||
#### **Session Management**
|
||||
- Added **`create_session`** method:
|
||||
- Creates a new browser session and assigns a unique ID.
|
||||
- Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
|
||||
|
||||
#### **Improved Content Loading and Adjustment**
|
||||
- **`adjust_viewport_to_content`**:
|
||||
- Automatically adjusts viewport to match content dimensions.
|
||||
- Includes scaling via Chrome DevTools Protocol (CDP).
|
||||
- Enhanced content loading:
|
||||
- Waits for images to load and ensures network activity is idle before proceeding.
|
||||
|
||||
#### **Error Handling and Logging**
|
||||
- Improved error handling and detailed logging for:
|
||||
- Viewport adjustment (`adjust_viewport_to_content`).
|
||||
- Full page scanning (`scan_full_page`).
|
||||
- Dynamic content loading.
|
||||
|
||||
#### **Refactoring and Cleanup**
|
||||
- Removed hardcoded viewport dimensions in multiple places, replaced with dynamic values (`self.viewport_width`, `self.viewport_height`).
|
||||
- Removed commented-out and unused code for better readability.
|
||||
- Added default value for `delay_before_return_html` parameter.
|
||||
|
||||
#### **Optimizations**
|
||||
- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
|
||||
- Improved compatibility for different browser types (`chrome`, `firefox`, `webkit`).
|
||||
|
||||
---
|
||||
|
||||
### **File: `docs/examples/quickstart_async.py`**
|
||||
|
||||
#### **Schema Adjustment**
|
||||
- Changed schema reference for `LLMExtractionStrategy`:
|
||||
- **Old**: `OpenAIModelFee.schema()`
|
||||
- **New**: `OpenAIModelFee.model_json_schema()`
|
||||
- This likely ensures better compatibility with the `OpenAIModelFee` class and its JSON schema.
|
||||
|
||||
#### **Documentation Comments Updated**
|
||||
- Improved extraction instruction for schema-based LLM strategies.
|
||||
|
||||
---
|
||||
|
||||
### **New Features Added**
|
||||
1. **Text-Only Mode**:
|
||||
- Focuses on minimal resource usage by disabling non-essential browser features.
|
||||
2. **Light Mode**:
|
||||
- Optimizes browser for performance by disabling background tasks and unnecessary services.
|
||||
3. **Full Page Scanning**:
|
||||
- Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
|
||||
4. **Dynamic Viewport Adjustment**:
|
||||
- Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
|
||||
5. **Session Management**:
|
||||
- Simplifies session handling with better support for persistent and non-persistent contexts.
|
||||
|
||||
---
|
||||
|
||||
### **Bug Fixes**
|
||||
- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
|
||||
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## [0.3.75] December 1, 2024
|
||||
|
||||
### PruningContentFilter
|
||||
|
||||
23
README.md
23
README.md
@@ -11,10 +11,9 @@
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.4.1](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)
|
||||
|
||||
[✨ Check out latest update v0.4.0](#-recent-updates)
|
||||
🎉 **Version 0.4.x is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
@@ -80,6 +79,7 @@ if __name__ == "__main__":
|
||||
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
|
||||
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
|
||||
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
|
||||
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
|
||||
|
||||
</details>
|
||||
|
||||
@@ -95,6 +95,8 @@ if __name__ == "__main__":
|
||||
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
|
||||
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
|
||||
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
|
||||
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
|
||||
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
|
||||
|
||||
</details>
|
||||
|
||||
@@ -121,8 +123,6 @@ if __name__ == "__main__":
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
## Try it Now!
|
||||
|
||||
✨ Play around with this [](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
@@ -626,13 +626,14 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
|
||||
- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
|
||||
- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
|
||||
- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
|
||||
- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.
|
||||
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
|
||||
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
|
||||
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
|
||||
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
|
||||
- 🧑💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
|
||||
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
|
||||
|
||||
Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).
|
||||
Read the full details of this release in our [0.4.1 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.1.md).
|
||||
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
|
||||
@@ -1,2 +1,2 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.4.0"
|
||||
__version__ = "0.4.1"
|
||||
|
||||
@@ -220,8 +220,22 @@ class AsyncCrawlerStrategy(ABC):
|
||||
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, logger = None, **kwargs):
|
||||
self.text_only = kwargs.get("text_only", False)
|
||||
self.light_mode = kwargs.get("light_mode", False)
|
||||
self.logger = logger
|
||||
self.use_cached_html = use_cached_html
|
||||
self.viewport_width = kwargs.get("viewport_width", 800 if self.text_only else 1920)
|
||||
self.viewport_height = kwargs.get("viewport_height", 600 if self.text_only else 1080)
|
||||
|
||||
if self.text_only:
|
||||
self.extra_args = kwargs.get("extra_args", []) + [
|
||||
'--disable-images',
|
||||
'--disable-javascript',
|
||||
'--disable-gpu',
|
||||
'--disable-software-rasterizer',
|
||||
'--disable-dev-shm-usage'
|
||||
]
|
||||
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
|
||||
@@ -300,7 +314,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
# If no default context exists, create one
|
||||
self.default_context = await self.browser.new_context(
|
||||
viewport={"width": 1920, "height": 1080}
|
||||
# viewport={"width": 1920, "height": 1080}
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height}
|
||||
)
|
||||
|
||||
# Set up the default context
|
||||
@@ -334,10 +349,40 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
|
||||
"--window-position=400,0",
|
||||
f"--window-size={self.viewport_width},{self.viewport_height}",
|
||||
]
|
||||
}
|
||||
|
||||
if self.light_mode:
|
||||
browser_args["args"].extend([
|
||||
# "--disable-background-networking",
|
||||
"--disable-background-timer-throttling",
|
||||
"--disable-backgrounding-occluded-windows",
|
||||
"--disable-breakpad",
|
||||
"--disable-client-side-phishing-detection",
|
||||
"--disable-component-extensions-with-background-pages",
|
||||
"--disable-default-apps",
|
||||
"--disable-extensions",
|
||||
"--disable-features=TranslateUI",
|
||||
"--disable-hang-monitor",
|
||||
"--disable-ipc-flooding-protection",
|
||||
"--disable-popup-blocking",
|
||||
"--disable-prompt-on-repost",
|
||||
"--disable-sync",
|
||||
"--force-color-profile=srgb",
|
||||
"--metrics-recording-only",
|
||||
"--no-first-run",
|
||||
"--password-store=basic",
|
||||
"--use-mock-keychain"
|
||||
])
|
||||
|
||||
if self.text_only:
|
||||
browser_args["args"].extend([
|
||||
'--blink-settings=imagesEnabled=false',
|
||||
'--disable-remote-fonts'
|
||||
])
|
||||
|
||||
# Add channel if specified (try Chrome first)
|
||||
if self.chrome_channel:
|
||||
browser_args["channel"] = self.chrome_channel
|
||||
@@ -367,6 +412,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
if "viewport" not in browser_args:
|
||||
browser_args["viewport"] = {"width": self.viewport_width, "height": self.viewport_height}
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
if self.use_persistent_context and self.user_data_dir:
|
||||
@@ -576,6 +623,38 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
async def create_session(self, **kwargs) -> str:
|
||||
"""Creates a new browser session and returns its ID."""
|
||||
if not self.browser:
|
||||
await self.start()
|
||||
|
||||
session_id = kwargs.get('session_id') or str(uuid.uuid4())
|
||||
|
||||
if self.use_managed_browser:
|
||||
page = await self.default_context.new_page()
|
||||
self.sessions[session_id] = (self.default_context, page, time.time())
|
||||
else:
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
context = self.browser
|
||||
page = await context.new_page()
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=kwargs.get("user_agent", self.user_agent),
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=self.accept_downloads,
|
||||
ignore_https_errors=True
|
||||
)
|
||||
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
|
||||
return session_id
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
"""
|
||||
Crawls a given URL or processes raw HTML/local file content based on the URL prefix.
|
||||
@@ -684,12 +763,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
# In persistent context, browser is the context
|
||||
context = self.browser
|
||||
page = await context.new_page()
|
||||
else:
|
||||
# Normal context creation for non-persistent or non-Chrome browsers
|
||||
context = await self.browser.new_context(
|
||||
user_agent=user_agent,
|
||||
viewport={"width": 1200, "height": 800},
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
java_script_enabled=True,
|
||||
accept_downloads=self.accept_downloads,
|
||||
@@ -699,7 +777,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
if self.use_persistent_context and self.browser_type in ["chrome", "chromium"]:
|
||||
@@ -709,7 +788,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# Normal context creation
|
||||
context = await self.browser.new_context(
|
||||
user_agent=user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
# viewport={"width": 1920, "height": 1080},
|
||||
viewport={"width": self.viewport_width, "height": self.viewport_height},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=self.accept_downloads,
|
||||
ignore_https_errors=True # Add this line
|
||||
@@ -763,9 +843,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.accept_downloads:
|
||||
page.on("download", lambda download: asyncio.create_task(self._handle_download(download)))
|
||||
|
||||
# if self.verbose:
|
||||
# print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
@@ -786,7 +863,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page, context = context)
|
||||
|
||||
|
||||
try:
|
||||
response = await page.goto(
|
||||
@@ -798,9 +874,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error as e:
|
||||
raise RuntimeError(f"Failed on navigating ACS-GOTO :\n{str(e)}")
|
||||
|
||||
# response = await page.goto("about:blank")
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page, context = context)
|
||||
|
||||
# Get status code and headers
|
||||
@@ -853,7 +926,83 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
raise Error(f"Body element is hidden: {visibility_info}")
|
||||
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
# CONTENT LOADING ASSURANCE
|
||||
if not self.text_only and (kwargs.get("wait_for_images", True) or kwargs.get("adjust_viewport_to_content", False)):
|
||||
# Wait for network idle after initial load and images to load
|
||||
await page.wait_for_load_state("networkidle")
|
||||
await asyncio.sleep(0.1)
|
||||
await page.wait_for_function("Array.from(document.images).every(img => img.complete)")
|
||||
|
||||
# After initial load, adjust viewport to content size
|
||||
if not self.text_only and kwargs.get("adjust_viewport_to_content", False):
|
||||
try:
|
||||
# Get actual page dimensions
|
||||
page_width = await page.evaluate("document.documentElement.scrollWidth")
|
||||
page_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
|
||||
target_width = self.viewport_width
|
||||
target_height = int(target_width * page_width / page_height * 0.95)
|
||||
await page.set_viewport_size({"width": target_width, "height": target_height})
|
||||
|
||||
# Compute scale factor
|
||||
# We want the entire page visible: the scale should make both width and height fit
|
||||
scale = min(target_width / page_width, target_height / page_height)
|
||||
|
||||
# Now we call CDP to set metrics.
|
||||
# We tell Chrome that the "device" is page_width x page_height in size,
|
||||
# but we scale it down so everything fits within the real viewport.
|
||||
cdp = await page.context.new_cdp_session(page)
|
||||
await cdp.send('Emulation.setDeviceMetricsOverride', {
|
||||
'width': page_width, # full page width
|
||||
'height': page_height, # full page height
|
||||
'deviceScaleFactor': 1, # keep normal DPR
|
||||
'mobile': False,
|
||||
'scale': scale # scale the entire rendered content
|
||||
})
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(
|
||||
message="Failed to adjust viewport to content: {error}",
|
||||
tag="VIEWPORT",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
|
||||
# After viewport adjustment, handle page scanning if requested
|
||||
if kwargs.get("scan_full_page", False):
|
||||
try:
|
||||
viewport_height = page.viewport_size.get("height", self.viewport_height)
|
||||
current_position = viewport_height # Start with one viewport height
|
||||
scroll_delay = kwargs.get("scroll_delay", 0.2)
|
||||
|
||||
# Initial scroll
|
||||
await page.evaluate(f"window.scrollTo(0, {current_position})")
|
||||
await asyncio.sleep(scroll_delay)
|
||||
|
||||
# Get height after first scroll to account for any dynamic content
|
||||
total_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
|
||||
while current_position < total_height:
|
||||
current_position = min(current_position + viewport_height, total_height)
|
||||
await page.evaluate(f"window.scrollTo(0, {current_position})")
|
||||
await asyncio.sleep(scroll_delay)
|
||||
|
||||
# Check for dynamic content
|
||||
new_height = await page.evaluate("document.documentElement.scrollHeight")
|
||||
if new_height > total_height:
|
||||
total_height = new_height
|
||||
|
||||
# Scroll back to top
|
||||
await page.evaluate("window.scrollTo(0, 0)")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(
|
||||
message="Failed to perform full page scan: {error}",
|
||||
tag="PAGE_SCAN",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
else:
|
||||
# Scroll to the bottom of the page
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
@@ -887,7 +1036,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# await page.wait_for_load_state('networkidle', timeout=5000)
|
||||
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
if not self.text_only:
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
@@ -944,26 +1094,26 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
}
|
||||
"""
|
||||
|
||||
try:
|
||||
try:
|
||||
await page.wait_for_load_state(
|
||||
# state="load",
|
||||
state="domcontentloaded",
|
||||
timeout=5
|
||||
try:
|
||||
await page.wait_for_load_state(
|
||||
# state="load",
|
||||
state="domcontentloaded",
|
||||
timeout=5
|
||||
)
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
|
||||
# raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
# await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
@@ -971,7 +1121,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page, context = context)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html", 0.1)
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
|
||||
@@ -6,10 +6,11 @@ from concurrent.futures import ThreadPoolExecutor
|
||||
import asyncio, requests, re, os
|
||||
from .config import *
|
||||
from bs4 import element, NavigableString, Comment
|
||||
from bs4 import PageElement, Tag
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
# from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, PruningContentFilter
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
|
||||
from .models import MarkdownGenerationResult
|
||||
from .utils import (
|
||||
@@ -80,45 +81,21 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
|
||||
|
||||
|
||||
def _generate_markdown_content(self,
|
||||
cleaned_html: str,
|
||||
html: str,
|
||||
url: str,
|
||||
success: bool,
|
||||
**kwargs) -> Dict[str, Any]:
|
||||
"""Generate markdown content using either new strategy or legacy method.
|
||||
|
||||
Args:
|
||||
cleaned_html: Sanitized HTML content
|
||||
html: Original HTML content
|
||||
url: Base URL of the page
|
||||
success: Whether scraping was successful
|
||||
**kwargs: Additional options including:
|
||||
- markdown_generator: Optional[MarkdownGenerationStrategy]
|
||||
- html2text: Dict[str, Any] options for HTML2Text
|
||||
- content_filter: Optional[RelevantContentFilter]
|
||||
- fit_markdown: bool
|
||||
- fit_markdown_user_query: Optional[str]
|
||||
- fit_markdown_bm25_threshold: float
|
||||
|
||||
Returns:
|
||||
Dict containing markdown content in various formats
|
||||
"""
|
||||
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
|
||||
|
||||
if markdown_generator:
|
||||
try:
|
||||
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
|
||||
markdown_generator.content_filter = PruningContentFilter(
|
||||
threshold_type=kwargs.get('fit_markdown_treshold_type', 'fixed'),
|
||||
threshold=kwargs.get('fit_markdown_treshold', 0.48),
|
||||
min_word_threshold=kwargs.get('fit_markdown_min_word_threshold', ),
|
||||
markdown_generator.content_filter = BM25ContentFilter(
|
||||
user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
)
|
||||
# markdown_generator.content_filter = BM25ContentFilter(
|
||||
# user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
# bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
# )
|
||||
|
||||
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
|
||||
cleaned_html=cleaned_html,
|
||||
@@ -182,13 +159,335 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
'markdown_v2' : markdown_v2
|
||||
}
|
||||
|
||||
def flatten_nested_elements(self, node):
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], Tag) and node.contents[0].name == node.name:
|
||||
return self.flatten_nested_elements(node.contents[0])
|
||||
node.contents = [self.flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
def find_closest_parent_with_useful_text(self, tag, **kwargs):
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
|
||||
def process_image(self, img, url, index, total_images, **kwargs):
|
||||
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
|
||||
if ' ' in u else None}
|
||||
for u in [f"http{p}" for p in s.split("http") if p]]
|
||||
|
||||
# Constants for checks
|
||||
classes_to_check = frozenset(['button', 'icon', 'logo'])
|
||||
tags_to_check = frozenset(['button', 'input'])
|
||||
|
||||
# Pre-fetch commonly used attributes
|
||||
style = img.get('style', '')
|
||||
alt = img.get('alt', '')
|
||||
src = img.get('src', '')
|
||||
data_src = img.get('data-src', '')
|
||||
width = img.get('width')
|
||||
height = img.get('height')
|
||||
parent = img.parent
|
||||
parent_classes = parent.get('class', [])
|
||||
|
||||
# Quick validation checks
|
||||
if ('display:none' in style or
|
||||
parent.name in tags_to_check or
|
||||
any(c in cls for c in parent_classes for cls in classes_to_check) or
|
||||
any(c in src for c in classes_to_check) or
|
||||
any(c in alt for c in classes_to_check)):
|
||||
return None
|
||||
|
||||
# Quick score calculation
|
||||
score = 0
|
||||
if width and width.isdigit():
|
||||
width_val = int(width)
|
||||
score += 1 if width_val > 150 else 0
|
||||
if height and height.isdigit():
|
||||
height_val = int(height)
|
||||
score += 1 if height_val > 150 else 0
|
||||
if alt:
|
||||
score += 1
|
||||
score += index/total_images < 0.5
|
||||
|
||||
image_format = ''
|
||||
if "data:image/" in src:
|
||||
image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
|
||||
else:
|
||||
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
|
||||
|
||||
if image_format in ('jpg', 'png', 'webp', 'avif'):
|
||||
score += 1
|
||||
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
# Use set for deduplication
|
||||
unique_urls = set()
|
||||
image_variants = []
|
||||
|
||||
# Generate a unique group ID for this set of variants
|
||||
group_id = index
|
||||
|
||||
# Base image info template
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
base_info = {
|
||||
'alt': alt,
|
||||
'desc': self.find_closest_parent_with_useful_text(img, **kwargs),
|
||||
'score': score,
|
||||
'type': 'image',
|
||||
'group_id': group_id # Group ID for this set of variants
|
||||
}
|
||||
|
||||
# Inline function for adding variants
|
||||
def add_variant(src, width=None):
|
||||
if src and not src.startswith('data:') and src not in unique_urls:
|
||||
unique_urls.add(src)
|
||||
image_variants.append({**base_info, 'src': src, 'width': width})
|
||||
|
||||
# Process all sources
|
||||
add_variant(src)
|
||||
add_variant(data_src)
|
||||
|
||||
# Handle srcset and data-srcset in one pass
|
||||
for attr in ('srcset', 'data-srcset'):
|
||||
if value := img.get(attr):
|
||||
for source in parse_srcset(value):
|
||||
add_variant(source['url'], source['width'])
|
||||
|
||||
# Quick picture element check
|
||||
if picture := img.find_parent('picture'):
|
||||
for source in picture.find_all('source'):
|
||||
if srcset := source.get('srcset'):
|
||||
for src in parse_srcset(srcset):
|
||||
add_variant(src['url'], src['width'])
|
||||
|
||||
# Framework-specific attributes in one pass
|
||||
for attr, value in img.attrs.items():
|
||||
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
|
||||
add_variant(value)
|
||||
|
||||
return image_variants if image_variants else None
|
||||
|
||||
|
||||
def process_element(self, url, element: PageElement, **kwargs) -> Dict[str, Any]:
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
self._process_element(
|
||||
url,
|
||||
element,
|
||||
media,
|
||||
internal_links_dict,
|
||||
external_links_dict,
|
||||
**kwargs
|
||||
)
|
||||
return {
|
||||
'media': media,
|
||||
'internal_links_dict': internal_links_dict,
|
||||
'external_links_dict': external_links_dict
|
||||
}
|
||||
|
||||
def _process_element(self, url, element: PageElement, media: Dict[str, Any], internal_links_dict: Dict[str, Any], external_links_dict: Dict[str, Any], **kwargs) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
self.remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if self._process_element(url, child, media, internal_links_dict, external_links_dict, **kwargs):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
word_count_threshold = kwargs.get('word_count_threshold', MIN_WORD_THRESHOLD)
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
# soup = BeautifulSoup(html, 'html.parser')
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
body = soup.body
|
||||
|
||||
@@ -200,15 +499,24 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
# Handle tag-based removal first - faster than CSS selection
|
||||
excluded_tags = set(kwargs.get('excluded_tags', []) or [])
|
||||
if excluded_tags:
|
||||
for element in body.find_all(lambda tag: tag.name in excluded_tags):
|
||||
element.extract()
|
||||
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
|
||||
for tag in kwargs.get('excluded_tags', []) or []:
|
||||
for el in body.select(tag):
|
||||
el.decompose()
|
||||
# Handle CSS selector-based removal
|
||||
excluded_selector = kwargs.get('excluded_selector', '')
|
||||
if excluded_selector:
|
||||
is_single_selector = ',' not in excluded_selector and ' ' not in excluded_selector
|
||||
if is_single_selector:
|
||||
while element := body.select_one(excluded_selector):
|
||||
element.extract()
|
||||
else:
|
||||
for element in body.select(excluded_selector):
|
||||
element.extract()
|
||||
|
||||
if css_selector:
|
||||
selected_elements = body.select(css_selector)
|
||||
@@ -227,384 +535,17 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
for el in selected_elements:
|
||||
body.append(el)
|
||||
|
||||
links = {'internal': [], 'external': []}
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
|
||||
# Extract meaningful text for media files from closest parent
|
||||
def find_closest_parent_with_useful_text(tag):
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def process_image_old(img, url, index, total_images):
|
||||
|
||||
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_src = img.get('src','')
|
||||
if "data:image/" in image_src:
|
||||
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
|
||||
else:
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
base_result = {
|
||||
'src': img.get('src', ''),
|
||||
'data-src': img.get('data-src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
sources = []
|
||||
srcset = img.get('srcset', '')
|
||||
if srcset:
|
||||
sources = parse_srcset(srcset)
|
||||
if sources:
|
||||
return [dict(base_result, src=source['url'], width=source['width'])
|
||||
for source in sources]
|
||||
|
||||
return [base_result] # Always return a list
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
|
||||
if ' ' in u else None}
|
||||
for u in [f"http{p}" for p in s.split("http") if p]]
|
||||
|
||||
# Constants for checks
|
||||
classes_to_check = frozenset(['button', 'icon', 'logo'])
|
||||
tags_to_check = frozenset(['button', 'input'])
|
||||
|
||||
# Pre-fetch commonly used attributes
|
||||
style = img.get('style', '')
|
||||
alt = img.get('alt', '')
|
||||
src = img.get('src', '')
|
||||
data_src = img.get('data-src', '')
|
||||
width = img.get('width')
|
||||
height = img.get('height')
|
||||
parent = img.parent
|
||||
parent_classes = parent.get('class', [])
|
||||
|
||||
# Quick validation checks
|
||||
if ('display:none' in style or
|
||||
parent.name in tags_to_check or
|
||||
any(c in cls for c in parent_classes for cls in classes_to_check) or
|
||||
any(c in src for c in classes_to_check) or
|
||||
any(c in alt for c in classes_to_check)):
|
||||
return None
|
||||
|
||||
# Quick score calculation
|
||||
score = 0
|
||||
if width and width.isdigit():
|
||||
width_val = int(width)
|
||||
score += 1 if width_val > 150 else 0
|
||||
if height and height.isdigit():
|
||||
height_val = int(height)
|
||||
score += 1 if height_val > 150 else 0
|
||||
if alt:
|
||||
score += 1
|
||||
score += index/total_images < 0.5
|
||||
|
||||
image_format = ''
|
||||
if "data:image/" in src:
|
||||
image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
|
||||
else:
|
||||
image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
|
||||
|
||||
if image_format in ('jpg', 'png', 'webp', 'avif'):
|
||||
score += 1
|
||||
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
# Use set for deduplication
|
||||
unique_urls = set()
|
||||
image_variants = []
|
||||
|
||||
# Generate a unique group ID for this set of variants
|
||||
group_id = index
|
||||
|
||||
# Base image info template
|
||||
base_info = {
|
||||
'alt': alt,
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image',
|
||||
'group_id': group_id # Group ID for this set of variants
|
||||
}
|
||||
|
||||
# Inline function for adding variants
|
||||
def add_variant(src, width=None):
|
||||
if src and not src.startswith('data:') and src not in unique_urls:
|
||||
unique_urls.add(src)
|
||||
image_variants.append({**base_info, 'src': src, 'width': width})
|
||||
|
||||
# Process all sources
|
||||
add_variant(src)
|
||||
add_variant(data_src)
|
||||
|
||||
# Handle srcset and data-srcset in one pass
|
||||
for attr in ('srcset', 'data-srcset'):
|
||||
if value := img.get(attr):
|
||||
for source in parse_srcset(value):
|
||||
add_variant(source['url'], source['width'])
|
||||
|
||||
# Quick picture element check
|
||||
if picture := img.find_parent('picture'):
|
||||
for source in picture.find_all('source'):
|
||||
if srcset := source.get('srcset'):
|
||||
for src in parse_srcset(srcset):
|
||||
add_variant(src['url'], src['width'])
|
||||
|
||||
# Framework-specific attributes in one pass
|
||||
for attr, value in img.attrs.items():
|
||||
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
|
||||
add_variant(value)
|
||||
|
||||
return image_variants if image_variants else None
|
||||
|
||||
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
result_obj = self.process_element(
|
||||
url,
|
||||
body,
|
||||
word_count_threshold = word_count_threshold,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if process_element(child):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
process_element(body)
|
||||
links = {'internal': [], 'external': []}
|
||||
media = result_obj['media']
|
||||
internal_links_dict = result_obj['internal_links_dict']
|
||||
external_links_dict = result_obj['external_links_dict']
|
||||
|
||||
# Update the links dictionary with unique links
|
||||
links['internal'] = list(internal_links_dict.values())
|
||||
@@ -613,23 +554,14 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
# # Process images using ThreadPoolExecutor
|
||||
imgs = body.find_all('img')
|
||||
|
||||
# For test we use for loop instead of thread
|
||||
media['images'] = [
|
||||
img for result in (process_image(img, url, i, len(imgs))
|
||||
img for result in (self.process_image(img, url, i, len(imgs))
|
||||
for i, img in enumerate(imgs))
|
||||
if result is not None
|
||||
for img in result
|
||||
]
|
||||
|
||||
def flatten_nested_elements(node):
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
|
||||
return flatten_nested_elements(node.contents[0])
|
||||
node.contents = [flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
body = flatten_nested_elements(body)
|
||||
body = self.flatten_nested_elements(body)
|
||||
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
|
||||
@@ -22,7 +22,7 @@ import textwrap
|
||||
|
||||
from .html2text import HTML2Text
|
||||
class CustomHTML2Text(HTML2Text):
|
||||
def __init__(self, *args, **kwargs):
|
||||
def __init__(self, *args, handle_code_in_pre=False, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
@@ -30,6 +30,7 @@ class CustomHTML2Text(HTML2Text):
|
||||
self.current_preserved_tag = None
|
||||
self.preserved_content = []
|
||||
self.preserve_depth = 0
|
||||
self.handle_code_in_pre = handle_code_in_pre
|
||||
|
||||
# Configuration options
|
||||
self.skip_internal_links = False
|
||||
@@ -50,6 +51,8 @@ class CustomHTML2Text(HTML2Text):
|
||||
for key, value in kwargs.items():
|
||||
if key == 'preserve_tags':
|
||||
self.preserve_tags = set(value)
|
||||
elif key == 'handle_code_in_pre':
|
||||
self.handle_code_in_pre = value
|
||||
else:
|
||||
setattr(self, key, value)
|
||||
|
||||
@@ -88,13 +91,21 @@ class CustomHTML2Text(HTML2Text):
|
||||
# Handle pre tags
|
||||
if tag == 'pre':
|
||||
if start:
|
||||
self.o('```\n')
|
||||
self.o('```\n') # Markdown code block start
|
||||
self.inside_pre = True
|
||||
else:
|
||||
self.o('\n```')
|
||||
self.o('\n```\n') # Markdown code block end
|
||||
self.inside_pre = False
|
||||
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# pass
|
||||
elif tag == 'code':
|
||||
if self.inside_pre and not self.handle_code_in_pre:
|
||||
# Ignore code tags inside pre blocks if handle_code_in_pre is False
|
||||
return
|
||||
if start:
|
||||
self.o('`') # Markdown inline code start
|
||||
self.inside_code = True
|
||||
else:
|
||||
self.o('`') # Markdown inline code end
|
||||
self.inside_code = False
|
||||
else:
|
||||
super().handle_tag(tag, attrs, start)
|
||||
|
||||
@@ -103,7 +114,39 @@ class CustomHTML2Text(HTML2Text):
|
||||
if self.preserve_depth > 0:
|
||||
self.preserved_content.append(data)
|
||||
return
|
||||
|
||||
if self.inside_pre:
|
||||
# Output the raw content for pre blocks, including content inside code tags
|
||||
self.o(data) # Directly output the data as-is (preserve newlines)
|
||||
return
|
||||
if self.inside_code:
|
||||
# Inline code: no newlines allowed
|
||||
self.o(data.replace('\n', ' '))
|
||||
return
|
||||
|
||||
# Default behavior for other tags
|
||||
super().handle_data(data, entity_char)
|
||||
|
||||
|
||||
# # Handle pre tags
|
||||
# if tag == 'pre':
|
||||
# if start:
|
||||
# self.o('```\n')
|
||||
# self.inside_pre = True
|
||||
# else:
|
||||
# self.o('\n```')
|
||||
# self.inside_pre = False
|
||||
# # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# # pass
|
||||
# else:
|
||||
# super().handle_tag(tag, attrs, start)
|
||||
|
||||
# def handle_data(self, data, entity_char=False):
|
||||
# """Override handle_data to capture content within preserved tags."""
|
||||
# if self.preserve_depth > 0:
|
||||
# self.preserved_content.append(data)
|
||||
# return
|
||||
# super().handle_data(data, entity_char)
|
||||
class InvalidCSSSelectorError(Exception):
|
||||
pass
|
||||
|
||||
|
||||
0
crawl4ai/utils.scraping.py
Normal file
0
crawl4ai/utils.scraping.py
Normal file
@@ -128,7 +128,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.schema(),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
@@ -547,6 +547,7 @@ async def generate_knowledge_graph():
|
||||
f.write(result.extracted_content)
|
||||
|
||||
async def fit_markdown_remove_overlay():
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
headless=True, # Set to False to see what is happening
|
||||
verbose=True,
|
||||
@@ -560,13 +561,15 @@ async def fit_markdown_remove_overlay():
|
||||
url='https://www.kidocode.com/degrees/technology',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
),
|
||||
options={
|
||||
"ignore_links": True
|
||||
}
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0),
|
||||
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
|
||||
# options={
|
||||
# "ignore_links": True
|
||||
# }
|
||||
|
||||
@@ -1,19 +1,28 @@
|
||||
# Crawl4AI Blog
|
||||
|
||||
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical deep dives, and news about the project.
|
||||
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
|
||||
*December 8, 2024*
|
||||
|
||||
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
|
||||
|
||||
[Read full release notes →](releases/0.4.1.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
|
||||
*December 1, 2024*
|
||||
|
||||
Introducing significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
|
||||
[Read full release notes →](releases/0.4.0.md)
|
||||
|
||||
## Project History
|
||||
|
||||
Want to see how we got here? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) covering all previous versions and the evolution of Crawl4AI.
|
||||
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
|
||||
|
||||
## Categories
|
||||
|
||||
|
||||
145
docs/md_v2/blog/releases/0.4.1.md
Normal file
145
docs/md_v2/blog/releases/0.4.1.md
Normal file
@@ -0,0 +1,145 @@
|
||||
# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!
|
||||
|
||||
_This post was generated with the help of ChatGPT, take everything with a grain of salt. 🧂_
|
||||
|
||||
Hi everyone,
|
||||
|
||||
I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.
|
||||
|
||||
---
|
||||
|
||||
### Handling Lazy Loading Better (Images Included)
|
||||
|
||||
One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.
|
||||
|
||||
Here’s how to enable it:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
wait_for_images=True # Add this argument to ensure images are fully loaded
|
||||
)
|
||||
```
|
||||
|
||||
What this does is:
|
||||
1. Waits for the page to reach a "network idle" state.
|
||||
2. Ensures all images on the page have been completely loaded.
|
||||
|
||||
This single change handles the majority of lazy-loading cases you’re likely to encounter.
|
||||
|
||||
---
|
||||
|
||||
### Text-Only Mode (Fast, Lightweight Crawling)
|
||||
|
||||
Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.
|
||||
|
||||
Here’s how to turn it on:
|
||||
|
||||
```python
|
||||
crawler = AsyncPlaywrightCrawlerStrategy(
|
||||
text_only=True # Set this to True to enable text-only crawling
|
||||
)
|
||||
```
|
||||
|
||||
When `text_only=True`, the crawler automatically:
|
||||
- Disables GPU processing.
|
||||
- Blocks image and JavaScript resources.
|
||||
- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`).
|
||||
|
||||
If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
|
||||
|
||||
---
|
||||
|
||||
### Adjusting the Viewport Dynamically
|
||||
|
||||
Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.
|
||||
|
||||
Here’s how it works:
|
||||
1. The crawler calculates the page’s width and height after it loads.
|
||||
2. It adjusts the viewport to fit the content dimensions.
|
||||
3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.
|
||||
|
||||
To enable this, use:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
adjust_viewport_to_content=True # Dynamically adjusts the viewport
|
||||
)
|
||||
```
|
||||
|
||||
This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
|
||||
|
||||
---
|
||||
|
||||
### Simulating Full-Page Scrolling
|
||||
|
||||
Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
|
||||
|
||||
Here’s an example:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
scan_full_page=True, # Enables scrolling
|
||||
scroll_delay=0.2 # Waits 200ms between scrolls (optional)
|
||||
)
|
||||
```
|
||||
|
||||
What happens here:
|
||||
1. The crawler scrolls down in increments, waiting for content to load after each scroll.
|
||||
2. It stops when no new content appears (i.e., dynamic elements stop loading).
|
||||
3. It scrolls back to the top before finishing (if necessary).
|
||||
|
||||
If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
|
||||
|
||||
---
|
||||
|
||||
### Reusing Browser Sessions (Save Time on Setup)
|
||||
|
||||
By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.
|
||||
|
||||
I added a method called `create_session` for this:
|
||||
|
||||
```python
|
||||
session_id = await crawler.create_session()
|
||||
|
||||
# Use the same session for multiple crawls
|
||||
await crawler.crawl(
|
||||
url="https://example.com/page1",
|
||||
session_id=session_id # Reuse the session
|
||||
)
|
||||
await crawler.crawl(
|
||||
url="https://example.com/page2",
|
||||
session_id=session_id
|
||||
)
|
||||
```
|
||||
|
||||
This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
|
||||
|
||||
---
|
||||
|
||||
### Other Updates
|
||||
|
||||
Here are a few smaller updates I’ve made:
|
||||
- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
|
||||
- **Logging**: Improved logs to make debugging easier.
|
||||
- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).
|
||||
|
||||
---
|
||||
|
||||
### How to Get the Update
|
||||
|
||||
You can install or upgrade to version `0.4.1` like this:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai --upgrade
|
||||
```
|
||||
|
||||
As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!
|
||||
|
||||
Enjoy the new features, and happy crawling! 🕷️
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -12,7 +12,7 @@ nav:
|
||||
- 'Quick Start': 'basic/quickstart.md'
|
||||
- Changelog & Blog:
|
||||
- 'Blog Home': 'blog/index.md'
|
||||
- 'Latest (0.4.0)': 'blog/releases/0.4.0.md'
|
||||
- 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
|
||||
- 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
|
||||
|
||||
- Basic:
|
||||
|
||||
Reference in New Issue
Block a user