Merge branch 'next'
This commit is contained in:
50
CHANGELOG.md
50
CHANGELOG.md
@@ -1,5 +1,55 @@
|
||||
# Changelog
|
||||
|
||||
## [0.3.75] December 1, 2024
|
||||
|
||||
### PruningContentFilter
|
||||
|
||||
#### 1. Introduced PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
|
||||
A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.
|
||||
|
||||
**Affected Files:**
|
||||
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
|
||||
```diff
|
||||
Implemented effective pruning algorithm with comprehensive scoring.
|
||||
```
|
||||
- `README.md`: Improved documentation regarding new features.
|
||||
```diff
|
||||
Updated to include usage and explanation for the PruningContentFilter.
|
||||
```
|
||||
- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
|
||||
```diff
|
||||
Added detailed section explaining the PruningContentFilter.
|
||||
```
|
||||
|
||||
#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
|
||||
Comprehensive tests added to ensure correct functionality of PruningContentFilter
|
||||
|
||||
**Affected Files:**
|
||||
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
|
||||
```diff
|
||||
Created test cases for various scenarios using the PruningContentFilter.
|
||||
```
|
||||
|
||||
### Development Updates
|
||||
|
||||
#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024) (Dec 01, 2024)
|
||||
Extended testing to cover additional edge cases and performance metrics.
|
||||
|
||||
**Affected Files:**
|
||||
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
|
||||
```diff
|
||||
Added tests for new extraction scenarios including malformed HTML.
|
||||
```
|
||||
|
||||
### Infrastructure & Documentation
|
||||
|
||||
#### 4. Updated Examples (Dec 01, 2024) (Dec 01, 2024)
|
||||
Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.
|
||||
|
||||
**Affected Files:**
|
||||
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
|
||||
- Revised example to illustrate usage of PruningContentFilter.
|
||||
|
||||
## [0.3.746] November 29, 2024
|
||||
|
||||
### Major Features
|
||||
|
||||
@@ -18,6 +18,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
|
||||
|
||||
## Pull Requests
|
||||
|
||||
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
|
||||
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
|
||||
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
|
||||
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
|
||||
|
||||
29
README.md
29
README.md
@@ -11,7 +11,10 @@
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.3.745](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)
|
||||
|
||||
[✨ Check out latest update v0.4.0](#-recent-updates)
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
@@ -422,7 +425,7 @@ You can check the project structure in the directory [https://github.com/uncleco
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def main():
|
||||
@@ -434,8 +437,11 @@ async def main():
|
||||
url="https://docs.micronaut.io/4.7.6/guide/",
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
|
||||
# ),
|
||||
)
|
||||
print(len(result.markdown))
|
||||
print(len(result.fit_markdown))
|
||||
@@ -620,18 +626,21 @@ async def test_news_crawl():
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
- 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.
|
||||
- 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.
|
||||
- ⚡ **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.
|
||||
- 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.
|
||||
- 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.
|
||||
- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
|
||||
- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
|
||||
- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
|
||||
- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
|
||||
- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.
|
||||
|
||||
Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).
|
||||
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
|
||||
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
|
||||
|
||||
Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
|
||||
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
|
||||
|
||||
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
|
||||
|
||||
<details>
|
||||
<summary>📈 <strong>Development TODOs</strong></summary>
|
||||
|
||||
@@ -1,2 +1,2 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.3.746"
|
||||
__version__ = "0.4.0"
|
||||
|
||||
@@ -6,6 +6,7 @@ from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os, sys, shutil
|
||||
import tempfile, subprocess
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from pathlib import Path
|
||||
@@ -16,6 +17,7 @@ import json
|
||||
import uuid
|
||||
from .models import AsyncCrawlResponse
|
||||
from .utils import create_box_message
|
||||
from .user_agent_generator import UserAgentGenerator
|
||||
from playwright_stealth import StealthConfig, stealth_async
|
||||
|
||||
stealth_config = StealthConfig(
|
||||
@@ -222,14 +224,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
self.use_cached_html = use_cached_html
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
|
||||
"Mozilla/5.0 (Linux; Android 11; SM-G973F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36"
|
||||
)
|
||||
user_agenr_generator = UserAgentGenerator()
|
||||
if kwargs.get("user_agent_mode") == "random":
|
||||
self.user_agent = user_agenr_generator.generate(
|
||||
**kwargs.get("user_agent_generator_config", {})
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.proxy_config = kwargs.get("proxy_config")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
|
||||
self.headers.setdefault("sec-ch-ua", self.browser_hint)
|
||||
self.cookies = kwargs.get("cookies", [])
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
@@ -307,7 +316,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
if self.user_agent:
|
||||
await self.default_context.set_extra_http_headers({
|
||||
"User-Agent": self.user_agent
|
||||
"User-Agent": self.user_agent,
|
||||
"sec-ch-ua": self.browser_hint,
|
||||
# **self.headers
|
||||
})
|
||||
else:
|
||||
# Base browser arguments
|
||||
@@ -321,7 +332,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list"
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
|
||||
]
|
||||
}
|
||||
|
||||
@@ -642,6 +655,15 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
self._cleanup_expired_sessions()
|
||||
session_id = kwargs.get("session_id")
|
||||
|
||||
# Check if in kwargs we have user_agent that will override the default user_agent
|
||||
user_agent = kwargs.get("user_agent", self.user_agent)
|
||||
|
||||
# Generate random user agent if magic mode is enabled and user_agent_mode is not random
|
||||
if kwargs.get("user_agent_mode") != "random" and kwargs.get("magic", False):
|
||||
user_agent = UserAgentGenerator().generate(
|
||||
**kwargs.get("user_agent_generator_config", {})
|
||||
)
|
||||
|
||||
# Handle page creation differently for managed browser
|
||||
context = None
|
||||
if self.use_managed_browser:
|
||||
@@ -666,7 +688,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
# Normal context creation for non-persistent or non-Chrome browsers
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
user_agent=user_agent,
|
||||
viewport={"width": 1200, "height": 800},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
java_script_enabled=True,
|
||||
@@ -686,10 +708,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
else:
|
||||
# Normal context creation
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
user_agent=user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=self.accept_downloads,
|
||||
ignore_https_errors=True # Add this line
|
||||
)
|
||||
if self.cookies:
|
||||
await context.add_cookies(self.cookies)
|
||||
@@ -920,7 +943,24 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
});
|
||||
}
|
||||
"""
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
|
||||
try:
|
||||
try:
|
||||
await page.wait_for_load_state(
|
||||
# state="load",
|
||||
state="domcontentloaded",
|
||||
timeout=5
|
||||
)
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
|
||||
@@ -7,6 +7,7 @@ from pathlib import Path
|
||||
from typing import Optional, List, Union
|
||||
import json
|
||||
import asyncio
|
||||
from contextlib import nullcontext
|
||||
from .models import CrawlResult, MarkdownGenerationResult
|
||||
from .async_database import async_db_manager
|
||||
from .chunking_strategy import *
|
||||
@@ -67,6 +68,7 @@ class AsyncWebCrawler:
|
||||
always_bypass_cache: bool = False,
|
||||
always_by_pass_cache: Optional[bool] = None, # Deprecated parameter
|
||||
base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
|
||||
thread_safe: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
@@ -104,6 +106,8 @@ class AsyncWebCrawler:
|
||||
else:
|
||||
self.always_bypass_cache = always_bypass_cache
|
||||
|
||||
self._lock = asyncio.Lock() if thread_safe else None
|
||||
|
||||
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
@@ -178,169 +182,170 @@ class AsyncWebCrawler:
|
||||
Returns:
|
||||
CrawlResult: The result of crawling and processing
|
||||
"""
|
||||
try:
|
||||
# Handle deprecated parameters
|
||||
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
|
||||
if kwargs.get("warning", True):
|
||||
warnings.warn(
|
||||
"Cache control boolean flags are deprecated and will be removed in version X.X.X. "
|
||||
"Use 'cache_mode' parameter instead. Examples:\n"
|
||||
"- For bypass_cache=True, use cache_mode=CacheMode.BYPASS\n"
|
||||
"- For disable_cache=True, use cache_mode=CacheMode.DISABLED\n"
|
||||
"- For no_cache_read=True, use cache_mode=CacheMode.WRITE_ONLY\n"
|
||||
"- For no_cache_write=True, use cache_mode=CacheMode.READ_ONLY\n"
|
||||
"Pass warning=False to suppress this warning.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2
|
||||
)
|
||||
async with self._lock or nullcontext():
|
||||
try:
|
||||
# Handle deprecated parameters
|
||||
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
|
||||
if kwargs.get("warning", True):
|
||||
warnings.warn(
|
||||
"Cache control boolean flags are deprecated and will be removed in version X.X.X. "
|
||||
"Use 'cache_mode' parameter instead. Examples:\n"
|
||||
"- For bypass_cache=True, use cache_mode=CacheMode.BYPASS\n"
|
||||
"- For disable_cache=True, use cache_mode=CacheMode.DISABLED\n"
|
||||
"- For no_cache_read=True, use cache_mode=CacheMode.WRITE_ONLY\n"
|
||||
"- For no_cache_write=True, use cache_mode=CacheMode.READ_ONLY\n"
|
||||
"Pass warning=False to suppress this warning.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2
|
||||
)
|
||||
|
||||
# Convert legacy parameters if cache_mode not provided
|
||||
if cache_mode is None:
|
||||
cache_mode = _legacy_to_cache_mode(
|
||||
disable_cache=disable_cache,
|
||||
bypass_cache=bypass_cache,
|
||||
no_cache_read=no_cache_read,
|
||||
no_cache_write=no_cache_write
|
||||
)
|
||||
|
||||
# Convert legacy parameters if cache_mode not provided
|
||||
# Default to ENABLED if no cache mode specified
|
||||
if cache_mode is None:
|
||||
cache_mode = _legacy_to_cache_mode(
|
||||
disable_cache=disable_cache,
|
||||
bypass_cache=bypass_cache,
|
||||
no_cache_read=no_cache_read,
|
||||
no_cache_write=no_cache_write
|
||||
cache_mode = CacheMode.ENABLED
|
||||
|
||||
# Create cache context
|
||||
cache_context = CacheContext(url, cache_mode, self.always_bypass_cache)
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
|
||||
if not isinstance(chunking_strategy, ChunkingStrategy):
|
||||
raise ValueError("Unsupported chunking strategy")
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached_result = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
|
||||
start_time = time.perf_counter()
|
||||
|
||||
# Try to get cached result if appropriate
|
||||
if cache_context.should_read():
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
if cached_result:
|
||||
html = sanitize_input_encode(cached_result.html)
|
||||
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
|
||||
if screenshot:
|
||||
screenshot_data = cached_result.screenshot
|
||||
if not screenshot_data:
|
||||
cached_result = None
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Cache hit for {cache_context.display_url} | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {time.perf_counter() - start_time:.2f}s")
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=time.perf_counter() - start_time,
|
||||
tag="FETCH"
|
||||
)
|
||||
|
||||
|
||||
# Fetch fresh content if needed
|
||||
if not cached_result or not html:
|
||||
t1 = time.perf_counter()
|
||||
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(
|
||||
url,
|
||||
screenshot=screenshot,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
# Default to ENABLED if no cache mode specified
|
||||
if cache_mode is None:
|
||||
cache_mode = CacheMode.ENABLED
|
||||
|
||||
# Create cache context
|
||||
cache_context = CacheContext(url, cache_mode, self.always_bypass_cache)
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
|
||||
if not isinstance(chunking_strategy, ChunkingStrategy):
|
||||
raise ValueError("Unsupported chunking strategy")
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached_result = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
|
||||
start_time = time.perf_counter()
|
||||
|
||||
# Try to get cached result if appropriate
|
||||
if cache_context.should_read():
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
if cached_result:
|
||||
html = sanitize_input_encode(cached_result.html)
|
||||
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
|
||||
if screenshot:
|
||||
screenshot_data = cached_result.screenshot
|
||||
if not screenshot_data:
|
||||
cached_result = None
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Cache hit for {cache_context.display_url} | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {time.perf_counter() - start_time:.2f}s")
|
||||
self.logger.url_status(
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.perf_counter()
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=time.perf_counter() - start_time,
|
||||
timing=t2 - t1,
|
||||
tag="FETCH"
|
||||
)
|
||||
)
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Live fetch for {cache_context.display_url}... | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {t2 - t1:.2f}s")
|
||||
|
||||
|
||||
# Fetch fresh content if needed
|
||||
if not cached_result or not html:
|
||||
t1 = time.perf_counter()
|
||||
# Process the HTML content
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
content_filter=content_filter,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot_data,
|
||||
verbose=verbose,
|
||||
is_cached=bool(cached_result),
|
||||
async_response=async_response,
|
||||
is_web_url=cache_context.is_web_url,
|
||||
is_local_file=cache_context.is_local_file,
|
||||
is_raw_html=cache_context.is_raw_html,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(
|
||||
url,
|
||||
screenshot=screenshot,
|
||||
**kwargs
|
||||
)
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.perf_counter()
|
||||
self.logger.url_status(
|
||||
url=cache_context.display_url,
|
||||
success=bool(html),
|
||||
timing=t2 - t1,
|
||||
tag="FETCH"
|
||||
)
|
||||
# Set response data
|
||||
if async_response:
|
||||
crawl_result.status_code = async_response.status_code
|
||||
crawl_result.response_headers = async_response.response_headers
|
||||
crawl_result.downloaded_files = async_response.downloaded_files
|
||||
else:
|
||||
crawl_result.status_code = 200
|
||||
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
|
||||
# if verbose:
|
||||
# print(f"{Fore.BLUE}{self.tag_format('FETCH')} {self.log_icons['FETCH']} Live fetch for {cache_context.display_url}... | Status: {Fore.GREEN if bool(html) else Fore.RED}{bool(html)}{Style.RESET_ALL} | Time: {t2 - t1:.2f}s")
|
||||
# print(f"{Fore.GREEN}{self.tag_format('COMPLETE')} {self.log_icons['COMPLETE']} {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | Status: {Fore.GREEN if crawl_result.success else Fore.RED}{crawl_result.success} | {Fore.YELLOW}Total: {time.perf_counter() - start_time:.2f}s{Style.RESET_ALL}")
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": crawl_result.success,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s"
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if crawl_result.success else Fore.RED,
|
||||
"timing": Fore.YELLOW
|
||||
}
|
||||
)
|
||||
|
||||
# Process the HTML content
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
content_filter=content_filter,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot_data,
|
||||
verbose=verbose,
|
||||
is_cached=bool(cached_result),
|
||||
async_response=async_response,
|
||||
is_web_url=cache_context.is_web_url,
|
||||
is_local_file=cache_context.is_local_file,
|
||||
is_raw_html=cache_context.is_raw_html,
|
||||
**kwargs,
|
||||
)
|
||||
# Update cache if appropriate
|
||||
if cache_context.should_write() and not bool(cached_result):
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
return crawl_result
|
||||
|
||||
# Set response data
|
||||
if async_response:
|
||||
crawl_result.status_code = async_response.status_code
|
||||
crawl_result.response_headers = async_response.response_headers
|
||||
crawl_result.downloaded_files = async_response.downloaded_files
|
||||
else:
|
||||
crawl_result.status_code = 200
|
||||
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
|
||||
# if verbose:
|
||||
# print(f"{Fore.GREEN}{self.tag_format('COMPLETE')} {self.log_icons['COMPLETE']} {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | Status: {Fore.GREEN if crawl_result.success else Fore.RED}{crawl_result.success} | {Fore.YELLOW}Total: {time.perf_counter() - start_time:.2f}s{Style.RESET_ALL}")
|
||||
self.logger.success(
|
||||
message="{url:.50}... | Status: {status} | Total: {timing}",
|
||||
tag="COMPLETE",
|
||||
params={
|
||||
"url": cache_context.display_url,
|
||||
"status": crawl_result.success,
|
||||
"timing": f"{time.perf_counter() - start_time:.2f}s"
|
||||
},
|
||||
colors={
|
||||
"status": Fore.GREEN if crawl_result.success else Fore.RED,
|
||||
"timing": Fore.YELLOW
|
||||
}
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
# print(f"{Fore.RED}{self.tag_format('ERROR')} {self.log_icons['ERROR']} Failed to crawl {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | {e.msg}{Style.RESET_ALL}")
|
||||
|
||||
self.logger.error_status(
|
||||
url=cache_context.display_url,
|
||||
error=create_box_message(e.msg, type = "error"),
|
||||
tag="ERROR"
|
||||
)
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html="",
|
||||
success=False,
|
||||
error_message=e.msg
|
||||
)
|
||||
|
||||
# Update cache if appropriate
|
||||
if cache_context.should_write() and not bool(cached_result):
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
return crawl_result
|
||||
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
# print(f"{Fore.RED}{self.tag_format('ERROR')} {self.log_icons['ERROR']} Failed to crawl {cache_context.display_url[:URL_LOG_SHORTEN_LENGTH]}... | {e.msg}{Style.RESET_ALL}")
|
||||
|
||||
self.logger.error_status(
|
||||
url=cache_context.display_url,
|
||||
error=create_box_message(e.msg, type = "error"),
|
||||
tag="ERROR"
|
||||
)
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html="",
|
||||
success=False,
|
||||
error_message=e.msg
|
||||
)
|
||||
|
||||
async def arun_many(
|
||||
self,
|
||||
urls: List[str],
|
||||
@@ -472,7 +477,9 @@ class AsyncWebCrawler:
|
||||
try:
|
||||
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
|
||||
t1 = time.perf_counter()
|
||||
scrapping_strategy = WebScrapingStrategy()
|
||||
scrapping_strategy = WebScrapingStrategy(
|
||||
logger=self.logger,
|
||||
)
|
||||
# result = await scrapping_strategy.ascrap(
|
||||
result = scrapping_strategy.scrap(
|
||||
url,
|
||||
|
||||
@@ -4,10 +4,10 @@ from typing import List, Tuple, Dict
|
||||
from rank_bm25 import BM25Okapi
|
||||
from time import perf_counter
|
||||
from collections import deque
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag, Comment
|
||||
from .utils import clean_tokens
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
import math
|
||||
from snowballstemmer import stemmer
|
||||
|
||||
|
||||
@@ -358,145 +358,186 @@ class BM25ContentFilter(RelevantContentFilter):
|
||||
return [self.clean_element(tag) for _, _, tag in selected_candidates]
|
||||
|
||||
|
||||
class HeuristicContentFilter(RelevantContentFilter):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
# Weights for different heuristics
|
||||
self.tag_weights = {
|
||||
'article': 10,
|
||||
'main': 8,
|
||||
'section': 5,
|
||||
'div': 3,
|
||||
'p': 2,
|
||||
'pre': 2,
|
||||
'code': 2,
|
||||
'blockquote': 2,
|
||||
'li': 1,
|
||||
'span': 1,
|
||||
}
|
||||
self.max_depth = 5 # Maximum depth from body to consider
|
||||
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
"""Implements heuristic content filtering without relying on a query."""
|
||||
|
||||
|
||||
|
||||
class PruningContentFilter(RelevantContentFilter):
|
||||
def __init__(self, user_query: str = None, min_word_threshold: int = None,
|
||||
threshold_type: str = 'fixed', threshold: float = 0.48):
|
||||
super().__init__(user_query)
|
||||
self.min_word_threshold = min_word_threshold
|
||||
self.threshold_type = threshold_type
|
||||
self.threshold = threshold
|
||||
|
||||
# Add tag importance for dynamic threshold
|
||||
self.tag_importance = {
|
||||
'article': 1.5,
|
||||
'main': 1.4,
|
||||
'section': 1.3,
|
||||
'p': 1.2,
|
||||
'h1': 1.4,
|
||||
'h2': 1.3,
|
||||
'h3': 1.2,
|
||||
'div': 0.7,
|
||||
'span': 0.6
|
||||
}
|
||||
|
||||
# Metric configuration
|
||||
self.metric_config = {
|
||||
'text_density': True,
|
||||
'link_density': True,
|
||||
'tag_weight': True,
|
||||
'class_id_weight': True,
|
||||
'text_length': True,
|
||||
}
|
||||
|
||||
self.metric_weights = {
|
||||
'text_density': 0.4,
|
||||
'link_density': 0.2,
|
||||
'tag_weight': 0.2,
|
||||
'class_id_weight': 0.1,
|
||||
'text_length': 0.1,
|
||||
}
|
||||
|
||||
self.tag_weights = {
|
||||
'div': 0.5,
|
||||
'p': 1.0,
|
||||
'article': 1.5,
|
||||
'section': 1.0,
|
||||
'span': 0.3,
|
||||
'li': 0.5,
|
||||
'ul': 0.5,
|
||||
'ol': 0.5,
|
||||
'h1': 1.2,
|
||||
'h2': 1.1,
|
||||
'h3': 1.0,
|
||||
'h4': 0.9,
|
||||
'h5': 0.8,
|
||||
'h6': 0.7,
|
||||
}
|
||||
|
||||
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
|
||||
if not html or not isinstance(html, str):
|
||||
return []
|
||||
|
||||
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
|
||||
# Ensure there is a body tag
|
||||
if not soup.body:
|
||||
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
|
||||
body = soup.body
|
||||
|
||||
# Remove comments and unwanted tags
|
||||
self._remove_comments(soup)
|
||||
self._remove_unwanted_tags(soup)
|
||||
|
||||
# Prune tree starting from body
|
||||
body = soup.find('body')
|
||||
self._prune_tree(body)
|
||||
|
||||
# Extract remaining content as list of HTML strings
|
||||
content_blocks = []
|
||||
for element in body.children:
|
||||
if isinstance(element, str) or not hasattr(element, 'name'):
|
||||
continue
|
||||
if len(element.get_text(strip=True)) > 0:
|
||||
content_blocks.append(str(element))
|
||||
|
||||
return content_blocks
|
||||
|
||||
# Extract candidate text chunks
|
||||
candidates = self.extract_text_chunks(body)
|
||||
def _remove_comments(self, soup):
|
||||
for element in soup(text=lambda text: isinstance(text, Comment)):
|
||||
element.extract()
|
||||
|
||||
if not candidates:
|
||||
return []
|
||||
def _remove_unwanted_tags(self, soup):
|
||||
for tag in self.excluded_tags:
|
||||
for element in soup.find_all(tag):
|
||||
element.decompose()
|
||||
|
||||
# Score each candidate
|
||||
scored_candidates = []
|
||||
for index, text, tag_type, tag in candidates:
|
||||
score = self.score_element(tag, text)
|
||||
if score > 0:
|
||||
scored_candidates.append((score, index, text, tag))
|
||||
def _prune_tree(self, node):
|
||||
if not node or not hasattr(node, 'name') or node.name is None:
|
||||
return
|
||||
|
||||
# Sort candidates by score and then by document order
|
||||
scored_candidates.sort(key=lambda x: (-x[0], x[1]))
|
||||
text_len = len(node.get_text(strip=True))
|
||||
tag_len = len(node.encode_contents().decode('utf-8'))
|
||||
link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s)
|
||||
|
||||
# Extract the top candidates (e.g., top 5)
|
||||
top_candidates = scored_candidates[:5] # Adjust the number as needed
|
||||
metrics = {
|
||||
'node': node,
|
||||
'tag_name': node.name,
|
||||
'text_len': text_len,
|
||||
'tag_len': tag_len,
|
||||
'link_text_len': link_text_len
|
||||
}
|
||||
|
||||
# Sort the top candidates back to their original document order
|
||||
top_candidates.sort(key=lambda x: x[1])
|
||||
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
|
||||
|
||||
# Clean and return the content
|
||||
return [self.clean_element(tag) for _, _, _, tag in top_candidates]
|
||||
if self.threshold_type == 'fixed':
|
||||
should_remove = score < self.threshold
|
||||
else: # dynamic
|
||||
tag_importance = self.tag_importance.get(node.name, 0.7)
|
||||
text_ratio = text_len / tag_len if tag_len > 0 else 0
|
||||
link_ratio = link_text_len / text_len if text_len > 0 else 1
|
||||
|
||||
threshold = self.threshold # base threshold
|
||||
if tag_importance > 1:
|
||||
threshold *= 0.8
|
||||
if text_ratio > 0.4:
|
||||
threshold *= 0.9
|
||||
if link_ratio > 0.6:
|
||||
threshold *= 1.2
|
||||
|
||||
should_remove = score < threshold
|
||||
|
||||
def score_element(self, tag: Tag, text: str) -> float:
|
||||
"""Compute a score for an element based on heuristics."""
|
||||
if not text or not tag:
|
||||
return 0
|
||||
if should_remove:
|
||||
node.decompose()
|
||||
else:
|
||||
children = [child for child in node.children if hasattr(child, 'name')]
|
||||
for child in children:
|
||||
self._prune_tree(child)
|
||||
|
||||
# Exclude unwanted tags
|
||||
if self.is_excluded(tag):
|
||||
return 0
|
||||
def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
|
||||
if self.min_word_threshold:
|
||||
# Get raw text from metrics node - avoid extra processing
|
||||
text = metrics['node'].get_text(strip=True)
|
||||
word_count = text.count(' ') + 1
|
||||
if word_count < self.min_word_threshold:
|
||||
return -1.0 # Guaranteed removal
|
||||
score = 0.0
|
||||
total_weight = 0.0
|
||||
|
||||
# Text density
|
||||
text_length = len(text.strip())
|
||||
html_length = len(str(tag))
|
||||
text_density = text_length / html_length if html_length > 0 else 0
|
||||
if self.metric_config['text_density']:
|
||||
density = text_len / tag_len if tag_len > 0 else 0
|
||||
score += self.metric_weights['text_density'] * density
|
||||
total_weight += self.metric_weights['text_density']
|
||||
|
||||
# Link density
|
||||
link_text_length = sum(len(a.get_text().strip()) for a in tag.find_all('a'))
|
||||
link_density = link_text_length / text_length if text_length > 0 else 0
|
||||
if self.metric_config['link_density']:
|
||||
density = 1 - (link_text_len / text_len if text_len > 0 else 0)
|
||||
score += self.metric_weights['link_density'] * density
|
||||
total_weight += self.metric_weights['link_density']
|
||||
|
||||
# Tag weight
|
||||
tag_weight = self.tag_weights.get(tag.name, 1)
|
||||
if self.metric_config['tag_weight']:
|
||||
tag_score = self.tag_weights.get(metrics['tag_name'], 0.5)
|
||||
score += self.metric_weights['tag_weight'] * tag_score
|
||||
total_weight += self.metric_weights['tag_weight']
|
||||
|
||||
# Depth factor (prefer elements closer to the body tag)
|
||||
depth = self.get_depth(tag)
|
||||
depth_weight = max(self.max_depth - depth, 1) / self.max_depth
|
||||
if self.metric_config['class_id_weight']:
|
||||
class_score = self._compute_class_id_weight(metrics['node'])
|
||||
score += self.metric_weights['class_id_weight'] * max(0, class_score)
|
||||
total_weight += self.metric_weights['class_id_weight']
|
||||
|
||||
# Compute the final score
|
||||
score = (text_density * tag_weight * depth_weight) / (1 + link_density)
|
||||
if self.metric_config['text_length']:
|
||||
score += self.metric_weights['text_length'] * math.log(text_len + 1)
|
||||
total_weight += self.metric_weights['text_length']
|
||||
|
||||
return score
|
||||
return score / total_weight if total_weight > 0 else 0
|
||||
|
||||
def get_depth(self, tag: Tag) -> int:
|
||||
"""Compute the depth of the tag from the body tag."""
|
||||
depth = 0
|
||||
current = tag
|
||||
while current and current != current.parent and current.name != 'body':
|
||||
current = current.parent
|
||||
depth += 1
|
||||
return depth
|
||||
|
||||
def extract_text_chunks(self, body: Tag) -> List[Tuple[int, str, str, Tag]]:
|
||||
"""
|
||||
Extracts text chunks from the body element while preserving order.
|
||||
Returns list of tuples (index, text, tag_type, tag) for scoring.
|
||||
"""
|
||||
chunks = []
|
||||
index = 0
|
||||
|
||||
def traverse(element):
|
||||
nonlocal index
|
||||
if isinstance(element, NavigableString):
|
||||
return
|
||||
if not isinstance(element, Tag):
|
||||
return
|
||||
if self.is_excluded(element):
|
||||
return
|
||||
# Only consider included tags
|
||||
if element.name in self.included_tags:
|
||||
text = element.get_text(separator=' ', strip=True)
|
||||
if len(text.split()) >= self.min_word_count:
|
||||
tag_type = 'header' if element.name in self.header_tags else 'content'
|
||||
chunks.append((index, text, tag_type, element))
|
||||
index += 1
|
||||
# Do not traverse children of this element to prevent duplication
|
||||
return
|
||||
for child in element.children:
|
||||
traverse(child)
|
||||
|
||||
traverse(body)
|
||||
return chunks
|
||||
|
||||
def is_excluded(self, tag: Tag) -> bool:
|
||||
"""Determine if a tag should be excluded based on heuristics."""
|
||||
if tag.name in self.excluded_tags:
|
||||
return True
|
||||
class_id = ' '.join(filter(None, [
|
||||
' '.join(tag.get('class', [])),
|
||||
tag.get('id', '')
|
||||
]))
|
||||
if self.negative_patterns.search(class_id):
|
||||
return True
|
||||
# Exclude tags with high link density (e.g., navigation menus)
|
||||
text = tag.get_text(separator=' ', strip=True)
|
||||
link_text_length = sum(len(a.get_text(strip=True)) for a in tag.find_all('a'))
|
||||
text_length = len(text)
|
||||
if text_length > 0 and (link_text_length / text_length) > 0.5:
|
||||
return True
|
||||
return False
|
||||
def _compute_class_id_weight(self, node):
|
||||
class_id_score = 0
|
||||
if 'class' in node.attrs:
|
||||
classes = ' '.join(node['class'])
|
||||
if self.negative_patterns.match(classes):
|
||||
class_id_score -= 0.5
|
||||
if 'id' in node.attrs:
|
||||
element_id = node['id']
|
||||
if self.negative_patterns.match(element_id):
|
||||
class_id_score -= 0.5
|
||||
return class_id_score
|
||||
@@ -9,7 +9,7 @@ from bs4 import element, NavigableString, Comment
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
# from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, PruningContentFilter
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
|
||||
from .models import MarkdownGenerationResult
|
||||
from .utils import (
|
||||
@@ -110,10 +110,15 @@ class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
if markdown_generator:
|
||||
try:
|
||||
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
|
||||
markdown_generator.content_filter = BM25ContentFilter(
|
||||
user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
markdown_generator.content_filter = PruningContentFilter(
|
||||
threshold_type=kwargs.get('fit_markdown_treshold_type', 'fixed'),
|
||||
threshold=kwargs.get('fit_markdown_treshold', 0.48),
|
||||
min_word_threshold=kwargs.get('fit_markdown_min_word_threshold', ),
|
||||
)
|
||||
# markdown_generator.content_filter = BM25ContentFilter(
|
||||
# user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
# bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
# )
|
||||
|
||||
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
|
||||
cleaned_html=cleaned_html,
|
||||
|
||||
@@ -11,8 +11,9 @@ LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
|
||||
|
||||
class MarkdownGenerationStrategy(ABC):
|
||||
"""Abstract base class for markdown generation strategies."""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
self.content_filter = content_filter
|
||||
self.options = options or {}
|
||||
|
||||
@abstractmethod
|
||||
def generate_markdown(self,
|
||||
@@ -27,8 +28,8 @@ class MarkdownGenerationStrategy(ABC):
|
||||
|
||||
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
"""Default implementation of markdown generation strategy."""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
|
||||
super().__init__(content_filter)
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
super().__init__(content_filter, options)
|
||||
|
||||
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
|
||||
link_map = {}
|
||||
@@ -74,6 +75,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
cleaned_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
citations: bool = True,
|
||||
**kwargs) -> MarkdownGenerationResult:
|
||||
@@ -82,6 +84,10 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
h = CustomHTML2Text()
|
||||
if html2text_options:
|
||||
h.update_params(**html2text_options)
|
||||
elif options:
|
||||
h.update_params(**options)
|
||||
elif self.options:
|
||||
h.update_params(**self.options)
|
||||
|
||||
# Generate raw markdown
|
||||
raw_markdown = h.handle(cleaned_html)
|
||||
|
||||
263
crawl4ai/user_agent_generator.py
Normal file
263
crawl4ai/user_agent_generator.py
Normal file
@@ -0,0 +1,263 @@
|
||||
import random
|
||||
from typing import Optional, Literal, List, Dict, Tuple
|
||||
import re
|
||||
|
||||
|
||||
class UserAgentGenerator:
|
||||
def __init__(self):
|
||||
# Previous platform definitions remain the same...
|
||||
self.desktop_platforms = {
|
||||
"windows": {
|
||||
"10_64": "(Windows NT 10.0; Win64; x64)",
|
||||
"10_32": "(Windows NT 10.0; WOW64)",
|
||||
},
|
||||
"macos": {
|
||||
"intel": "(Macintosh; Intel Mac OS X 10_15_7)",
|
||||
"newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
|
||||
},
|
||||
"linux": {
|
||||
"generic": "(X11; Linux x86_64)",
|
||||
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
|
||||
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
|
||||
}
|
||||
}
|
||||
|
||||
self.mobile_platforms = {
|
||||
"android": {
|
||||
"samsung": "(Linux; Android 13; SM-S901B)",
|
||||
"pixel": "(Linux; Android 12; Pixel 6)",
|
||||
"oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
|
||||
"xiaomi": "(Linux; Android 12; M2102J20SG)",
|
||||
},
|
||||
"ios": {
|
||||
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
|
||||
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
|
||||
}
|
||||
}
|
||||
|
||||
# Browser Combinations
|
||||
self.browser_combinations = {
|
||||
1: [
|
||||
["chrome"],
|
||||
["firefox"],
|
||||
["safari"],
|
||||
["edge"]
|
||||
],
|
||||
2: [
|
||||
["gecko", "firefox"],
|
||||
["chrome", "safari"],
|
||||
["webkit", "safari"]
|
||||
],
|
||||
3: [
|
||||
["chrome", "safari", "edge"],
|
||||
["webkit", "chrome", "safari"]
|
||||
]
|
||||
}
|
||||
|
||||
# Rendering Engines with versions
|
||||
self.rendering_engines = {
|
||||
"chrome_webkit": "AppleWebKit/537.36",
|
||||
"safari_webkit": "AppleWebKit/605.1.15",
|
||||
"gecko": [ # Added Gecko versions
|
||||
"Gecko/20100101",
|
||||
"Gecko/20100101", # Firefox usually uses this constant version
|
||||
"Gecko/2010010",
|
||||
]
|
||||
}
|
||||
|
||||
# Browser Versions
|
||||
self.chrome_versions = [
|
||||
"Chrome/119.0.6045.199",
|
||||
"Chrome/118.0.5993.117",
|
||||
"Chrome/117.0.5938.149",
|
||||
"Chrome/116.0.5845.187",
|
||||
"Chrome/115.0.5790.171",
|
||||
]
|
||||
|
||||
self.edge_versions = [
|
||||
"Edg/119.0.2151.97",
|
||||
"Edg/118.0.2088.76",
|
||||
"Edg/117.0.2045.47",
|
||||
"Edg/116.0.1938.81",
|
||||
"Edg/115.0.1901.203",
|
||||
]
|
||||
|
||||
self.safari_versions = [
|
||||
"Safari/537.36", # For Chrome-based
|
||||
"Safari/605.1.15",
|
||||
"Safari/604.1",
|
||||
"Safari/602.1",
|
||||
"Safari/601.5.17",
|
||||
]
|
||||
|
||||
# Added Firefox versions
|
||||
self.firefox_versions = [
|
||||
"Firefox/119.0",
|
||||
"Firefox/118.0.2",
|
||||
"Firefox/117.0.1",
|
||||
"Firefox/116.0",
|
||||
"Firefox/115.0.3",
|
||||
"Firefox/114.0.2",
|
||||
"Firefox/113.0.1",
|
||||
"Firefox/112.0",
|
||||
"Firefox/111.0.1",
|
||||
"Firefox/110.0",
|
||||
]
|
||||
|
||||
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
|
||||
"""Get a valid combination of browser versions"""
|
||||
if num_browsers not in self.browser_combinations:
|
||||
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
|
||||
|
||||
combination = random.choice(self.browser_combinations[num_browsers])
|
||||
browser_stack = []
|
||||
|
||||
for browser in combination:
|
||||
if browser == "chrome":
|
||||
browser_stack.append(random.choice(self.chrome_versions))
|
||||
elif browser == "firefox":
|
||||
browser_stack.append(random.choice(self.firefox_versions))
|
||||
elif browser == "safari":
|
||||
browser_stack.append(random.choice(self.safari_versions))
|
||||
elif browser == "edge":
|
||||
browser_stack.append(random.choice(self.edge_versions))
|
||||
elif browser == "gecko":
|
||||
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif browser == "webkit":
|
||||
browser_stack.append(self.rendering_engines["chrome_webkit"])
|
||||
|
||||
return browser_stack
|
||||
|
||||
def generate(self,
|
||||
device_type: Optional[Literal['desktop', 'mobile']] = None,
|
||||
os_type: Optional[str] = None,
|
||||
device_brand: Optional[str] = None,
|
||||
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
|
||||
num_browsers: int = 3) -> str:
|
||||
"""
|
||||
Generate a random user agent with specified constraints.
|
||||
|
||||
Args:
|
||||
device_type: 'desktop' or 'mobile'
|
||||
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
|
||||
device_brand: Specific device brand
|
||||
browser_type: 'chrome', 'edge', 'safari', or 'firefox'
|
||||
num_browsers: Number of browser specifications (1-3)
|
||||
"""
|
||||
# Get platform string
|
||||
platform = self.get_random_platform(device_type, os_type, device_brand)
|
||||
|
||||
# Start with Mozilla
|
||||
components = ["Mozilla/5.0", platform]
|
||||
|
||||
# Add browser stack
|
||||
browser_stack = self.get_browser_stack(num_browsers)
|
||||
|
||||
# Add appropriate legacy token based on browser stack
|
||||
if "Firefox" in str(browser_stack):
|
||||
components.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
|
||||
components.append(self.rendering_engines["chrome_webkit"])
|
||||
components.append("(KHTML, like Gecko)")
|
||||
|
||||
# Add browser versions
|
||||
components.extend(browser_stack)
|
||||
|
||||
return " ".join(components)
|
||||
|
||||
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
|
||||
"""Generate both user agent and matching client hints"""
|
||||
user_agent = self.generate(**kwargs)
|
||||
client_hints = self.generate_client_hints(user_agent)
|
||||
return user_agent, client_hints
|
||||
|
||||
def get_random_platform(self, device_type, os_type, device_brand):
|
||||
"""Helper method to get random platform based on constraints"""
|
||||
platforms = self.desktop_platforms if device_type == 'desktop' else \
|
||||
self.mobile_platforms if device_type == 'mobile' else \
|
||||
{**self.desktop_platforms, **self.mobile_platforms}
|
||||
|
||||
if os_type:
|
||||
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
|
||||
if os_type in platform_group:
|
||||
platforms = {os_type: platform_group[os_type]}
|
||||
break
|
||||
|
||||
os_key = random.choice(list(platforms.keys()))
|
||||
if device_brand and device_brand in platforms[os_key]:
|
||||
return platforms[os_key][device_brand]
|
||||
return random.choice(list(platforms[os_key].values()))
|
||||
|
||||
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
|
||||
"""Parse a user agent string to extract browser and version information"""
|
||||
browsers = {
|
||||
'chrome': r'Chrome/(\d+)',
|
||||
'edge': r'Edg/(\d+)',
|
||||
'safari': r'Version/(\d+)',
|
||||
'firefox': r'Firefox/(\d+)'
|
||||
}
|
||||
|
||||
result = {}
|
||||
for browser, pattern in browsers.items():
|
||||
match = re.search(pattern, user_agent)
|
||||
if match:
|
||||
result[browser] = match.group(1)
|
||||
|
||||
return result
|
||||
|
||||
def generate_client_hints(self, user_agent: str) -> str:
|
||||
"""Generate Sec-CH-UA header value based on user agent string"""
|
||||
browsers = self.parse_user_agent(user_agent)
|
||||
|
||||
# Client hints components
|
||||
hints = []
|
||||
|
||||
# Handle different browser combinations
|
||||
if 'chrome' in browsers:
|
||||
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
if 'edge' in browsers:
|
||||
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
|
||||
else:
|
||||
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
|
||||
|
||||
elif 'firefox' in browsers:
|
||||
# Firefox doesn't typically send Sec-CH-UA
|
||||
return '""'
|
||||
|
||||
elif 'safari' in browsers:
|
||||
# Safari's format for client hints
|
||||
hints.append(f'"Safari";v="{browsers["safari"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
return ', '.join(hints)
|
||||
|
||||
# Example usage:
|
||||
if __name__ == "__main__":
|
||||
generator = UserAgentGenerator()
|
||||
print(generator.generate())
|
||||
|
||||
print("\nSingle browser (Chrome):")
|
||||
print(generator.generate(num_browsers=1, browser_type='chrome'))
|
||||
|
||||
print("\nTwo browsers (Gecko/Firefox):")
|
||||
print(generator.generate(num_browsers=2))
|
||||
|
||||
print("\nThree browsers (Chrome/Safari/Edge):")
|
||||
print(generator.generate(num_browsers=3))
|
||||
|
||||
print("\nFirefox on Linux:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='linux',
|
||||
browser_type='firefox',
|
||||
num_browsers=2
|
||||
))
|
||||
|
||||
print("\nChrome/Safari/Edge on Windows:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='windows',
|
||||
num_browsers=3
|
||||
))
|
||||
@@ -15,7 +15,7 @@ from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
|
||||
from crawl4ai.extraction_strategy import (
|
||||
JsonCssExtractionStrategy,
|
||||
LLMExtractionStrategy,
|
||||
@@ -466,7 +466,8 @@ async def speed_comparison():
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=0,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
verbose=False,
|
||||
@@ -489,7 +490,8 @@ async def speed_comparison():
|
||||
word_count_threshold=0,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
verbose=False,
|
||||
)
|
||||
@@ -545,19 +547,50 @@ async def generate_knowledge_graph():
|
||||
f.write(result.extracted_content)
|
||||
|
||||
async def fit_markdown_remove_overlay():
|
||||
async with AsyncWebCrawler(headless = False) as crawler:
|
||||
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
|
||||
async with AsyncWebCrawler(
|
||||
headless=True, # Set to False to see what is happening
|
||||
verbose=True,
|
||||
user_agent_mode="random",
|
||||
user_agent_generator_config={
|
||||
"device_type": "mobile",
|
||||
"os_type": "android"
|
||||
},
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
url='https://www.kidocode.com/degrees/technology',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
word_count_threshold = 10,
|
||||
remove_overlay_elements=True,
|
||||
screenshot = True
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
|
||||
options={
|
||||
"ignore_links": True
|
||||
}
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0),
|
||||
# options={
|
||||
# "ignore_links": True
|
||||
# }
|
||||
# ),
|
||||
)
|
||||
# Save markdown to file
|
||||
with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
|
||||
f.write(result.fit_markdown)
|
||||
|
||||
|
||||
if result.success:
|
||||
print(len(result.markdown_v2.raw_markdown))
|
||||
print(len(result.markdown_v2.markdown_with_citations))
|
||||
print(len(result.markdown_v2.fit_markdown))
|
||||
|
||||
# Save clean html
|
||||
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
|
||||
f.write(result.cleaned_html)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.raw_markdown)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
|
||||
f.write(result.markdown_v2.markdown_with_citations)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.fit_markdown)
|
||||
|
||||
print("Done")
|
||||
|
||||
|
||||
|
||||
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra
|
||||
|
||||
## Relevance Content Filter
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
|
||||
|
||||
## Pruning Content Filter
|
||||
|
||||
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
async def filter_content(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes
|
||||
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
|
||||
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
|
||||
- For fixed threshold: Nodes scoring below this value are removed
|
||||
- For dynamic threshold: This value is adjusted based on node properties
|
||||
|
||||
### How It Works
|
||||
|
||||
The pruning algorithm evaluates each node using multiple metrics:
|
||||
- Text density: Ratio of actual text to overall node content
|
||||
- Link density: Proportion of text within links
|
||||
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
|
||||
- Content quality: Metrics like text length and structural importance
|
||||
|
||||
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
|
||||
|
||||
The algorithm is particularly effective for:
|
||||
- Removing boilerplate content
|
||||
- Eliminating navigation menus and sidebars
|
||||
- Preserving main article content
|
||||
- Maintaining document structure while removing noise
|
||||
|
||||
|
||||
## BM25 Algorithm
|
||||
|
||||
|
||||
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra
|
||||
|
||||
## Relevance Content Filter
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
|
||||
|
||||
## Pruning Content Filter
|
||||
|
||||
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
async def filter_content(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes
|
||||
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
|
||||
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
|
||||
- For fixed threshold: Nodes scoring below this value are removed
|
||||
- For dynamic threshold: This value is adjusted based on node properties
|
||||
|
||||
### How It Works
|
||||
|
||||
The pruning algorithm evaluates each node using multiple metrics:
|
||||
- Text density: Ratio of actual text to overall node content
|
||||
- Link density: Proportion of text within links
|
||||
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
|
||||
- Content quality: Metrics like text length and structural importance
|
||||
|
||||
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
|
||||
|
||||
The algorithm is particularly effective for:
|
||||
- Removing boilerplate content
|
||||
- Eliminating navigation menus and sidebars
|
||||
- Preserving main article content
|
||||
- Maintaining document structure while removing noise
|
||||
|
||||
|
||||
## BM25 Algorithm
|
||||
|
||||
@@ -21,7 +73,7 @@ from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
async def filter_content(url, query=None):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = BM25ContentFilter(user_query=query)
|
||||
result = await crawler.arun(url=url, content_filter=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
if result.success:
|
||||
print(f"Filtered Content (JSON):\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
|
||||
@@ -71,7 +123,7 @@ class MyCustomFilter(RelevantContentFilter):
|
||||
async def custom_filter_demo(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
custom_filter = MyCustomFilter()
|
||||
result = await crawler.arun(url, content_filter=custom_filter)
|
||||
result = await crawler.arun(url, extraction_strategy=custom_filter)
|
||||
if result.success:
|
||||
print(result.extracted_content)
|
||||
|
||||
|
||||
28
docs/md_v2/blog/index.md
Normal file
28
docs/md_v2/blog/index.md
Normal file
@@ -0,0 +1,28 @@
|
||||
# Crawl4AI Blog
|
||||
|
||||
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical deep dives, and news about the project.
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
|
||||
*December 1, 2024*
|
||||
|
||||
Introducing significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
|
||||
[Read full release notes →](releases/0.4.0.md)
|
||||
|
||||
## Project History
|
||||
|
||||
Want to see how we got here? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) covering all previous versions and the evolution of Crawl4AI.
|
||||
|
||||
## Categories
|
||||
|
||||
- [Technical Deep Dives](/blog/technical) - Coming soon
|
||||
- [Tutorials & Guides](/blog/tutorials) - Coming soon
|
||||
- [Community Updates](/blog/community) - Coming soon
|
||||
|
||||
## Stay Updated
|
||||
|
||||
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
|
||||
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
|
||||
- Join our community discussions on GitHub
|
||||
62
docs/md_v2/blog/releases/0.4.0.md
Normal file
62
docs/md_v2/blog/releases/0.4.0.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# Release Summary for Version 0.4.0 (December 1, 2024)
|
||||
|
||||
## Overview
|
||||
The 0.4.0 release introduces significant improvements to content filtering, multi-threaded environment handling, user-agent generation, and test coverage. Key highlights include the introduction of the PruningContentFilter, designed to automatically identify and extract the most valuable parts of an HTML document, as well as enhancements to the BM25ContentFilter to extend its versatility and effectiveness.
|
||||
|
||||
## Major Features and Enhancements
|
||||
|
||||
### 1. PruningContentFilter
|
||||
- Introduced a new unsupervised content filtering strategy that scores and prunes less relevant nodes in an HTML document based on metrics like text and link density.
|
||||
- Focuses on retaining the most valuable parts of the content, making it highly effective for extracting relevant information from complex web pages.
|
||||
- Fully documented with updated README and expanded user guides.
|
||||
|
||||
### 2. User-Agent Generator
|
||||
- Added a user-agent generator utility that resolves compatibility issues and supports customizable user-agent strings.
|
||||
- By default, the generator randomizes user agents for each request, adding diversity, but users can customize it for tailored scenarios.
|
||||
|
||||
### 3. Enhanced Thread Safety
|
||||
- Improved handling of multi-threaded environments by adding better thread locks for parallel processing, ensuring consistency and stability when running multiple threads.
|
||||
|
||||
### 4. Extended Content Filtering Strategies
|
||||
- Users now have access to both the PruningContentFilter for unsupervised extraction and the BM25ContentFilter for supervised filtering based on user queries.
|
||||
- Enhanced BM25ContentFilter with improved capabilities to process page titles, meta tags, and descriptions, allowing for more effective classification and clustering of text chunks.
|
||||
|
||||
### 5. Documentation Updates
|
||||
- Updated examples and tutorials to promote the use of the PruningContentFilter alongside the BM25ContentFilter, providing clear instructions for selecting the appropriate filter for each use case.
|
||||
|
||||
### 6. Unit Test Enhancements
|
||||
- Added unit tests for PruningContentFilter to ensure accuracy and reliability.
|
||||
- Enhanced BM25ContentFilter tests to cover additional edge cases and performance metrics, particularly for malformed HTML inputs.
|
||||
|
||||
## Revised Change Logs for Version 0.4.0
|
||||
|
||||
### PruningContentFilter (Dec 01, 2024)
|
||||
- Introduced the PruningContentFilter to optimize content extraction by pruning less relevant HTML nodes.
|
||||
- **Affected Files:**
|
||||
- **crawl4ai/content_filter_strategy.py**: Added a scoring-based pruning algorithm.
|
||||
- **README.md**: Updated to include PruningContentFilter usage.
|
||||
- **docs/md_v2/basic/content_filtering.md**: Expanded user documentation, detailing the use and benefits of PruningContentFilter.
|
||||
|
||||
### Unit Tests for PruningContentFilter (Dec 01, 2024)
|
||||
- Added comprehensive unit tests for PruningContentFilter to ensure correctness and efficiency.
|
||||
- **Affected Files:**
|
||||
- **tests/async/test_content_filter_prune.py**: Created tests covering different pruning scenarios to ensure stability and correctness.
|
||||
|
||||
### Enhanced BM25ContentFilter Tests (Dec 01, 2024)
|
||||
- Expanded tests to cover additional extraction scenarios and performance metrics, improving robustness.
|
||||
- **Affected Files:**
|
||||
- **tests/async/test_content_filter_bm25.py**: Added tests for edge cases, including malformed HTML inputs.
|
||||
|
||||
### Documentation and Example Updates (Dec 01, 2024)
|
||||
- Revised examples to illustrate the use of PruningContentFilter alongside existing content filtering methods.
|
||||
- **Affected Files:**
|
||||
- **docs/examples/quickstart_async.py**: Enhanced example clarity and usability for new users.
|
||||
|
||||
## Experimental Features
|
||||
- The PruningContentFilter is still under experimental development, and we continue to gather feedback for further refinements.
|
||||
|
||||
## Conclusion
|
||||
This release significantly enhances the content extraction capabilities of Crawl4ai with the introduction of the PruningContentFilter, improved supervised filtering with BM25ContentFilter, and robust multi-threaded handling. Additionally, the user-agent generator provides much-needed versatility, resolving compatibility issues faced by many users.
|
||||
|
||||
Users are encouraged to experiment with the new content filtering methods to determine which best suits their needs.
|
||||
|
||||
14
mkdocs.yml
14
mkdocs.yml
@@ -10,7 +10,11 @@ nav:
|
||||
- 'Installation': 'basic/installation.md'
|
||||
- 'Docker Deplotment': 'basic/docker-deploymeny.md'
|
||||
- 'Quick Start': 'basic/quickstart.md'
|
||||
|
||||
- Changelog & Blog:
|
||||
- 'Blog Home': 'blog/index.md'
|
||||
- 'Latest (0.4.0)': 'blog/releases/0.4.0.md'
|
||||
- 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
|
||||
|
||||
- Basic:
|
||||
- 'Simple Crawling': 'basic/simple-crawling.md'
|
||||
- 'Output Formats': 'basic/output-formats.md'
|
||||
@@ -50,12 +54,12 @@ nav:
|
||||
- '5. Dynamic Content': 'tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md'
|
||||
- '6. Magic Mode': 'tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md'
|
||||
- '7. Content Cleaning': 'tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md'
|
||||
- '8. Media Handling': 'tutorial/episode_08_Media_Handling:_Images,_Videos,_and_Audio.md'
|
||||
- '8. Media Handling': 'tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md'
|
||||
- '9. Link Analysis': 'tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md'
|
||||
- '10. User Simulation': 'tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md'
|
||||
- '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies:_JSON_CSS.md'
|
||||
- '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies:_LLM.md'
|
||||
- '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies:_Cosine.md'
|
||||
- '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md'
|
||||
- '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies_LLM.md'
|
||||
- '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies_Cosine.md'
|
||||
- '12. Session Crawling': 'tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md'
|
||||
- '13. Text Chunking': 'tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md'
|
||||
- '14. Custom Workflows': 'tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md'
|
||||
|
||||
159
tests/async/test_content_filter_prune.py
Normal file
159
tests/async/test_content_filter_prune.py
Normal file
@@ -0,0 +1,159 @@
|
||||
import os, sys
|
||||
import pytest
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
sys.path.append(parent_dir)
|
||||
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
@pytest.fixture
|
||||
def basic_html():
|
||||
return """
|
||||
<html>
|
||||
<body>
|
||||
<article>
|
||||
<h1>Main Article</h1>
|
||||
<p>This is a high-quality paragraph with substantial text content. It contains enough words to pass the threshold and has good text density without too many links. This kind of content should survive the pruning process.</p>
|
||||
<div class="sidebar">Low quality sidebar content</div>
|
||||
<div class="social-share">Share buttons</div>
|
||||
</article>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
@pytest.fixture
|
||||
def link_heavy_html():
|
||||
return """
|
||||
<html>
|
||||
<body>
|
||||
<div class="content">
|
||||
<p>Good content paragraph that should remain.</p>
|
||||
<div class="links">
|
||||
<a href="#">Link 1</a>
|
||||
<a href="#">Link 2</a>
|
||||
<a href="#">Link 3</a>
|
||||
<a href="#">Link 4</a>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
@pytest.fixture
|
||||
def mixed_content_html():
|
||||
return """
|
||||
<html>
|
||||
<body>
|
||||
<article>
|
||||
<h1>Article Title</h1>
|
||||
<p class="summary">Short summary.</p>
|
||||
<div class="content">
|
||||
<p>Long high-quality paragraph with substantial content that should definitely survive the pruning process. This content has good text density and proper formatting which makes it valuable for retention.</p>
|
||||
</div>
|
||||
<div class="comments">
|
||||
<p>Short comment 1</p>
|
||||
<p>Short comment 2</p>
|
||||
</div>
|
||||
</article>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
class TestPruningContentFilter:
|
||||
def test_basic_pruning(self, basic_html):
|
||||
"""Test basic content pruning functionality"""
|
||||
filter = PruningContentFilter(min_word_threshold=5)
|
||||
contents = filter.filter_content(basic_html)
|
||||
|
||||
combined_content = ' '.join(contents).lower()
|
||||
assert "high-quality paragraph" in combined_content
|
||||
assert "sidebar content" not in combined_content
|
||||
assert "share buttons" not in combined_content
|
||||
|
||||
def test_min_word_threshold(self, mixed_content_html):
|
||||
"""Test minimum word threshold filtering"""
|
||||
filter = PruningContentFilter(min_word_threshold=10)
|
||||
contents = filter.filter_content(mixed_content_html)
|
||||
|
||||
combined_content = ' '.join(contents).lower()
|
||||
assert "short summary" not in combined_content
|
||||
assert "long high-quality paragraph" in combined_content
|
||||
assert "short comment" not in combined_content
|
||||
|
||||
def test_threshold_types(self, basic_html):
|
||||
"""Test fixed vs dynamic thresholds"""
|
||||
fixed_filter = PruningContentFilter(threshold_type='fixed', threshold=0.48)
|
||||
dynamic_filter = PruningContentFilter(threshold_type='dynamic', threshold=0.45)
|
||||
|
||||
fixed_contents = fixed_filter.filter_content(basic_html)
|
||||
dynamic_contents = dynamic_filter.filter_content(basic_html)
|
||||
|
||||
assert len(fixed_contents) != len(dynamic_contents), \
|
||||
"Fixed and dynamic thresholds should yield different results"
|
||||
|
||||
def test_link_density_impact(self, link_heavy_html):
|
||||
"""Test handling of link-heavy content"""
|
||||
filter = PruningContentFilter(threshold_type='dynamic')
|
||||
contents = filter.filter_content(link_heavy_html)
|
||||
|
||||
combined_content = ' '.join(contents).lower()
|
||||
assert "good content paragraph" in combined_content
|
||||
assert len([c for c in contents if 'href' in c]) < 2, \
|
||||
"Should prune link-heavy sections"
|
||||
|
||||
def test_tag_importance(self, mixed_content_html):
|
||||
"""Test tag importance in scoring"""
|
||||
filter = PruningContentFilter(threshold_type='dynamic')
|
||||
contents = filter.filter_content(mixed_content_html)
|
||||
|
||||
has_article = any('article' in c.lower() for c in contents)
|
||||
has_h1 = any('h1' in c.lower() for c in contents)
|
||||
assert has_article or has_h1, "Should retain important tags"
|
||||
|
||||
def test_empty_input(self):
|
||||
"""Test handling of empty input"""
|
||||
filter = PruningContentFilter()
|
||||
assert filter.filter_content("") == []
|
||||
assert filter.filter_content(None) == []
|
||||
|
||||
def test_malformed_html(self):
|
||||
"""Test handling of malformed HTML"""
|
||||
malformed_html = "<div>Unclosed div<p>Nested<span>content</div>"
|
||||
filter = PruningContentFilter()
|
||||
contents = filter.filter_content(malformed_html)
|
||||
assert isinstance(contents, list)
|
||||
|
||||
def test_performance(self, basic_html):
|
||||
"""Test performance with timer"""
|
||||
filter = PruningContentFilter()
|
||||
|
||||
import time
|
||||
start = time.perf_counter()
|
||||
filter.filter_content(basic_html)
|
||||
duration = time.perf_counter() - start
|
||||
|
||||
# Extra strict on performance since you mentioned milliseconds matter
|
||||
assert duration < 0.1, f"Processing took too long: {duration:.3f} seconds"
|
||||
|
||||
@pytest.mark.parametrize("threshold,expected_count", [
|
||||
(0.3, 4), # Very lenient
|
||||
(0.48, 2), # Default
|
||||
(0.7, 1), # Very strict
|
||||
])
|
||||
def test_threshold_levels(self, mixed_content_html, threshold, expected_count):
|
||||
"""Test different threshold levels"""
|
||||
filter = PruningContentFilter(threshold_type='fixed', threshold=threshold)
|
||||
contents = filter.filter_content(mixed_content_html)
|
||||
assert len(contents) <= expected_count, \
|
||||
f"Expected {expected_count} or fewer elements with threshold {threshold}"
|
||||
|
||||
def test_consistent_output(self, basic_html):
|
||||
"""Test output consistency across multiple runs"""
|
||||
filter = PruningContentFilter()
|
||||
first_run = filter.filter_content(basic_html)
|
||||
second_run = filter.filter_content(basic_html)
|
||||
assert first_run == second_run, "Output should be consistent"
|
||||
|
||||
if __name__ == "__main__":
|
||||
pytest.main([__file__])
|
||||
Reference in New Issue
Block a user