Compare commits
34 Commits
0.3.5
...
scraper-uc
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
0d357ab7d2 | ||
|
|
bae4665949 | ||
|
|
d11c004fbb | ||
|
|
3d1c9a8434 | ||
|
|
be472c624c | ||
|
|
06b21dcc50 | ||
|
|
0f0f60527d | ||
|
|
8105fd178e | ||
|
|
ce7fce4b16 | ||
|
|
de28b59aca | ||
|
|
04d8b47b92 | ||
|
|
2943feeecf | ||
|
|
8a7d29ce85 | ||
|
|
159bd875bd | ||
|
|
9ffa34b697 | ||
|
|
740802c491 | ||
|
|
b9ac96c332 | ||
|
|
d06535388a | ||
|
|
2b73bdf6b0 | ||
|
|
6aa803d712 | ||
|
|
320afdea64 | ||
|
|
ccbe72cfc1 | ||
|
|
b9bbd42373 | ||
|
|
68e9144ce3 | ||
|
|
9b2b267820 | ||
|
|
ff3524d9b1 | ||
|
|
b99d20b725 | ||
|
|
768b93140f | ||
|
|
d743adac68 | ||
|
|
7fe220dbd5 | ||
|
|
65e013d9d1 | ||
|
|
7f3e2e47ed | ||
|
|
78f26ac263 | ||
|
|
44ce12c62c |
8
.gitignore
vendored
8
.gitignore
vendored
@@ -201,3 +201,11 @@ test_env/
|
||||
todo.md
|
||||
git_changes.py
|
||||
git_changes.md
|
||||
pypi_build.sh
|
||||
git_issues.py
|
||||
git_issues.md
|
||||
|
||||
.tests/
|
||||
.issues/
|
||||
.docs/
|
||||
.issues/
|
||||
78
CHANGELOG.md
78
CHANGELOG.md
@@ -1,5 +1,83 @@
|
||||
# Changelog
|
||||
|
||||
## [v0.3.6] - 2024-10-12
|
||||
|
||||
### 1. Improved Crawling Control
|
||||
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
|
||||
- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
|
||||
- Useful for pages with delayed content loading.
|
||||
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
|
||||
- Provides better handling for slow-loading pages.
|
||||
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
|
||||
|
||||
### 2. Browser Type Selection
|
||||
- Added support for different browser types (Chromium, Firefox, WebKit).
|
||||
- Users can now specify the browser type when initializing AsyncWebCrawler.
|
||||
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
|
||||
|
||||
### 3. Screenshot Capture
|
||||
- Added ability to capture screenshots during crawling.
|
||||
- Useful for debugging and content verification.
|
||||
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
|
||||
|
||||
### 4. Enhanced LLM Extraction Strategy
|
||||
- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
|
||||
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
|
||||
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
|
||||
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
|
||||
|
||||
### 5. iframe Content Extraction
|
||||
- New feature to process and extract content from iframes.
|
||||
- **How to use**: Set `process_iframes=True` in the crawl method.
|
||||
|
||||
### 6. Delayed Content Retrieval
|
||||
- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
|
||||
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
|
||||
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
|
||||
|
||||
## Improvements and Optimizations
|
||||
|
||||
### 1. AsyncWebCrawler Enhancements
|
||||
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
|
||||
- Allows for more customized setups.
|
||||
|
||||
### 2. Image Processing Optimization
|
||||
- Enhanced image handling in WebScrappingStrategy.
|
||||
- Added filtering for small, invisible, or irrelevant images.
|
||||
- Improved image scoring system for better content relevance.
|
||||
- Implemented JavaScript-based image dimension updating for more accurate representation.
|
||||
|
||||
### 3. Database Schema Auto-updates
|
||||
- Automatic database schema updates ensure compatibility with the latest version.
|
||||
|
||||
### 4. Enhanced Error Handling and Logging
|
||||
- Improved error messages and logging for easier debugging.
|
||||
|
||||
### 5. Content Extraction Refinements
|
||||
- Refined HTML sanitization process.
|
||||
- Improved handling of base64 encoded images.
|
||||
- Enhanced Markdown conversion process.
|
||||
- Optimized content extraction algorithms.
|
||||
|
||||
### 6. Utility Function Enhancements
|
||||
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
|
||||
|
||||
## Bug Fixes
|
||||
- Fixed an issue where image tags were being prematurely removed during content extraction.
|
||||
|
||||
## Examples and Documentation
|
||||
- Updated `quickstart_async.py` with examples of:
|
||||
- Using custom headers in LLM extraction.
|
||||
- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
|
||||
- Custom browser type usage.
|
||||
|
||||
## Developer Notes
|
||||
- Refactored code for better maintainability, flexibility, and performance.
|
||||
- Enhanced type hinting throughout the codebase for improved development experience.
|
||||
- Expanded error handling for more robust operation.
|
||||
|
||||
These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
|
||||
|
||||
## [v0.3.5] - 2024-09-02
|
||||
|
||||
Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
|
||||
|
||||
10
README.md
10
README.md
@@ -10,6 +10,14 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
|
||||
|
||||
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
|
||||
|
||||
## New update 0.3.6
|
||||
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- 🖼️ Improved image processing with lazy-loading detection
|
||||
- 🔧 Custom page timeout parameter for better control over crawling behavior
|
||||
- 🕰️ Enhanced handling of delayed content loading
|
||||
- 🔑 Custom headers support for LLM interactions
|
||||
- 🖼️ iframe content extraction for comprehensive page analysis
|
||||
- ⏱️ Flexible timeout and delayed content retrieval options
|
||||
|
||||
## Try it Now!
|
||||
|
||||
@@ -124,7 +132,7 @@ async def main():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
css_selector="article.tease-card",
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
from .async_webcrawler import AsyncWebCrawler
|
||||
from .models import CrawlResult
|
||||
|
||||
__version__ = "0.3.5"
|
||||
__version__ = "0.3.6"
|
||||
|
||||
__all__ = [
|
||||
"AsyncWebCrawler",
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
import asyncio
|
||||
import base64, time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Callable, Dict, Any, List, Optional
|
||||
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from io import BytesIO
|
||||
@@ -18,6 +18,10 @@ class AsyncCrawlResponse(BaseModel):
|
||||
response_headers: Dict[str, str]
|
||||
status_code: int
|
||||
screenshot: Optional[str] = None
|
||||
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
class AsyncCrawlerStrategy(ABC):
|
||||
@abstractmethod
|
||||
@@ -46,7 +50,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.headers = {}
|
||||
self.browser_type = kwargs.get("browser_type", "chromium") # New parameter
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
self.js_code = js_code
|
||||
@@ -59,7 +64,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
'on_execution_started': None,
|
||||
'before_goto': None,
|
||||
'after_goto': None,
|
||||
'before_return_html': None
|
||||
'before_return_html': None,
|
||||
'before_retrieve_html': None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
@@ -75,7 +81,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.browser is None:
|
||||
browser_args = {
|
||||
"headless": self.headless,
|
||||
# "headless": False,
|
||||
"args": [
|
||||
"--disable-gpu",
|
||||
"--disable-dev-shm-usage",
|
||||
@@ -90,7 +95,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
# Select the appropriate browser based on the browser_type
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
|
||||
await self.execute_hook('on_browser_created', self.browser)
|
||||
|
||||
async def close(self):
|
||||
@@ -140,7 +152,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
for sid in expired_sessions:
|
||||
asyncio.create_task(self.kill_session(sid))
|
||||
|
||||
|
||||
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||
wait_for = wait_for.strip()
|
||||
|
||||
@@ -204,6 +215,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||
|
||||
async def process_iframes(self, page):
|
||||
# Find all iframes
|
||||
iframes = await page.query_selector_all('iframe')
|
||||
|
||||
for i, iframe in enumerate(iframes):
|
||||
try:
|
||||
# Add a unique identifier to the iframe
|
||||
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||
|
||||
# Get the frame associated with this iframe
|
||||
frame = await iframe.content_frame()
|
||||
|
||||
if frame:
|
||||
# Wait for the frame to load
|
||||
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||
|
||||
# Extract the content of the iframe's body
|
||||
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||
|
||||
# Generate a unique class name for this iframe
|
||||
class_name = f'extracted-iframe-content-{i}'
|
||||
|
||||
# Replace the iframe with a div containing the extracted content
|
||||
_iframe = iframe_content.replace('`', '\\`')
|
||||
await page.evaluate(f"""
|
||||
() => {{
|
||||
const iframe = document.getElementById('iframe-{i}');
|
||||
const div = document.createElement('div');
|
||||
div.innerHTML = `{_iframe}`;
|
||||
div.className = '{class_name}';
|
||||
iframe.replaceWith(div);
|
||||
}}
|
||||
""")
|
||||
else:
|
||||
print(f"Warning: Could not access content frame for iframe {i}")
|
||||
except Exception as e:
|
||||
print(f"Error processing iframe {i}: {str(e)}")
|
||||
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
response_headers = {}
|
||||
status_code = None
|
||||
@@ -248,7 +301,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page)
|
||||
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
|
||||
response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
|
||||
await self.execute_hook('after_goto', page)
|
||||
|
||||
# Get status code and headers
|
||||
@@ -258,6 +311,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
status_code = 200
|
||||
response_headers = {}
|
||||
|
||||
|
||||
await page.wait_for_selector('body')
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
@@ -291,12 +345,89 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
wait_for = kwargs.get("wait_for")
|
||||
if wait_for:
|
||||
try:
|
||||
await self.smart_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
|
||||
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
screenshot_data = await self.take_screenshot(url)
|
||||
|
||||
|
||||
# New code to update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
// Filter out images that are too small
|
||||
if (img.width < 100 && img.height < 100) return false;
|
||||
|
||||
// Filter out images that are not visible
|
||||
const rect = img.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return false;
|
||||
|
||||
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||
|
||||
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||
|
||||
return true;
|
||||
};
|
||||
|
||||
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||
let imagesLeft = images.length;
|
||||
|
||||
if (imagesLeft === 0) {
|
||||
resolve();
|
||||
return;
|
||||
}
|
||||
|
||||
const checkImage = (img) => {
|
||||
if (img.complete && img.naturalWidth !== 0) {
|
||||
img.setAttribute('width', img.naturalWidth);
|
||||
img.setAttribute('height', img.naturalHeight);
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
}
|
||||
};
|
||||
|
||||
images.forEach(img => {
|
||||
checkImage(img);
|
||||
if (!img.complete) {
|
||||
img.onload = () => {
|
||||
checkImage(img);
|
||||
};
|
||||
img.onerror = () => {
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
};
|
||||
}
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
setTimeout(() => resolve(), 5000);
|
||||
});
|
||||
}
|
||||
"""
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
page = await self.process_iframes(page)
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
html = await page.content()
|
||||
page = await self.execute_hook('before_return_html', page, html)
|
||||
await self.execute_hook('before_return_html', page, html)
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
@@ -312,7 +443,20 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"status_code": status_code
|
||||
}, f)
|
||||
|
||||
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
|
||||
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||
if self.verbose:
|
||||
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||
await asyncio.sleep(delay)
|
||||
return await page.content()
|
||||
|
||||
response = AsyncCrawlResponse(
|
||||
html=html,
|
||||
response_headers=response_headers,
|
||||
status_code=status_code,
|
||||
screenshot=screenshot_data,
|
||||
get_delayed_content=get_delayed_content
|
||||
)
|
||||
return response
|
||||
except Error as e:
|
||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
@@ -370,7 +514,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error as e:
|
||||
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
|
||||
|
||||
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
@@ -383,11 +526,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def take_screenshot(self, url: str) -> str:
|
||||
async def take_screenshot(self, url: str, wait_time = 1000) -> str:
|
||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||
page = await context.new_page()
|
||||
try:
|
||||
await page.goto(url, wait_until="domcontentloaded")
|
||||
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
# Wait for a specified time (default is 1 second)
|
||||
await page.wait_for_timeout(wait_time)
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
|
||||
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
|
||||
)
|
||||
''')
|
||||
await db.commit()
|
||||
await self.update_db_schema()
|
||||
|
||||
async def aalter_db_add_screenshot(self, new_column: str = "media"):
|
||||
async def update_db_schema(self):
|
||||
async with aiosqlite.connect(self.db_path) as db:
|
||||
# Check if the 'media' column exists
|
||||
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
||||
columns = await cursor.fetchall()
|
||||
column_names = [column[1] for column in columns]
|
||||
|
||||
if 'media' not in column_names:
|
||||
await self.aalter_db_add_column('media')
|
||||
|
||||
# Check for other missing columns and add them if necessary
|
||||
for column in ['links', 'metadata', 'screenshot']:
|
||||
if column not in column_names:
|
||||
await self.aalter_db_add_column(column)
|
||||
|
||||
async def aalter_db_add_column(self, new_column: str):
|
||||
try:
|
||||
async with aiosqlite.connect(self.db_path) as db:
|
||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
||||
await db.commit()
|
||||
print(f"Added column '{new_column}' to the database.")
|
||||
except Exception as e:
|
||||
print(f"Error altering database to add screenshot column: {e}")
|
||||
print(f"Error altering database to add {new_column} column: {e}")
|
||||
|
||||
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
|
||||
try:
|
||||
|
||||
@@ -23,17 +23,17 @@ class AsyncWebCrawler:
|
||||
self,
|
||||
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
||||
always_by_pass_cache: bool = False,
|
||||
verbose: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
||||
verbose=verbose
|
||||
**kwargs
|
||||
)
|
||||
self.always_by_pass_cache = always_by_pass_cache
|
||||
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
self.ready = False
|
||||
self.verbose = verbose
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
async def __aenter__(self):
|
||||
await self.crawler_strategy.__aenter__()
|
||||
@@ -202,11 +202,11 @@ class AsyncWebCrawler:
|
||||
)
|
||||
|
||||
if result is None:
|
||||
raise ValueError(f"Failed to extract content from the website: {url}")
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
|
||||
except InvalidCSSSelectorError as e:
|
||||
raise ValueError(str(e))
|
||||
except Exception as e:
|
||||
raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
|
||||
|
||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||
|
||||
@@ -16,8 +16,6 @@ from .utils import (
|
||||
CustomHTML2Text
|
||||
)
|
||||
|
||||
|
||||
|
||||
class ContentScrappingStrategy(ABC):
|
||||
@abstractmethod
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
@@ -129,7 +127,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
@@ -158,6 +156,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', ''),
|
||||
'data-src': img.get('data-src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
@@ -170,10 +169,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
if element.name == 'img':
|
||||
process_image(element, url, 0, 1)
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
@@ -273,11 +274,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
|
||||
h = CustomHTML2Text()
|
||||
h.ignore_links = True
|
||||
markdown = h.handle(cleaned_html)
|
||||
h.body_width = 0
|
||||
try:
|
||||
markdown = h.handle(cleaned_html)
|
||||
except Exception as e:
|
||||
markdown = h.handle(sanitize_html(cleaned_html))
|
||||
markdown = markdown.replace(' ```', '```')
|
||||
|
||||
try:
|
||||
@@ -286,6 +290,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
return {
|
||||
'markdown': markdown,
|
||||
'cleaned_html': cleaned_html,
|
||||
|
||||
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
|
||||
self.apply_chunking = kwargs.get("apply_chunking", True)
|
||||
self.base_url = kwargs.get("base_url", None)
|
||||
self.extra_args = kwargs.get("extra_args", {})
|
||||
if not self.apply_chunking:
|
||||
self.chunk_token_threshold = 1e9
|
||||
|
||||
@@ -111,7 +112,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"{" + variable + "}", variable_values[variable]
|
||||
)
|
||||
|
||||
response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url) # , json_response=self.extract_type == "schema")
|
||||
response = perform_completion_with_backoff(
|
||||
self.provider,
|
||||
prompt_with_variables,
|
||||
self.api_token,
|
||||
base_url=self.base_url,
|
||||
extra_args = self.extra_args
|
||||
) # , json_response=self.extract_type == "schema")
|
||||
try:
|
||||
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
||||
blocks = json.loads(blocks)
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
|
||||
PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
|
||||
<url>{URL}</url>
|
||||
|
||||
And here is the cleaned HTML content of that webpage:
|
||||
@@ -79,7 +79,7 @@ To generate the JSON objects:
|
||||
2. For each block:
|
||||
a. Assign it an index based on its order in the content.
|
||||
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
|
||||
c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||
c. Extract the text content, EXACTLY SAME AS THE GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||
|
||||
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
|
||||
|
||||
|
||||
3
crawl4ai/scraper/__init__.py
Normal file
3
crawl4ai/scraper/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .async_web_scraper import AsyncWebScraper
|
||||
from .bfs_scraper_strategy import BFSScraperStrategy
|
||||
from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter
|
||||
123
crawl4ai/scraper/async_web_scraper.py
Normal file
123
crawl4ai/scraper/async_web_scraper.py
Normal file
@@ -0,0 +1,123 @@
|
||||
from typing import Union, AsyncGenerator, Optional
|
||||
from .scraper_strategy import ScraperStrategy
|
||||
from .models import ScraperResult, CrawlResult
|
||||
from ..async_webcrawler import AsyncWebCrawler
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
from contextlib import asynccontextmanager
|
||||
|
||||
@dataclass
|
||||
class ScrapingProgress:
|
||||
"""Tracks the progress of a scraping operation."""
|
||||
processed_urls: int = 0
|
||||
failed_urls: int = 0
|
||||
current_url: Optional[str] = None
|
||||
|
||||
class AsyncWebScraper:
|
||||
"""
|
||||
A high-level web scraper that combines an async crawler with a scraping strategy.
|
||||
|
||||
Args:
|
||||
crawler (AsyncWebCrawler): The async web crawler implementation
|
||||
strategy (ScraperStrategy): The scraping strategy to use
|
||||
logger (Optional[logging.Logger]): Custom logger for the scraper
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
crawler: AsyncWebCrawler,
|
||||
strategy: ScraperStrategy,
|
||||
logger: Optional[logging.Logger] = None
|
||||
):
|
||||
if not isinstance(crawler, AsyncWebCrawler):
|
||||
raise TypeError("crawler must be an instance of AsyncWebCrawler")
|
||||
if not isinstance(strategy, ScraperStrategy):
|
||||
raise TypeError("strategy must be an instance of ScraperStrategy")
|
||||
|
||||
self.crawler = crawler
|
||||
self.strategy = strategy
|
||||
self.logger = logger or logging.getLogger(__name__)
|
||||
self._progress = ScrapingProgress()
|
||||
|
||||
@property
|
||||
def progress(self) -> ScrapingProgress:
|
||||
"""Get current scraping progress."""
|
||||
return self._progress
|
||||
|
||||
@asynccontextmanager
|
||||
async def _error_handling_context(self, url: str):
|
||||
"""Context manager for handling errors during scraping."""
|
||||
try:
|
||||
yield
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error scraping {url}: {str(e)}")
|
||||
self._progress.failed_urls += 1
|
||||
raise
|
||||
|
||||
async def ascrape(
|
||||
self,
|
||||
url: str,
|
||||
parallel_processing: bool = True,
|
||||
stream: bool = False
|
||||
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
|
||||
"""
|
||||
Scrape a website starting from the given URL.
|
||||
|
||||
Args:
|
||||
url: Starting URL for scraping
|
||||
parallel_processing: Whether to process URLs in parallel
|
||||
stream: If True, yield results as they come; if False, collect all results
|
||||
|
||||
Returns:
|
||||
Either an async generator yielding CrawlResults or a final ScraperResult
|
||||
"""
|
||||
self._progress = ScrapingProgress() # Reset progress
|
||||
|
||||
async with self._error_handling_context(url):
|
||||
if stream:
|
||||
return self._ascrape_yielding(url, parallel_processing)
|
||||
return await self._ascrape_collecting(url, parallel_processing)
|
||||
|
||||
async def _ascrape_yielding(
|
||||
self,
|
||||
url: str,
|
||||
parallel_processing: bool
|
||||
) -> AsyncGenerator[CrawlResult, None]:
|
||||
"""Stream scraping results as they become available."""
|
||||
try:
|
||||
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
|
||||
async for res in result_generator:
|
||||
self._progress.processed_urls += 1
|
||||
self._progress.current_url = res.url
|
||||
yield res
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error in streaming scrape: {str(e)}")
|
||||
raise
|
||||
|
||||
async def _ascrape_collecting(
|
||||
self,
|
||||
url: str,
|
||||
parallel_processing: bool
|
||||
) -> ScraperResult:
|
||||
"""Collect all scraping results before returning."""
|
||||
extracted_data = {}
|
||||
|
||||
try:
|
||||
result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
|
||||
async for res in result_generator:
|
||||
self._progress.processed_urls += 1
|
||||
self._progress.current_url = res.url
|
||||
extracted_data[res.url] = res
|
||||
|
||||
return ScraperResult(
|
||||
url=url,
|
||||
crawled_urls=list(extracted_data.keys()),
|
||||
extracted_data=extracted_data,
|
||||
stats={
|
||||
'processed_urls': self._progress.processed_urls,
|
||||
'failed_urls': self._progress.failed_urls
|
||||
}
|
||||
)
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error in collecting scrape: {str(e)}")
|
||||
raise
|
||||
327
crawl4ai/scraper/bfs_scraper_strategy.py
Normal file
327
crawl4ai/scraper/bfs_scraper_strategy.py
Normal file
@@ -0,0 +1,327 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Union, AsyncGenerator, Optional, Dict, Set
|
||||
from dataclasses import dataclass
|
||||
from datetime import datetime
|
||||
import asyncio
|
||||
import logging
|
||||
from urllib.parse import urljoin, urlparse, urlunparse
|
||||
from urllib.robotparser import RobotFileParser
|
||||
import validators
|
||||
import time
|
||||
from aiolimiter import AsyncLimiter
|
||||
from tenacity import retry, stop_after_attempt, wait_exponential
|
||||
from collections import defaultdict
|
||||
|
||||
from .models import ScraperResult, CrawlResult
|
||||
from .filters import FilterChain
|
||||
from .scorers import URLScorer
|
||||
from ..async_webcrawler import AsyncWebCrawler
|
||||
|
||||
@dataclass
|
||||
class CrawlStats:
|
||||
"""Statistics for the crawling process"""
|
||||
start_time: datetime
|
||||
urls_processed: int = 0
|
||||
urls_failed: int = 0
|
||||
urls_skipped: int = 0
|
||||
total_depth_reached: int = 0
|
||||
current_depth: int = 0
|
||||
robots_blocked: int = 0
|
||||
|
||||
class ScraperStrategy(ABC):
|
||||
"""Base class for scraping strategies"""
|
||||
|
||||
@abstractmethod
|
||||
async def ascrape(
|
||||
self,
|
||||
url: str,
|
||||
crawler: AsyncWebCrawler,
|
||||
parallel_processing: bool = True,
|
||||
stream: bool = False
|
||||
) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
|
||||
"""Abstract method for scraping implementation"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def can_process_url(self, url: str) -> bool:
|
||||
"""Check if URL can be processed based on strategy rules"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def shutdown(self):
|
||||
"""Clean up resources used by the strategy"""
|
||||
pass
|
||||
|
||||
class BFSScraperStrategy(ScraperStrategy):
|
||||
"""Breadth-First Search scraping strategy with politeness controls"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
max_depth: int,
|
||||
filter_chain: FilterChain,
|
||||
url_scorer: URLScorer,
|
||||
max_concurrent: int = 5,
|
||||
min_crawl_delay: int = 1,
|
||||
timeout: int = 30,
|
||||
logger: Optional[logging.Logger] = None
|
||||
):
|
||||
self.max_depth = max_depth
|
||||
self.filter_chain = filter_chain
|
||||
self.url_scorer = url_scorer
|
||||
self.max_concurrent = max_concurrent
|
||||
self.min_crawl_delay = min_crawl_delay
|
||||
self.timeout = timeout
|
||||
self.logger = logger or logging.getLogger(__name__)
|
||||
|
||||
# Crawl control
|
||||
self.stats = CrawlStats(start_time=datetime.now())
|
||||
self._cancel_event = asyncio.Event()
|
||||
self.process_external_links = False
|
||||
|
||||
# Rate limiting and politeness
|
||||
self.rate_limiter = AsyncLimiter(1, 1)
|
||||
self.last_crawl_time = defaultdict(float)
|
||||
self.robot_parsers: Dict[str, RobotFileParser] = {}
|
||||
self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
|
||||
|
||||
async def can_process_url(self, url: str) -> bool:
|
||||
"""Check if URL can be processed based on robots.txt and filters
|
||||
This is our gatekeeper method that determines if a URL should be processed. It:
|
||||
- Validates URL format using the validators library
|
||||
- Checks robots.txt permissions for the domain
|
||||
- Applies custom filters from the filter chain
|
||||
- Updates statistics for blocked URLs
|
||||
- Returns False early if any check fails
|
||||
"""
|
||||
if not validators.url(url):
|
||||
self.logger.warning(f"Invalid URL: {url}")
|
||||
return False
|
||||
|
||||
robot_parser = await self._get_robot_parser(url)
|
||||
if robot_parser and not robot_parser.can_fetch("*", url):
|
||||
self.stats.robots_blocked += 1
|
||||
self.logger.info(f"Blocked by robots.txt: {url}")
|
||||
return False
|
||||
|
||||
return self.filter_chain.apply(url)
|
||||
|
||||
async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
|
||||
"""Get or create robots.txt parser for domain.
|
||||
This is our robots.txt manager that:
|
||||
- Uses domain-level caching of robot parsers
|
||||
- Creates and caches new parsers as needed
|
||||
- Handles failed robots.txt fetches gracefully
|
||||
- Returns None if robots.txt can't be fetched, allowing crawling to proceed
|
||||
"""
|
||||
domain = urlparse(url).netloc
|
||||
if domain not in self.robot_parsers:
|
||||
parser = RobotFileParser()
|
||||
try:
|
||||
robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
|
||||
parser.set_url(robots_url)
|
||||
parser.read()
|
||||
self.robot_parsers[domain] = parser
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
|
||||
return None
|
||||
return self.robot_parsers[domain]
|
||||
|
||||
@retry(stop=stop_after_attempt(3),
|
||||
wait=wait_exponential(multiplier=1, min=4, max=10))
|
||||
async def _crawl_with_retry(
|
||||
self,
|
||||
crawler: AsyncWebCrawler,
|
||||
url: str
|
||||
) -> CrawlResult:
|
||||
"""Crawl URL with retry logic"""
|
||||
try:
|
||||
async with asyncio.timeout(self.timeout):
|
||||
return await crawler.arun(url)
|
||||
except asyncio.TimeoutError:
|
||||
self.logger.error(f"Timeout crawling {url}")
|
||||
raise
|
||||
|
||||
async def process_url(
|
||||
self,
|
||||
url: str,
|
||||
depth: int,
|
||||
crawler: AsyncWebCrawler,
|
||||
queue: asyncio.PriorityQueue,
|
||||
visited: Set[str],
|
||||
depths: Dict[str, int]
|
||||
) -> Optional[CrawlResult]:
|
||||
"""Process a single URL and extract links.
|
||||
This is our main URL processing workhorse that:
|
||||
- Checks for cancellation
|
||||
- Validates URLs through can_process_url
|
||||
- Implements politeness delays per domain
|
||||
- Applies rate limiting
|
||||
- Handles crawling with retries
|
||||
- Updates various statistics
|
||||
- Processes extracted links
|
||||
- Returns the crawl result or None on failure
|
||||
"""
|
||||
|
||||
if self._cancel_event.is_set():
|
||||
return None
|
||||
|
||||
if not await self.can_process_url(url):
|
||||
self.stats.urls_skipped += 1
|
||||
return None
|
||||
|
||||
# Politeness delay
|
||||
domain = urlparse(url).netloc
|
||||
time_since_last = time.time() - self.last_crawl_time[domain]
|
||||
if time_since_last < self.min_crawl_delay:
|
||||
await asyncio.sleep(self.min_crawl_delay - time_since_last)
|
||||
self.last_crawl_time[domain] = time.time()
|
||||
|
||||
# Crawl with rate limiting
|
||||
try:
|
||||
async with self.rate_limiter:
|
||||
result = await self._crawl_with_retry(crawler, url)
|
||||
self.stats.urls_processed += 1
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error crawling {url}: {e}")
|
||||
self.stats.urls_failed += 1
|
||||
return None
|
||||
|
||||
# Process links
|
||||
await self._process_links(result, url, depth, queue, visited, depths)
|
||||
|
||||
return result
|
||||
|
||||
async def _process_links(
|
||||
self,
|
||||
result: CrawlResult,
|
||||
source_url: str,
|
||||
depth: int,
|
||||
queue: asyncio.PriorityQueue,
|
||||
visited: Set[str],
|
||||
depths: Dict[str, int]
|
||||
):
|
||||
"""Process extracted links from crawl result.
|
||||
This is our link processor that:
|
||||
Handles both internal and external links
|
||||
Normalizes URLs (removes fragments)
|
||||
Checks depth limits
|
||||
Scores URLs for priority
|
||||
Updates depth tracking
|
||||
Adds valid URLs to the queue
|
||||
Updates maximum depth statistics
|
||||
"""
|
||||
links_ro_process = result.links["internal"]
|
||||
if self.process_external_links:
|
||||
links_ro_process += result.links["external"]
|
||||
for link_type in links_ro_process:
|
||||
for link in result.links[link_type]:
|
||||
url = link['href']
|
||||
# url = urljoin(source_url, link['href'])
|
||||
# url = urlunparse(urlparse(url)._replace(fragment=""))
|
||||
|
||||
if url not in visited and await self.can_process_url(url):
|
||||
new_depth = depths[source_url] + 1
|
||||
if new_depth <= self.max_depth:
|
||||
score = self.url_scorer.score(url)
|
||||
await queue.put((score, new_depth, url))
|
||||
depths[url] = new_depth
|
||||
self.stats.total_depth_reached = max(
|
||||
self.stats.total_depth_reached,
|
||||
new_depth
|
||||
)
|
||||
|
||||
async def ascrape(
|
||||
self,
|
||||
start_url: str,
|
||||
crawler: AsyncWebCrawler,
|
||||
parallel_processing: bool = True
|
||||
) -> AsyncGenerator[CrawlResult, None]:
|
||||
"""Implement BFS crawling strategy"""
|
||||
|
||||
# Initialize crawl state
|
||||
"""
|
||||
queue: A priority queue where items are tuples of (score, depth, url)
|
||||
Score: Determines crawling priority (lower = higher priority)
|
||||
Depth: Current distance from start_url
|
||||
URL: The actual URL to crawl
|
||||
visited: Keeps track of URLs we've already seen to avoid cycles
|
||||
depths: Maps URLs to their depths from the start URL
|
||||
pending_tasks: Tracks currently running crawl tasks
|
||||
"""
|
||||
queue = asyncio.PriorityQueue()
|
||||
await queue.put((0, 0, start_url))
|
||||
visited: Set[str] = set()
|
||||
depths = {start_url: 0}
|
||||
pending_tasks = set()
|
||||
|
||||
try:
|
||||
while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
|
||||
"""
|
||||
This sets up our main control loop which:
|
||||
- Continues while there are URLs to process (not queue.empty())
|
||||
- Or while there are tasks still running (pending_tasks)
|
||||
- Can be interrupted via cancellation (not self._cancel_event.is_set())
|
||||
"""
|
||||
# Start new tasks up to max_concurrent
|
||||
while not queue.empty() and len(pending_tasks) < self.max_concurrent:
|
||||
"""
|
||||
This section manages task creation:
|
||||
Checks if we can start more tasks (under max_concurrent limit)
|
||||
Gets the next URL from the priority queue
|
||||
Marks URLs as visited immediately to prevent duplicates
|
||||
Updates current depth in stats
|
||||
Either:
|
||||
Creates a new async task (parallel mode)
|
||||
Processes URL directly (sequential mode)
|
||||
"""
|
||||
_, depth, url = await queue.get()
|
||||
if url not in visited:
|
||||
visited.add(url)
|
||||
self.stats.current_depth = depth
|
||||
|
||||
if parallel_processing:
|
||||
task = asyncio.create_task(
|
||||
self.process_url(url, depth, crawler, queue, visited, depths)
|
||||
)
|
||||
pending_tasks.add(task)
|
||||
else:
|
||||
result = await self.process_url(
|
||||
url, depth, crawler, queue, visited, depths
|
||||
)
|
||||
if result:
|
||||
yield result
|
||||
|
||||
# Process completed tasks
|
||||
"""
|
||||
This section manages completed tasks:
|
||||
Waits for any task to complete using asyncio.wait
|
||||
Uses FIRST_COMPLETED to handle results as soon as they're ready
|
||||
Yields successful results to the caller
|
||||
Updates pending_tasks to remove completed ones
|
||||
"""
|
||||
if pending_tasks:
|
||||
done, pending_tasks = await asyncio.wait(
|
||||
pending_tasks,
|
||||
return_when=asyncio.FIRST_COMPLETED
|
||||
)
|
||||
for task in done:
|
||||
result = await task
|
||||
if result:
|
||||
yield result
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error in crawl process: {e}")
|
||||
raise
|
||||
|
||||
finally:
|
||||
# Clean up any remaining tasks
|
||||
for task in pending_tasks:
|
||||
task.cancel()
|
||||
self.stats.end_time = datetime.now()
|
||||
|
||||
async def shutdown(self):
|
||||
"""Clean up resources and stop crawling"""
|
||||
self._cancel_event.set()
|
||||
# Clear caches and close connections
|
||||
self.robot_parsers.clear()
|
||||
self.domain_queues.clear()
|
||||
205
crawl4ai/scraper/filters.py
Normal file
205
crawl4ai/scraper/filters.py
Normal file
@@ -0,0 +1,205 @@
|
||||
# from .url_filter import URLFilter, FilterChain
|
||||
# from .content_type_filter import ContentTypeFilter
|
||||
# from .url_pattern_filter import URLPatternFilter
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import List, Pattern, Set, Union
|
||||
import re
|
||||
from urllib.parse import urlparse
|
||||
import mimetypes
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
import fnmatch
|
||||
|
||||
@dataclass
|
||||
class FilterStats:
|
||||
"""Statistics for filter applications"""
|
||||
total_urls: int = 0
|
||||
rejected_urls: int = 0
|
||||
passed_urls: int = 0
|
||||
|
||||
class URLFilter(ABC):
|
||||
"""Base class for URL filters"""
|
||||
|
||||
def __init__(self, name: str = None):
|
||||
self.name = name or self.__class__.__name__
|
||||
self.stats = FilterStats()
|
||||
self.logger = logging.getLogger(f"urlfilter.{self.name}")
|
||||
|
||||
@abstractmethod
|
||||
def apply(self, url: str) -> bool:
|
||||
"""Apply the filter to a URL"""
|
||||
pass
|
||||
|
||||
def _update_stats(self, passed: bool):
|
||||
"""Update filter statistics"""
|
||||
self.stats.total_urls += 1
|
||||
if passed:
|
||||
self.stats.passed_urls += 1
|
||||
else:
|
||||
self.stats.rejected_urls += 1
|
||||
|
||||
class FilterChain:
|
||||
"""Chain of URL filters."""
|
||||
|
||||
def __init__(self, filters: List[URLFilter] = None):
|
||||
self.filters = filters or []
|
||||
self.stats = FilterStats()
|
||||
self.logger = logging.getLogger("urlfilter.chain")
|
||||
|
||||
def add_filter(self, filter_: URLFilter) -> 'FilterChain':
|
||||
"""Add a filter to the chain"""
|
||||
self.filters.append(filter_)
|
||||
return self # Enable method chaining
|
||||
|
||||
def apply(self, url: str) -> bool:
|
||||
"""Apply all filters in the chain"""
|
||||
self.stats.total_urls += 1
|
||||
|
||||
for filter_ in self.filters:
|
||||
if not filter_.apply(url):
|
||||
self.stats.rejected_urls += 1
|
||||
self.logger.debug(f"URL {url} rejected by {filter_.name}")
|
||||
return False
|
||||
|
||||
self.stats.passed_urls += 1
|
||||
return True
|
||||
|
||||
class URLPatternFilter(URLFilter):
|
||||
"""Filter URLs based on glob patterns or regex.
|
||||
|
||||
pattern_filter = URLPatternFilter([
|
||||
"*.example.com/*", # Glob pattern
|
||||
"*/article/*", # Path pattern
|
||||
re.compile(r"blog-\d+") # Regex pattern
|
||||
])
|
||||
|
||||
- Supports glob patterns and regex
|
||||
- Multiple patterns per filter
|
||||
- Pattern pre-compilation for performance
|
||||
"""
|
||||
|
||||
def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]],
|
||||
use_glob: bool = True):
|
||||
super().__init__()
|
||||
self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
|
||||
self.use_glob = use_glob
|
||||
self._compiled_patterns = []
|
||||
|
||||
for pattern in self.patterns:
|
||||
if isinstance(pattern, str) and use_glob:
|
||||
self._compiled_patterns.append(self._glob_to_regex(pattern))
|
||||
else:
|
||||
self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
|
||||
|
||||
def _glob_to_regex(self, pattern: str) -> Pattern:
|
||||
"""Convert glob pattern to regex"""
|
||||
return re.compile(fnmatch.translate(pattern))
|
||||
|
||||
def apply(self, url: str) -> bool:
|
||||
"""Check if URL matches any of the patterns"""
|
||||
matches = any(pattern.search(url) for pattern in self._compiled_patterns)
|
||||
self._update_stats(matches)
|
||||
return matches
|
||||
|
||||
class ContentTypeFilter(URLFilter):
|
||||
"""Filter URLs based on expected content type.
|
||||
|
||||
content_filter = ContentTypeFilter([
|
||||
"text/html",
|
||||
"application/pdf"
|
||||
], check_extension=True)
|
||||
|
||||
- Filter by MIME types
|
||||
- Extension checking
|
||||
- Support for multiple content types
|
||||
"""
|
||||
|
||||
def __init__(self, allowed_types: Union[str, List[str]],
|
||||
check_extension: bool = True):
|
||||
super().__init__()
|
||||
self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
|
||||
self.check_extension = check_extension
|
||||
self._normalize_types()
|
||||
|
||||
def _normalize_types(self):
|
||||
"""Normalize content type strings"""
|
||||
self.allowed_types = [t.lower() for t in self.allowed_types]
|
||||
|
||||
def _check_extension(self, url: str) -> bool:
|
||||
"""Check URL's file extension"""
|
||||
ext = urlparse(url).path.split('.')[-1].lower() if '.' in urlparse(url).path else ''
|
||||
if not ext:
|
||||
return True # No extension, might be dynamic content
|
||||
|
||||
guessed_type = mimetypes.guess_type(url)[0]
|
||||
return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)
|
||||
|
||||
def apply(self, url: str) -> bool:
|
||||
"""Check if URL's content type is allowed"""
|
||||
result = True
|
||||
if self.check_extension:
|
||||
result = self._check_extension(url)
|
||||
self._update_stats(result)
|
||||
return result
|
||||
|
||||
class DomainFilter(URLFilter):
|
||||
"""Filter URLs based on allowed/blocked domains.
|
||||
|
||||
domain_filter = DomainFilter(
|
||||
allowed_domains=["example.com", "blog.example.com"],
|
||||
blocked_domains=["ads.example.com"]
|
||||
)
|
||||
|
||||
- Allow/block specific domains
|
||||
- Subdomain support
|
||||
- Efficient domain matching
|
||||
"""
|
||||
|
||||
def __init__(self, allowed_domains: Union[str, List[str]] = None,
|
||||
blocked_domains: Union[str, List[str]] = None):
|
||||
super().__init__()
|
||||
self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
|
||||
self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
|
||||
|
||||
def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
|
||||
"""Normalize domain strings"""
|
||||
if isinstance(domains, str):
|
||||
domains = [domains]
|
||||
return [d.lower().strip() for d in domains]
|
||||
|
||||
def _extract_domain(self, url: str) -> str:
|
||||
"""Extract domain from URL"""
|
||||
return urlparse(url).netloc.lower()
|
||||
|
||||
def apply(self, url: str) -> bool:
|
||||
"""Check if URL's domain is allowed"""
|
||||
domain = self._extract_domain(url)
|
||||
|
||||
if domain in self.blocked_domains:
|
||||
self._update_stats(False)
|
||||
return False
|
||||
|
||||
if self.allowed_domains is not None and domain not in self.allowed_domains:
|
||||
self._update_stats(False)
|
||||
return False
|
||||
|
||||
self._update_stats(True)
|
||||
return True
|
||||
|
||||
# Example usage:
|
||||
def create_common_filter_chain() -> FilterChain:
|
||||
"""Create a commonly used filter chain"""
|
||||
return FilterChain([
|
||||
URLPatternFilter([
|
||||
"*.html", "*.htm", # HTML files
|
||||
"*/article/*", "*/blog/*" # Common content paths
|
||||
]),
|
||||
ContentTypeFilter([
|
||||
"text/html",
|
||||
"application/xhtml+xml"
|
||||
]),
|
||||
DomainFilter(
|
||||
blocked_domains=["ads.*", "analytics.*"]
|
||||
)
|
||||
])
|
||||
8
crawl4ai/scraper/models.py
Normal file
8
crawl4ai/scraper/models.py
Normal file
@@ -0,0 +1,8 @@
|
||||
from pydantic import BaseModel
|
||||
from typing import List, Dict
|
||||
from ..models import CrawlResult
|
||||
|
||||
class ScraperResult(BaseModel):
|
||||
url: str
|
||||
crawled_urls: List[str]
|
||||
extracted_data: Dict[str,CrawlResult]
|
||||
268
crawl4ai/scraper/scorers.py
Normal file
268
crawl4ai/scraper/scorers.py
Normal file
@@ -0,0 +1,268 @@
|
||||
# from .url_scorer import URLScorer
|
||||
# from .keyword_relevance_scorer import KeywordRelevanceScorer
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import List, Dict, Optional, Union
|
||||
from dataclasses import dataclass
|
||||
from urllib.parse import urlparse, unquote
|
||||
import re
|
||||
from collections import defaultdict
|
||||
import math
|
||||
import logging
|
||||
|
||||
@dataclass
|
||||
class ScoringStats:
|
||||
"""Statistics for URL scoring"""
|
||||
urls_scored: int = 0
|
||||
total_score: float = 0.0
|
||||
min_score: float = float('inf')
|
||||
max_score: float = float('-inf')
|
||||
|
||||
def update(self, score: float):
|
||||
"""Update scoring statistics"""
|
||||
self.urls_scored += 1
|
||||
self.total_score += score
|
||||
self.min_score = min(self.min_score, score)
|
||||
self.max_score = max(self.max_score, score)
|
||||
|
||||
@property
|
||||
def average_score(self) -> float:
|
||||
"""Calculate average score"""
|
||||
return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
|
||||
|
||||
class URLScorer(ABC):
|
||||
"""Base class for URL scoring strategies"""
|
||||
|
||||
def __init__(self, weight: float = 1.0, name: str = None):
|
||||
self.weight = weight
|
||||
self.name = name or self.__class__.__name__
|
||||
self.stats = ScoringStats()
|
||||
self.logger = logging.getLogger(f"urlscorer.{self.name}")
|
||||
|
||||
@abstractmethod
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate the raw score for a URL"""
|
||||
pass
|
||||
|
||||
def score(self, url: str) -> float:
|
||||
"""Calculate the weighted score for a URL"""
|
||||
raw_score = self._calculate_score(url)
|
||||
weighted_score = raw_score * self.weight
|
||||
self.stats.update(weighted_score)
|
||||
return weighted_score
|
||||
|
||||
class CompositeScorer(URLScorer):
|
||||
"""Combines multiple scorers with weights"""
|
||||
|
||||
def __init__(self, scorers: List[URLScorer], normalize: bool = True):
|
||||
super().__init__(name="CompositeScorer")
|
||||
self.scorers = scorers
|
||||
self.normalize = normalize
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
scores = [scorer.score(url) for scorer in self.scorers]
|
||||
total_score = sum(scores)
|
||||
|
||||
if self.normalize and scores:
|
||||
total_score /= len(scores)
|
||||
|
||||
return total_score
|
||||
|
||||
class KeywordRelevanceScorer(URLScorer):
|
||||
"""Score URLs based on keyword relevance.
|
||||
|
||||
keyword_scorer = KeywordRelevanceScorer(
|
||||
keywords=["python", "programming"],
|
||||
weight=1.0,
|
||||
case_sensitive=False
|
||||
)
|
||||
|
||||
- Score based on keyword matches
|
||||
- Case sensitivity options
|
||||
- Weighted scoring
|
||||
"""
|
||||
|
||||
def __init__(self, keywords: List[str], weight: float = 1.0,
|
||||
case_sensitive: bool = False):
|
||||
super().__init__(weight=weight)
|
||||
self.keywords = keywords
|
||||
self.case_sensitive = case_sensitive
|
||||
self._compile_keywords()
|
||||
|
||||
def _compile_keywords(self):
|
||||
"""Prepare keywords for matching"""
|
||||
flags = 0 if self.case_sensitive else re.IGNORECASE
|
||||
self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate score based on keyword matches"""
|
||||
decoded_url = unquote(url)
|
||||
total_matches = sum(
|
||||
1 for pattern in self.patterns
|
||||
if pattern.search(decoded_url)
|
||||
)
|
||||
# Normalize score between 0 and 1
|
||||
return total_matches / len(self.patterns) if self.patterns else 0.0
|
||||
|
||||
class PathDepthScorer(URLScorer):
|
||||
"""Score URLs based on their path depth.
|
||||
|
||||
path_scorer = PathDepthScorer(
|
||||
optimal_depth=3, # Preferred URL depth
|
||||
weight=0.7
|
||||
)
|
||||
|
||||
- Score based on URL path depth
|
||||
- Configurable optimal depth
|
||||
- Diminishing returns for deeper paths
|
||||
"""
|
||||
|
||||
def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
|
||||
super().__init__(weight=weight)
|
||||
self.optimal_depth = optimal_depth
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate score based on path depth"""
|
||||
path = urlparse(url).path
|
||||
depth = len([x for x in path.split('/') if x])
|
||||
|
||||
# Score decreases as we move away from optimal depth
|
||||
distance_from_optimal = abs(depth - self.optimal_depth)
|
||||
return 1.0 / (1.0 + distance_from_optimal)
|
||||
|
||||
class ContentTypeScorer(URLScorer):
|
||||
"""Score URLs based on content type preferences.
|
||||
|
||||
content_scorer = ContentTypeScorer({
|
||||
r'\.html$': 1.0,
|
||||
r'\.pdf$': 0.8,
|
||||
r'\.xml$': 0.6
|
||||
})
|
||||
|
||||
- Score based on file types
|
||||
- Configurable type weights
|
||||
- Pattern matching support
|
||||
"""
|
||||
|
||||
def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
|
||||
super().__init__(weight=weight)
|
||||
self.type_weights = type_weights
|
||||
self._compile_patterns()
|
||||
|
||||
def _compile_patterns(self):
|
||||
"""Prepare content type patterns"""
|
||||
self.patterns = {
|
||||
re.compile(pattern): weight
|
||||
for pattern, weight in self.type_weights.items()
|
||||
}
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate score based on content type matching"""
|
||||
for pattern, weight in self.patterns.items():
|
||||
if pattern.search(url):
|
||||
return weight
|
||||
return 0.0
|
||||
|
||||
class FreshnessScorer(URLScorer):
|
||||
"""Score URLs based on freshness indicators.
|
||||
|
||||
freshness_scorer = FreshnessScorer(weight=0.9)
|
||||
|
||||
Score based on date indicators in URLs
|
||||
Multiple date format support
|
||||
Recency weighting"""
|
||||
|
||||
def __init__(self, weight: float = 1.0):
|
||||
super().__init__(weight=weight)
|
||||
self.date_patterns = [
|
||||
r'/(\d{4})/(\d{2})/(\d{2})/', # yyyy/mm/dd
|
||||
r'(\d{4})[-_](\d{2})[-_](\d{2})', # yyyy-mm-dd
|
||||
r'/(\d{4})/', # year only
|
||||
]
|
||||
self._compile_patterns()
|
||||
|
||||
def _compile_patterns(self):
|
||||
"""Prepare date patterns"""
|
||||
self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate score based on date indicators"""
|
||||
for pattern in self.compiled_patterns:
|
||||
if match := pattern.search(url):
|
||||
year = int(match.group(1))
|
||||
# Score higher for more recent years
|
||||
return 1.0 - (2024 - year) * 0.1
|
||||
return 0.5 # Default score for URLs without dates
|
||||
|
||||
class DomainAuthorityScorer(URLScorer):
|
||||
"""Score URLs based on domain authority.
|
||||
|
||||
authority_scorer = DomainAuthorityScorer({
|
||||
"python.org": 1.0,
|
||||
"github.com": 0.9,
|
||||
"medium.com": 0.7
|
||||
})
|
||||
|
||||
Score based on domain importance
|
||||
Configurable domain weights
|
||||
Default weight for unknown domains"""
|
||||
|
||||
def __init__(self, domain_weights: Dict[str, float],
|
||||
default_weight: float = 0.5, weight: float = 1.0):
|
||||
super().__init__(weight=weight)
|
||||
self.domain_weights = domain_weights
|
||||
self.default_weight = default_weight
|
||||
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
"""Calculate score based on domain authority"""
|
||||
domain = urlparse(url).netloc.lower()
|
||||
return self.domain_weights.get(domain, self.default_weight)
|
||||
|
||||
def create_balanced_scorer() -> CompositeScorer:
|
||||
"""Create a balanced composite scorer"""
|
||||
return CompositeScorer([
|
||||
KeywordRelevanceScorer(
|
||||
keywords=["article", "blog", "news", "research"],
|
||||
weight=1.0
|
||||
),
|
||||
PathDepthScorer(
|
||||
optimal_depth=3,
|
||||
weight=0.7
|
||||
),
|
||||
ContentTypeScorer(
|
||||
type_weights={
|
||||
r'\.html?$': 1.0,
|
||||
r'\.pdf$': 0.8,
|
||||
r'\.xml$': 0.6
|
||||
},
|
||||
weight=0.8
|
||||
),
|
||||
FreshnessScorer(
|
||||
weight=0.9
|
||||
)
|
||||
])
|
||||
|
||||
# Example Usage:
|
||||
"""
|
||||
# Create a composite scorer
|
||||
scorer = CompositeScorer([
|
||||
KeywordRelevanceScorer(["python", "programming"], weight=1.0),
|
||||
PathDepthScorer(optimal_depth=2, weight=0.7),
|
||||
FreshnessScorer(weight=0.8),
|
||||
DomainAuthorityScorer(
|
||||
domain_weights={
|
||||
"python.org": 1.0,
|
||||
"github.com": 0.9,
|
||||
"medium.com": 0.7
|
||||
},
|
||||
weight=0.9
|
||||
)
|
||||
])
|
||||
|
||||
# Score a URL
|
||||
score = scorer.score("https://python.org/article/2024/01/new-features")
|
||||
|
||||
# Access statistics
|
||||
print(f"Average score: {scorer.stats.average_score}")
|
||||
print(f"URLs scored: {scorer.stats.urls_scored}")
|
||||
"""
|
||||
26
crawl4ai/scraper/scraper_strategy.py
Normal file
26
crawl4ai/scraper/scraper_strategy.py
Normal file
@@ -0,0 +1,26 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from .models import ScraperResult, CrawlResult
|
||||
from ..models import CrawlResult
|
||||
from ..async_webcrawler import AsyncWebCrawler
|
||||
from typing import Union, AsyncGenerator
|
||||
|
||||
class ScraperStrategy(ABC):
|
||||
@abstractmethod
|
||||
async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
|
||||
"""Scrape the given URL using the specified crawler.
|
||||
|
||||
Args:
|
||||
url (str): The starting URL for the scrape.
|
||||
crawler (AsyncWebCrawler): The web crawler instance.
|
||||
parallel_processing (bool): Whether to use parallel processing. Defaults to True.
|
||||
stream (bool): If True, yields individual crawl results as they are ready;
|
||||
if False, accumulates results and returns a final ScraperResult.
|
||||
|
||||
Yields:
|
||||
CrawlResult: Individual crawl results if stream is True.
|
||||
|
||||
Returns:
|
||||
ScraperResult: A summary of the scrape results containing the final extracted data
|
||||
and the list of crawled URLs if stream is False.
|
||||
"""
|
||||
pass
|
||||
@@ -131,7 +131,7 @@ def split_and_parse_json_objects(json_string):
|
||||
return parsed_objects, unparsed_segments
|
||||
|
||||
def sanitize_html(html):
|
||||
# Replace all weird and special characters with an empty string
|
||||
# Replace all unwanted and special characters with an empty string
|
||||
sanitized_html = html
|
||||
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
||||
|
||||
@@ -301,7 +301,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
||||
if tag.name != 'img':
|
||||
tag.attrs = {}
|
||||
|
||||
# Extract all img tgas inti [{src: '', alt: ''}]
|
||||
# Extract all img tgas int0 [{src: '', alt: ''}]
|
||||
media = {
|
||||
'images': [],
|
||||
'videos': [],
|
||||
@@ -339,7 +339,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
||||
img.decompose()
|
||||
|
||||
|
||||
# Create a function that replace content of all"pre" tage with its inner text
|
||||
# Create a function that replace content of all"pre" tag with its inner text
|
||||
def replace_pre_tags_with_text(node):
|
||||
for child in node.find_all('pre'):
|
||||
# set child inner html to its text
|
||||
@@ -502,7 +502,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
# Get the text content from the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
@@ -511,88 +511,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
return None
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
match = re.match(r"(\d+)(\D*)", dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
match = re.match(r"(\d+)(\D*)", dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
except InvalidSchema as e:
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
except InvalidSchema as e:
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= IMAGE_SCORE_THRESHOLD:
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= IMAGE_SCORE_THRESHOLD:
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', '').replace('\\"', '"').strip(),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
@@ -775,7 +775,14 @@ def extract_xml_data(tags, string):
|
||||
return data
|
||||
|
||||
# Function to perform the completion with exponential backoff
|
||||
def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
|
||||
def perform_completion_with_backoff(
|
||||
provider,
|
||||
prompt_with_variables,
|
||||
api_token,
|
||||
json_response = False,
|
||||
base_url=None,
|
||||
**kwargs
|
||||
):
|
||||
from litellm import completion
|
||||
from litellm.exceptions import RateLimitError
|
||||
max_attempts = 3
|
||||
@@ -784,6 +791,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
|
||||
extra_args = {}
|
||||
if json_response:
|
||||
extra_args["response_format"] = { "type": "json_object" }
|
||||
|
||||
if kwargs.get("extra_args"):
|
||||
extra_args.update(kwargs["extra_args"])
|
||||
|
||||
for attempt in range(max_attempts):
|
||||
try:
|
||||
|
||||
@@ -12,6 +12,7 @@ from typing import List
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .config import *
|
||||
import warnings
|
||||
import json
|
||||
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
|
||||
|
||||
|
||||
|
||||
@@ -10,6 +10,7 @@ import time
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
from typing import Dict
|
||||
from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
|
||||
LLMExtractionStrategy,
|
||||
)
|
||||
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
|
||||
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
|
||||
print("Twitter: @unclecode")
|
||||
@@ -30,7 +33,7 @@ async def simple_crawl():
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def js_and_css():
|
||||
async def simple_example_with_running_js_code():
|
||||
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
||||
# New code to handle the wait_for parameter
|
||||
wait_for = """() => {
|
||||
@@ -47,12 +50,21 @@ async def js_and_css():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
# css_selector="article.tease-card",
|
||||
# wait_for=wait_for,
|
||||
bypass_cache=True,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def simple_example_with_css_selector():
|
||||
print("\n--- Using CSS Selectors ---")
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def use_proxy():
|
||||
print("\n--- Using a Proxy ---")
|
||||
print(
|
||||
@@ -66,6 +78,28 @@ async def use_proxy():
|
||||
# )
|
||||
# print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
screenshot=True,
|
||||
bypass_cache=True
|
||||
)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
|
||||
# Decode the base64 screenshot data
|
||||
screenshot_data = base64.b64decode(result.screenshot)
|
||||
|
||||
# Save the screenshot as a JPEG file
|
||||
with open(output_path, 'wb') as f:
|
||||
f.write(screenshot_data)
|
||||
|
||||
print(f"Screenshot saved successfully to {output_path}")
|
||||
else:
|
||||
print("Failed to capture screenshot")
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
|
||||
..., description="Fee for output token for the OpenAI model."
|
||||
)
|
||||
|
||||
async def extract_structured_data_using_llm():
|
||||
print("\n--- Extracting Structured Data with OpenAI ---")
|
||||
print(
|
||||
"Note: Set your OpenAI API key as an environment variable to run this example."
|
||||
)
|
||||
if not os.getenv("OPENAI_API_KEY"):
|
||||
print("OpenAI API key not found. Skipping this example.")
|
||||
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
|
||||
print(f"\n--- Extracting Structured Data with {provider} ---")
|
||||
|
||||
if api_token is None and provider != "ollama":
|
||||
print(f"API token is required for {provider}. Skipping this example.")
|
||||
return
|
||||
|
||||
extra_args = {}
|
||||
if extra_headers:
|
||||
extra_args["extra_headers"] = extra_headers
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://openai.com/api/pricing/",
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
|
||||
extra_args=extra_args
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
@@ -320,6 +357,28 @@ async def crawl_dynamic_content_pages_method_3():
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
async def crawl_custom_browser_type():
|
||||
# Use Firefox
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use WebKit
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use Chromium (default)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
async def speed_comparison():
|
||||
# print("\n--- Speed Comparison ---")
|
||||
# print("Firecrawl (simulated):")
|
||||
@@ -387,13 +446,31 @@ async def speed_comparison():
|
||||
|
||||
async def main():
|
||||
await simple_crawl()
|
||||
await js_and_css()
|
||||
await simple_example_with_running_js_code()
|
||||
await simple_example_with_css_selector()
|
||||
await use_proxy()
|
||||
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||
await extract_structured_data_using_css_extractor()
|
||||
|
||||
# LLM extraction examples
|
||||
await extract_structured_data_using_llm()
|
||||
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||
await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
|
||||
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
|
||||
# You always can pass custom headers to the extraction strategy
|
||||
custom_headers = {
|
||||
"Authorization": "Bearer your-custom-token",
|
||||
"X-Custom-Header": "Some-Value"
|
||||
}
|
||||
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||
|
||||
# await crawl_dynamic_content_pages_method_1()
|
||||
# await crawl_dynamic_content_pages_method_2()
|
||||
await crawl_dynamic_content_pages_method_3()
|
||||
|
||||
await crawl_custom_browser_type()
|
||||
|
||||
await speed_comparison()
|
||||
|
||||
|
||||
|
||||
166
docs/scrapper/async_web_scraper.md
Normal file
166
docs/scrapper/async_web_scraper.md
Normal file
@@ -0,0 +1,166 @@
|
||||
# AsyncWebScraper: Smart Web Crawling Made Easy
|
||||
|
||||
AsyncWebScraper is a powerful and flexible web scraping tool that makes it easy to collect data from websites efficiently. Whether you need to scrape a few pages or an entire website, AsyncWebScraper handles the complexity of web crawling while giving you fine-grained control over the process.
|
||||
|
||||
## How It Works
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
|
||||
Init --> InputURL[Receive URL to scrape]
|
||||
InputURL --> Decision{Stream or\nCollect?}
|
||||
|
||||
%% Streaming Path
|
||||
Decision -->|Stream| StreamInit[Initialize Streaming Mode]
|
||||
StreamInit --> StreamStrategy[Call Strategy.ascrape]
|
||||
StreamStrategy --> AsyncGen[Create Async Generator]
|
||||
AsyncGen --> ProcessURL[Process Next URL]
|
||||
ProcessURL --> FetchContent[Fetch Page Content]
|
||||
FetchContent --> Extract[Extract Data]
|
||||
Extract --> YieldResult[Yield CrawlResult]
|
||||
YieldResult --> CheckMore{More URLs?}
|
||||
CheckMore -->|Yes| ProcessURL
|
||||
CheckMore -->|No| StreamEnd([End Stream])
|
||||
|
||||
%% Collecting Path
|
||||
Decision -->|Collect| CollectInit[Initialize Collection Mode]
|
||||
CollectInit --> CollectStrategy[Call Strategy.ascrape]
|
||||
CollectStrategy --> CollectGen[Create Async Generator]
|
||||
CollectGen --> ProcessURLColl[Process Next URL]
|
||||
ProcessURLColl --> FetchContentColl[Fetch Page Content]
|
||||
FetchContentColl --> ExtractColl[Extract Data]
|
||||
ExtractColl --> StoreColl[Store in Dictionary]
|
||||
StoreColl --> CheckMoreColl{More URLs?}
|
||||
CheckMoreColl -->|Yes| ProcessURLColl
|
||||
CheckMoreColl -->|No| CreateResult[Create ScraperResult]
|
||||
CreateResult --> ReturnResult([Return Result])
|
||||
|
||||
%% Parallel Processing
|
||||
subgraph Parallel
|
||||
ProcessURL
|
||||
FetchContent
|
||||
Extract
|
||||
ProcessURLColl
|
||||
FetchContentColl
|
||||
ExtractColl
|
||||
end
|
||||
|
||||
%% Error Handling
|
||||
FetchContent --> ErrorCheck{Error?}
|
||||
ErrorCheck -->|Yes| LogError[Log Error]
|
||||
LogError --> UpdateStats[Update Error Stats]
|
||||
UpdateStats --> CheckMore
|
||||
ErrorCheck -->|No| Extract
|
||||
|
||||
FetchContentColl --> ErrorCheckColl{Error?}
|
||||
ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
|
||||
LogErrorColl --> UpdateStatsColl[Update Error Stats]
|
||||
UpdateStatsColl --> CheckMoreColl
|
||||
ErrorCheckColl -->|No| ExtractColl
|
||||
|
||||
%% Style definitions
|
||||
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
|
||||
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
|
||||
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
|
||||
classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;
|
||||
|
||||
class Start,StreamEnd,ReturnResult start;
|
||||
class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
|
||||
class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
|
||||
class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;
|
||||
```
|
||||
|
||||
AsyncWebScraper uses an intelligent crawling system that can navigate through websites following your specified strategy. It supports two main modes of operation:
|
||||
|
||||
### 1. Streaming Mode
|
||||
```python
|
||||
async for result in scraper.ascrape(url, stream=True):
|
||||
print(f"Found data on {result.url}")
|
||||
process_data(result.data)
|
||||
```
|
||||
- Perfect for processing large websites
|
||||
- Memory efficient - handles one page at a time
|
||||
- Ideal for real-time data processing
|
||||
- Great for monitoring or continuous scraping tasks
|
||||
|
||||
### 2. Collection Mode
|
||||
```python
|
||||
result = await scraper.ascrape(url)
|
||||
print(f"Scraped {len(result.crawled_urls)} pages")
|
||||
process_all_data(result.extracted_data)
|
||||
```
|
||||
- Collects all data before returning
|
||||
- Best for when you need the complete dataset
|
||||
- Easier to work with for batch processing
|
||||
- Includes comprehensive statistics
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Smart Crawling**: Automatically follows relevant links while avoiding duplicates
|
||||
- **Parallel Processing**: Scrapes multiple pages simultaneously for better performance
|
||||
- **Memory Efficient**: Choose between streaming and collecting based on your needs
|
||||
- **Error Resilient**: Continues working even if some pages fail to load
|
||||
- **Progress Tracking**: Monitor the scraping progress in real-time
|
||||
- **Customizable**: Configure crawling strategy, filters, and scoring to match your needs
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
|
||||
from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
# Initialize the scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
strategy = BFSStrategy(
|
||||
max_depth=2, # How deep to crawl
|
||||
url_pattern="*.example.com/*" # What URLs to follow
|
||||
)
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
|
||||
# Start scraping
|
||||
async def main():
|
||||
# Collect all results
|
||||
result = await scraper.ascrape("https://example.com")
|
||||
print(f"Found {len(result.extracted_data)} pages")
|
||||
|
||||
# Or stream results
|
||||
async for page in scraper.ascrape("https://example.com", stream=True):
|
||||
print(f"Processing {page.url}")
|
||||
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose the Right Mode**
|
||||
- Use streaming for large websites or real-time processing
|
||||
- Use collecting for smaller sites or when you need the complete dataset
|
||||
|
||||
2. **Configure Depth**
|
||||
- Start with a small depth (2-3) and increase if needed
|
||||
- Higher depths mean exponentially more pages to crawl
|
||||
|
||||
3. **Set Appropriate Filters**
|
||||
- Use URL patterns to stay within relevant sections
|
||||
- Set content type filters to only process useful pages
|
||||
|
||||
4. **Handle Resources Responsibly**
|
||||
- Enable parallel processing for faster results
|
||||
- Consider the target website's capacity
|
||||
- Implement appropriate delays between requests
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
- **Content Aggregation**: Collect articles, blog posts, or news from multiple pages
|
||||
- **Data Extraction**: Gather product information, prices, or specifications
|
||||
- **Site Mapping**: Create a complete map of a website's structure
|
||||
- **Content Monitoring**: Track changes or updates across multiple pages
|
||||
- **Data Mining**: Extract and analyze patterns across web pages
|
||||
|
||||
## Advanced Features
|
||||
|
||||
- Custom scoring algorithms for prioritizing important pages
|
||||
- URL filters for focusing on specific site sections
|
||||
- Content type filtering for processing only relevant pages
|
||||
- Progress tracking for monitoring long-running scrapes
|
||||
|
||||
Need more help? Check out our [examples repository](https://github.com/example/crawl4ai/examples) or join our [community Discord](https://discord.gg/example).
|
||||
244
docs/scrapper/bfs_scraper_strategy.md
Normal file
244
docs/scrapper/bfs_scraper_strategy.md
Normal file
@@ -0,0 +1,244 @@
|
||||
# BFS Scraper Strategy: Smart Web Traversal
|
||||
|
||||
The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([Start]) --> Init[Initialize BFS Strategy]
|
||||
Init --> InitStats[Initialize CrawlStats]
|
||||
InitStats --> InitQueue[Initialize Priority Queue]
|
||||
InitQueue --> AddStart[Add Start URL to Queue]
|
||||
|
||||
AddStart --> CheckState{Queue Empty or\nTasks Pending?}
|
||||
CheckState -->|No| Cleanup[Cleanup & Stats]
|
||||
Cleanup --> End([End])
|
||||
|
||||
CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
|
||||
CheckCancel -->|Yes| Cleanup
|
||||
|
||||
CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
|
||||
|
||||
CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
|
||||
WaitComplete --> YieldResult[Yield Result]
|
||||
YieldResult --> CheckState
|
||||
|
||||
CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
|
||||
|
||||
GetNextURL --> ValidateURL{Already\nVisited?}
|
||||
ValidateURL -->|Yes| CheckState
|
||||
|
||||
ValidateURL -->|No| ProcessURL[Process URL]
|
||||
|
||||
subgraph URL_Processing [URL Processing]
|
||||
ProcessURL --> CheckValid{URL Valid?}
|
||||
CheckValid -->|No| UpdateStats[Update Skip Stats]
|
||||
|
||||
CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
|
||||
CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
|
||||
|
||||
CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
|
||||
ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
|
||||
|
||||
FetchContent --> CheckError{Error?}
|
||||
CheckError -->|Yes| Retry{Retry\nNeeded?}
|
||||
Retry -->|Yes| FetchContent
|
||||
Retry -->|No| UpdateFailStats[Update Fail Stats]
|
||||
|
||||
CheckError -->|No| ExtractLinks[Extract & Process Links]
|
||||
ExtractLinks --> ScoreURLs[Score New URLs]
|
||||
ScoreURLs --> AddToQueue[Add to Priority Queue]
|
||||
end
|
||||
|
||||
ProcessURL --> CreateTask{Parallel\nProcessing?}
|
||||
CreateTask -->|Yes| AddTask[Add to Pending Tasks]
|
||||
CreateTask -->|No| DirectProcess[Process Directly]
|
||||
|
||||
AddTask --> CheckState
|
||||
DirectProcess --> YieldResult
|
||||
|
||||
UpdateStats --> CheckState
|
||||
UpdateRobotStats --> CheckState
|
||||
UpdateFailStats --> CheckState
|
||||
|
||||
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
|
||||
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
|
||||
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
|
||||
classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
|
||||
|
||||
class Start,End stats;
|
||||
class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
|
||||
class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
|
||||
class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
The BFS strategy crawls a website by:
|
||||
1. Starting from a root URL
|
||||
2. Processing all URLs at the current depth
|
||||
3. Moving to URLs at the next depth level
|
||||
4. Continuing until maximum depth is reached
|
||||
|
||||
This ensures systematic coverage of the website while maintaining control over the crawling process.
|
||||
|
||||
## Key Features
|
||||
|
||||
### 1. Smart URL Processing
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=my_filters,
|
||||
url_scorer=my_scorer,
|
||||
max_concurrent=5
|
||||
)
|
||||
```
|
||||
- Controls crawl depth
|
||||
- Filters unwanted URLs
|
||||
- Scores URLs for priority
|
||||
- Manages concurrent requests
|
||||
|
||||
### 2. Polite Crawling
|
||||
The strategy automatically implements web crawling best practices:
|
||||
- Respects robots.txt
|
||||
- Implements rate limiting
|
||||
- Adds politeness delays
|
||||
- Manages concurrent requests
|
||||
|
||||
### 3. Link Processing Control
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
...,
|
||||
process_external_links=False # Only process internal links
|
||||
)
|
||||
```
|
||||
- Control whether to follow external links
|
||||
- Default: internal links only
|
||||
- Enable external links when needed
|
||||
|
||||
## Configuration Options
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|-----------|-------------|---------|
|
||||
| max_depth | Maximum crawl depth | Required |
|
||||
| filter_chain | URL filtering rules | Required |
|
||||
| url_scorer | URL priority scoring | Required |
|
||||
| max_concurrent | Max parallel requests | 5 |
|
||||
| min_crawl_delay | Seconds between requests | 1 |
|
||||
| process_external_links | Follow external links | False |
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Set Appropriate Depth**
|
||||
- Start with smaller depths (2-3)
|
||||
- Increase based on needs
|
||||
- Consider site structure
|
||||
|
||||
2. **Configure Filters**
|
||||
- Use URL patterns
|
||||
- Filter by content type
|
||||
- Avoid unwanted sections
|
||||
|
||||
3. **Tune Performance**
|
||||
- Adjust max_concurrent
|
||||
- Set appropriate delays
|
||||
- Monitor resource usage
|
||||
|
||||
4. **Handle External Links**
|
||||
- Keep external_links=False for focused crawls
|
||||
- Enable only when needed
|
||||
- Consider additional filtering
|
||||
|
||||
## Example Usage
|
||||
|
||||
```python
|
||||
from crawl4ai.scraper import BFSScraperStrategy
|
||||
from crawl4ai.scraper.filters import FilterChain
|
||||
from crawl4ai.scraper.scorers import BasicURLScorer
|
||||
|
||||
# Configure strategy
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=3,
|
||||
filter_chain=FilterChain([
|
||||
URLPatternFilter("*.example.com/*"),
|
||||
ContentTypeFilter(["text/html"])
|
||||
]),
|
||||
url_scorer=BasicURLScorer(),
|
||||
max_concurrent=5,
|
||||
min_crawl_delay=1,
|
||||
process_external_links=False
|
||||
)
|
||||
|
||||
# Use with AsyncWebScraper
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
results = await scraper.ascrape("https://example.com")
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Site Mapping
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=5,
|
||||
filter_chain=site_filter,
|
||||
url_scorer=depth_scorer,
|
||||
process_external_links=False
|
||||
)
|
||||
```
|
||||
Perfect for creating complete site maps or understanding site structure.
|
||||
|
||||
### 2. Content Aggregation
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=content_filter,
|
||||
url_scorer=relevance_scorer,
|
||||
max_concurrent=3
|
||||
)
|
||||
```
|
||||
Ideal for collecting specific types of content (articles, products, etc.).
|
||||
|
||||
### 3. Link Analysis
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=1,
|
||||
filter_chain=link_filter,
|
||||
url_scorer=link_scorer,
|
||||
process_external_links=True
|
||||
)
|
||||
```
|
||||
Useful for analyzing both internal and external link structures.
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Progress Monitoring
|
||||
```python
|
||||
async for result in scraper.ascrape(url):
|
||||
print(f"Current depth: {strategy.stats.current_depth}")
|
||||
print(f"Processed URLs: {strategy.stats.urls_processed}")
|
||||
```
|
||||
|
||||
### Custom URL Scoring
|
||||
```python
|
||||
class CustomScorer(URLScorer):
|
||||
def score(self, url: str) -> float:
|
||||
# Lower scores = higher priority
|
||||
return score_based_on_criteria(url)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
1. **Slow Crawling**
|
||||
- Increase max_concurrent
|
||||
- Adjust min_crawl_delay
|
||||
- Check network conditions
|
||||
|
||||
2. **Missing Content**
|
||||
- Verify max_depth
|
||||
- Check filter settings
|
||||
- Review URL patterns
|
||||
|
||||
3. **High Resource Usage**
|
||||
- Reduce max_concurrent
|
||||
- Increase crawl delay
|
||||
- Add more specific filters
|
||||
|
||||
342
docs/scrapper/filters_scrorers.md
Normal file
342
docs/scrapper/filters_scrorers.md
Normal file
@@ -0,0 +1,342 @@
|
||||
# URL Filters and Scorers
|
||||
|
||||
The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([URL Input]) --> Chain[Filter Chain]
|
||||
|
||||
subgraph Chain Process
|
||||
Chain --> Pattern{URL Pattern\nFilter}
|
||||
Pattern -->|Match| Content{Content Type\nFilter}
|
||||
Pattern -->|No Match| Reject1[Reject URL]
|
||||
|
||||
Content -->|Allowed| Domain{Domain\nFilter}
|
||||
Content -->|Not Allowed| Reject2[Reject URL]
|
||||
|
||||
Domain -->|Allowed| Accept[Accept URL]
|
||||
Domain -->|Blocked| Reject3[Reject URL]
|
||||
end
|
||||
|
||||
subgraph Statistics
|
||||
Pattern --> UpdatePattern[Update Pattern Stats]
|
||||
Content --> UpdateContent[Update Content Stats]
|
||||
Domain --> UpdateDomain[Update Domain Stats]
|
||||
Accept --> UpdateChain[Update Chain Stats]
|
||||
Reject1 --> UpdateChain
|
||||
Reject2 --> UpdateChain
|
||||
Reject3 --> UpdateChain
|
||||
end
|
||||
|
||||
Accept --> End([End])
|
||||
Reject1 --> End
|
||||
Reject2 --> End
|
||||
Reject3 --> End
|
||||
|
||||
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
|
||||
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
|
||||
classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
|
||||
classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
|
||||
|
||||
class Start,End accept;
|
||||
class Pattern,Content,Domain decision;
|
||||
class Reject1,Reject2,Reject3 reject;
|
||||
class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
|
||||
```
|
||||
|
||||
## URL Filters
|
||||
|
||||
URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
|
||||
|
||||
### Available Filters
|
||||
|
||||
1. **URL Pattern Filter**
|
||||
```python
|
||||
pattern_filter = URLPatternFilter([
|
||||
"*.example.com/*", # Glob pattern
|
||||
"*/article/*", # Path pattern
|
||||
re.compile(r"blog-\d+") # Regex pattern
|
||||
])
|
||||
```
|
||||
- Supports glob patterns and regex
|
||||
- Multiple patterns per filter
|
||||
- Pattern pre-compilation for performance
|
||||
|
||||
2. **Content Type Filter**
|
||||
```python
|
||||
content_filter = ContentTypeFilter([
|
||||
"text/html",
|
||||
"application/pdf"
|
||||
], check_extension=True)
|
||||
```
|
||||
- Filter by MIME types
|
||||
- Extension checking
|
||||
- Support for multiple content types
|
||||
|
||||
3. **Domain Filter**
|
||||
```python
|
||||
domain_filter = DomainFilter(
|
||||
allowed_domains=["example.com", "blog.example.com"],
|
||||
blocked_domains=["ads.example.com"]
|
||||
)
|
||||
```
|
||||
- Allow/block specific domains
|
||||
- Subdomain support
|
||||
- Efficient domain matching
|
||||
|
||||
### Creating Filter Chains
|
||||
|
||||
```python
|
||||
# Create and configure a filter chain
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter(["*.example.com/*"]),
|
||||
ContentTypeFilter(["text/html"]),
|
||||
DomainFilter(blocked_domains=["ads.*"])
|
||||
])
|
||||
|
||||
# Add more filters
|
||||
filter_chain.add_filter(
|
||||
URLPatternFilter(["*/article/*"])
|
||||
)
|
||||
```
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
Start([URL Input]) --> Composite[Composite Scorer]
|
||||
|
||||
subgraph Scoring Process
|
||||
Composite --> Keywords[Keyword Relevance]
|
||||
Composite --> Path[Path Depth]
|
||||
Composite --> Content[Content Type]
|
||||
Composite --> Fresh[Freshness]
|
||||
Composite --> Domain[Domain Authority]
|
||||
|
||||
Keywords --> KeywordScore[Calculate Score]
|
||||
Path --> PathScore[Calculate Score]
|
||||
Content --> ContentScore[Calculate Score]
|
||||
Fresh --> FreshScore[Calculate Score]
|
||||
Domain --> DomainScore[Calculate Score]
|
||||
|
||||
KeywordScore --> Weight1[Apply Weight]
|
||||
PathScore --> Weight2[Apply Weight]
|
||||
ContentScore --> Weight3[Apply Weight]
|
||||
FreshScore --> Weight4[Apply Weight]
|
||||
DomainScore --> Weight5[Apply Weight]
|
||||
end
|
||||
|
||||
Weight1 --> Combine[Combine Scores]
|
||||
Weight2 --> Combine
|
||||
Weight3 --> Combine
|
||||
Weight4 --> Combine
|
||||
Weight5 --> Combine
|
||||
|
||||
Combine --> Normalize{Normalize?}
|
||||
Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
|
||||
Normalize -->|No| FinalScore[Final Score]
|
||||
NormalizeScore --> FinalScore
|
||||
|
||||
FinalScore --> Stats[Update Statistics]
|
||||
Stats --> End([End])
|
||||
|
||||
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
|
||||
classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
|
||||
classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
|
||||
classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
|
||||
|
||||
class Start,End calc;
|
||||
class Keywords,Path,Content,Fresh,Domain scorer;
|
||||
class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
|
||||
class Normalize decision;
|
||||
```
|
||||
|
||||
## URL Scorers
|
||||
|
||||
URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
|
||||
|
||||
### Available Scorers
|
||||
|
||||
1. **Keyword Relevance Scorer**
|
||||
```python
|
||||
keyword_scorer = KeywordRelevanceScorer(
|
||||
keywords=["python", "programming"],
|
||||
weight=1.0,
|
||||
case_sensitive=False
|
||||
)
|
||||
```
|
||||
- Score based on keyword matches
|
||||
- Case sensitivity options
|
||||
- Weighted scoring
|
||||
|
||||
2. **Path Depth Scorer**
|
||||
```python
|
||||
path_scorer = PathDepthScorer(
|
||||
optimal_depth=3, # Preferred URL depth
|
||||
weight=0.7
|
||||
)
|
||||
```
|
||||
- Score based on URL path depth
|
||||
- Configurable optimal depth
|
||||
- Diminishing returns for deeper paths
|
||||
|
||||
3. **Content Type Scorer**
|
||||
```python
|
||||
content_scorer = ContentTypeScorer({
|
||||
r'\.html$': 1.0,
|
||||
r'\.pdf$': 0.8,
|
||||
r'\.xml$': 0.6
|
||||
})
|
||||
```
|
||||
- Score based on file types
|
||||
- Configurable type weights
|
||||
- Pattern matching support
|
||||
|
||||
4. **Freshness Scorer**
|
||||
```python
|
||||
freshness_scorer = FreshnessScorer(weight=0.9)
|
||||
```
|
||||
- Score based on date indicators in URLs
|
||||
- Multiple date format support
|
||||
- Recency weighting
|
||||
|
||||
5. **Domain Authority Scorer**
|
||||
```python
|
||||
authority_scorer = DomainAuthorityScorer({
|
||||
"python.org": 1.0,
|
||||
"github.com": 0.9,
|
||||
"medium.com": 0.7
|
||||
})
|
||||
```
|
||||
- Score based on domain importance
|
||||
- Configurable domain weights
|
||||
- Default weight for unknown domains
|
||||
|
||||
### Combining Scorers
|
||||
|
||||
```python
|
||||
# Create a composite scorer
|
||||
composite_scorer = CompositeScorer([
|
||||
KeywordRelevanceScorer(["python"], weight=1.0),
|
||||
PathDepthScorer(optimal_depth=2, weight=0.7),
|
||||
FreshnessScorer(weight=0.8)
|
||||
], normalize=True)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Filter Configuration
|
||||
|
||||
1. **Start Restrictive**
|
||||
```python
|
||||
# Begin with strict filters
|
||||
filter_chain = FilterChain([
|
||||
DomainFilter(allowed_domains=["example.com"]),
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
```
|
||||
|
||||
2. **Layer Filters**
|
||||
```python
|
||||
# Add more specific filters
|
||||
filter_chain.add_filter(
|
||||
URLPatternFilter(["*/article/*", "*/blog/*"])
|
||||
)
|
||||
```
|
||||
|
||||
3. **Monitor Filter Statistics**
|
||||
```python
|
||||
# Check filter performance
|
||||
for filter in filter_chain.filters:
|
||||
print(f"{filter.name}: {filter.stats.rejected_urls} rejected")
|
||||
```
|
||||
|
||||
### Scorer Configuration
|
||||
|
||||
1. **Balance Weights**
|
||||
```python
|
||||
# Balanced scoring configuration
|
||||
scorer = create_balanced_scorer()
|
||||
```
|
||||
|
||||
2. **Customize for Content**
|
||||
```python
|
||||
# News site configuration
|
||||
news_scorer = CompositeScorer([
|
||||
KeywordRelevanceScorer(["news", "article"], weight=1.0),
|
||||
FreshnessScorer(weight=1.0),
|
||||
PathDepthScorer(optimal_depth=2, weight=0.5)
|
||||
])
|
||||
```
|
||||
|
||||
3. **Monitor Scoring Statistics**
|
||||
```python
|
||||
# Check scoring distribution
|
||||
print(f"Average score: {scorer.stats.average_score}")
|
||||
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Blog Crawling
|
||||
```python
|
||||
blog_config = {
|
||||
'filters': FilterChain([
|
||||
URLPatternFilter(["*/blog/*", "*/post/*"]),
|
||||
ContentTypeFilter(["text/html"])
|
||||
]),
|
||||
'scorer': CompositeScorer([
|
||||
FreshnessScorer(weight=1.0),
|
||||
KeywordRelevanceScorer(["blog", "article"], weight=0.8)
|
||||
])
|
||||
}
|
||||
```
|
||||
|
||||
### Documentation Sites
|
||||
```python
|
||||
docs_config = {
|
||||
'filters': FilterChain([
|
||||
URLPatternFilter(["*/docs/*", "*/guide/*"]),
|
||||
ContentTypeFilter(["text/html", "application/pdf"])
|
||||
]),
|
||||
'scorer': CompositeScorer([
|
||||
PathDepthScorer(optimal_depth=3, weight=1.0),
|
||||
KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
|
||||
])
|
||||
}
|
||||
```
|
||||
|
||||
### E-commerce Sites
|
||||
```python
|
||||
ecommerce_config = {
|
||||
'filters': FilterChain([
|
||||
URLPatternFilter(["*/product/*", "*/category/*"]),
|
||||
DomainFilter(blocked_domains=["ads.*", "tracker.*"])
|
||||
]),
|
||||
'scorer': CompositeScorer([
|
||||
PathDepthScorer(optimal_depth=2, weight=1.0),
|
||||
ContentTypeScorer({
|
||||
r'/product/': 1.0,
|
||||
r'/category/': 0.8
|
||||
})
|
||||
])
|
||||
}
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Custom Filters
|
||||
```python
|
||||
class CustomFilter(URLFilter):
|
||||
def apply(self, url: str) -> bool:
|
||||
# Your custom filtering logic
|
||||
return True
|
||||
```
|
||||
|
||||
### Custom Scorers
|
||||
```python
|
||||
class CustomScorer(URLScorer):
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
# Your custom scoring logic
|
||||
return 1.0
|
||||
```
|
||||
|
||||
For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).
|
||||
206
docs/scrapper/how_to_use.md
Normal file
206
docs/scrapper/how_to_use.md
Normal file
@@ -0,0 +1,206 @@
|
||||
# Scraper Examples Guide
|
||||
|
||||
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
|
||||
|
||||
## Basic Example
|
||||
|
||||
The basic example demonstrates a simple blog scraping scenario:
|
||||
|
||||
```python
|
||||
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
|
||||
|
||||
# Create simple filter chain
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter("*/blog/*"),
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Initialize strategy
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=None,
|
||||
max_concurrent=3
|
||||
)
|
||||
|
||||
# Create and run scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
result = await scraper.ascrape("https://example.com/blog/")
|
||||
```
|
||||
|
||||
### Features Demonstrated
|
||||
- Basic URL filtering
|
||||
- Simple content type filtering
|
||||
- Depth control
|
||||
- Concurrent request limiting
|
||||
- Result collection
|
||||
|
||||
## Advanced Example
|
||||
|
||||
The advanced example shows a sophisticated news site scraping setup with all features enabled:
|
||||
|
||||
```python
|
||||
# Create comprehensive filter chain
|
||||
filter_chain = FilterChain([
|
||||
DomainFilter(
|
||||
allowed_domains=["example.com"],
|
||||
blocked_domains=["ads.example.com"]
|
||||
),
|
||||
URLPatternFilter([
|
||||
"*/article/*",
|
||||
re.compile(r"\d{4}/\d{2}/.*")
|
||||
]),
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Create intelligent scorer
|
||||
scorer = CompositeScorer([
|
||||
KeywordRelevanceScorer(
|
||||
keywords=["news", "breaking"],
|
||||
weight=1.0
|
||||
),
|
||||
PathDepthScorer(optimal_depth=3, weight=0.7),
|
||||
FreshnessScorer(weight=0.9)
|
||||
])
|
||||
|
||||
# Initialize advanced strategy
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=4,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=scorer,
|
||||
max_concurrent=5
|
||||
)
|
||||
```
|
||||
|
||||
### Features Demonstrated
|
||||
1. **Advanced Filtering**
|
||||
- Domain filtering
|
||||
- Pattern matching
|
||||
- Content type control
|
||||
|
||||
2. **Intelligent Scoring**
|
||||
- Keyword relevance
|
||||
- Path optimization
|
||||
- Freshness priority
|
||||
|
||||
3. **Monitoring**
|
||||
- Progress tracking
|
||||
- Error handling
|
||||
- Statistics collection
|
||||
|
||||
4. **Resource Management**
|
||||
- Concurrent processing
|
||||
- Rate limiting
|
||||
- Cleanup handling
|
||||
|
||||
## Running the Examples
|
||||
|
||||
```bash
|
||||
# Basic usage
|
||||
python basic_scraper_example.py
|
||||
|
||||
# Advanced usage with logging
|
||||
PYTHONPATH=. python advanced_scraper_example.py
|
||||
```
|
||||
|
||||
## Example Output
|
||||
|
||||
### Basic Example
|
||||
```
|
||||
Crawled 15 pages:
|
||||
- https://example.com/blog/post1: 24560 bytes
|
||||
- https://example.com/blog/post2: 18920 bytes
|
||||
...
|
||||
```
|
||||
|
||||
### Advanced Example
|
||||
```
|
||||
INFO: Starting crawl of https://example.com/news/
|
||||
INFO: Processed: https://example.com/news/breaking/story1
|
||||
DEBUG: KeywordScorer: 0.85
|
||||
DEBUG: FreshnessScorer: 0.95
|
||||
INFO: Progress: 10 URLs processed
|
||||
...
|
||||
INFO: Scraping completed:
|
||||
INFO: - URLs processed: 50
|
||||
INFO: - Errors: 2
|
||||
INFO: - Total content size: 1240.50 KB
|
||||
```
|
||||
|
||||
## Customization
|
||||
|
||||
### Adding Custom Filters
|
||||
```python
|
||||
class CustomFilter(URLFilter):
|
||||
def apply(self, url: str) -> bool:
|
||||
# Your custom filtering logic
|
||||
return True
|
||||
|
||||
filter_chain.add_filter(CustomFilter())
|
||||
```
|
||||
|
||||
### Custom Scoring Logic
|
||||
```python
|
||||
class CustomScorer(URLScorer):
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
# Your custom scoring logic
|
||||
return 1.0
|
||||
|
||||
scorer = CompositeScorer([
|
||||
CustomScorer(weight=1.0),
|
||||
...
|
||||
])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start Simple**
|
||||
- Begin with basic filtering
|
||||
- Add features incrementally
|
||||
- Test thoroughly at each step
|
||||
|
||||
2. **Monitor Performance**
|
||||
- Watch memory usage
|
||||
- Track processing times
|
||||
- Adjust concurrency as needed
|
||||
|
||||
3. **Handle Errors**
|
||||
- Implement proper error handling
|
||||
- Log important events
|
||||
- Track error statistics
|
||||
|
||||
4. **Optimize Resources**
|
||||
- Set appropriate delays
|
||||
- Limit concurrent requests
|
||||
- Use streaming for large crawls
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issues and solutions:
|
||||
|
||||
1. **Too Many Requests**
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_concurrent=3, # Reduce concurrent requests
|
||||
min_crawl_delay=2 # Increase delay between requests
|
||||
)
|
||||
```
|
||||
|
||||
2. **Memory Issues**
|
||||
```python
|
||||
# Use streaming mode for large crawls
|
||||
async for result in scraper.ascrape(url, stream=True):
|
||||
process_result(result)
|
||||
```
|
||||
|
||||
3. **Missing Content**
|
||||
```python
|
||||
# Check your filter chain
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter("*"), # Broaden patterns
|
||||
ContentTypeFilter(["*"]) # Accept all content
|
||||
])
|
||||
```
|
||||
|
||||
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).
|
||||
184
docs/scrapper/scraper_quickstart.py
Normal file
184
docs/scrapper/scraper_quickstart.py
Normal file
@@ -0,0 +1,184 @@
|
||||
# basic_scraper_example.py
|
||||
from crawl4ai.scraper import (
|
||||
AsyncWebScraper,
|
||||
BFSScraperStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
ContentTypeFilter
|
||||
)
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
async def basic_scraper_example():
|
||||
"""
|
||||
Basic example: Scrape a blog site for articles
|
||||
- Crawls only HTML pages
|
||||
- Stays within the blog section
|
||||
- Collects all results at once
|
||||
"""
|
||||
# Create a simple filter chain
|
||||
filter_chain = FilterChain([
|
||||
# Only crawl pages within the blog section
|
||||
URLPatternFilter("*/blog/*"),
|
||||
# Only process HTML pages
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Initialize the strategy with basic configuration
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2, # Only go 2 levels deep
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=None, # Use default scoring
|
||||
max_concurrent=3 # Limit concurrent requests
|
||||
)
|
||||
|
||||
# Create the crawler and scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
|
||||
# Start scraping
|
||||
try:
|
||||
result = await scraper.ascrape("https://example.com/blog/")
|
||||
|
||||
# Process results
|
||||
print(f"Crawled {len(result.crawled_urls)} pages:")
|
||||
for url, data in result.extracted_data.items():
|
||||
print(f"- {url}: {len(data.html)} bytes")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error during scraping: {e}")
|
||||
|
||||
# advanced_scraper_example.py
|
||||
import logging
|
||||
from crawl4ai.scraper import (
|
||||
AsyncWebScraper,
|
||||
BFSScraperStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
ContentTypeFilter,
|
||||
DomainFilter,
|
||||
KeywordRelevanceScorer,
|
||||
PathDepthScorer,
|
||||
FreshnessScorer,
|
||||
CompositeScorer
|
||||
)
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
async def advanced_scraper_example():
|
||||
"""
|
||||
Advanced example: Intelligent news site scraping
|
||||
- Uses all filter types
|
||||
- Implements sophisticated scoring
|
||||
- Streams results
|
||||
- Includes monitoring and logging
|
||||
"""
|
||||
# Set up logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger("advanced_scraper")
|
||||
|
||||
# Create sophisticated filter chain
|
||||
filter_chain = FilterChain([
|
||||
# Domain control
|
||||
DomainFilter(
|
||||
allowed_domains=["example.com", "blog.example.com"],
|
||||
blocked_domains=["ads.example.com", "tracker.example.com"]
|
||||
),
|
||||
# URL patterns
|
||||
URLPatternFilter([
|
||||
"*/article/*",
|
||||
"*/news/*",
|
||||
"*/blog/*",
|
||||
re.compile(r"\d{4}/\d{2}/.*") # Date-based URLs
|
||||
]),
|
||||
# Content types
|
||||
ContentTypeFilter([
|
||||
"text/html",
|
||||
"application/xhtml+xml"
|
||||
])
|
||||
])
|
||||
|
||||
# Create composite scorer
|
||||
scorer = CompositeScorer([
|
||||
# Prioritize by keywords
|
||||
KeywordRelevanceScorer(
|
||||
keywords=["news", "breaking", "update", "latest"],
|
||||
weight=1.0
|
||||
),
|
||||
# Prefer optimal URL structure
|
||||
PathDepthScorer(
|
||||
optimal_depth=3,
|
||||
weight=0.7
|
||||
),
|
||||
# Prioritize fresh content
|
||||
FreshnessScorer(weight=0.9)
|
||||
])
|
||||
|
||||
# Initialize strategy with advanced configuration
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=4,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=scorer,
|
||||
max_concurrent=5,
|
||||
min_crawl_delay=1
|
||||
)
|
||||
|
||||
# Create crawler and scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
|
||||
# Track statistics
|
||||
stats = {
|
||||
'processed': 0,
|
||||
'errors': 0,
|
||||
'total_size': 0
|
||||
}
|
||||
|
||||
try:
|
||||
# Use streaming mode
|
||||
async for result in scraper.ascrape("https://example.com/news/", stream=True):
|
||||
stats['processed'] += 1
|
||||
|
||||
if result.success:
|
||||
stats['total_size'] += len(result.html)
|
||||
logger.info(f"Processed: {result.url}")
|
||||
|
||||
# Print scoring information
|
||||
for scorer_name, score in result.scores.items():
|
||||
logger.debug(f"{scorer_name}: {score:.2f}")
|
||||
else:
|
||||
stats['errors'] += 1
|
||||
logger.error(f"Failed to process {result.url}: {result.error_message}")
|
||||
|
||||
# Log progress regularly
|
||||
if stats['processed'] % 10 == 0:
|
||||
logger.info(f"Progress: {stats['processed']} URLs processed")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Scraping error: {e}")
|
||||
|
||||
finally:
|
||||
# Print final statistics
|
||||
logger.info("Scraping completed:")
|
||||
logger.info(f"- URLs processed: {stats['processed']}")
|
||||
logger.info(f"- Errors: {stats['errors']}")
|
||||
logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
|
||||
|
||||
# Print filter statistics
|
||||
for filter_ in filter_chain.filters:
|
||||
logger.info(f"{filter_.name} stats:")
|
||||
logger.info(f"- Passed: {filter_.stats.passed_urls}")
|
||||
logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
|
||||
|
||||
# Print scorer statistics
|
||||
logger.info("Scoring statistics:")
|
||||
logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
|
||||
logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
|
||||
# Run basic example
|
||||
print("Running basic scraper example...")
|
||||
asyncio.run(basic_scraper_example())
|
||||
|
||||
print("\nRunning advanced scraper example...")
|
||||
asyncio.run(advanced_scraper_example())
|
||||
184
tests/test_scraper.py
Normal file
184
tests/test_scraper.py
Normal file
@@ -0,0 +1,184 @@
|
||||
# basic_scraper_example.py
|
||||
from crawl4ai.scraper import (
|
||||
AsyncWebScraper,
|
||||
BFSScraperStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
ContentTypeFilter
|
||||
)
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
async def basic_scraper_example():
|
||||
"""
|
||||
Basic example: Scrape a blog site for articles
|
||||
- Crawls only HTML pages
|
||||
- Stays within the blog section
|
||||
- Collects all results at once
|
||||
"""
|
||||
# Create a simple filter chain
|
||||
filter_chain = FilterChain([
|
||||
# Only crawl pages within the blog section
|
||||
URLPatternFilter("*/blog/*"),
|
||||
# Only process HTML pages
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Initialize the strategy with basic configuration
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2, # Only go 2 levels deep
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=None, # Use default scoring
|
||||
max_concurrent=3 # Limit concurrent requests
|
||||
)
|
||||
|
||||
# Create the crawler and scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
|
||||
# Start scraping
|
||||
try:
|
||||
result = await scraper.ascrape("https://example.com/blog/")
|
||||
|
||||
# Process results
|
||||
print(f"Crawled {len(result.crawled_urls)} pages:")
|
||||
for url, data in result.extracted_data.items():
|
||||
print(f"- {url}: {len(data.html)} bytes")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error during scraping: {e}")
|
||||
|
||||
# advanced_scraper_example.py
|
||||
import logging
|
||||
from crawl4ai.scraper import (
|
||||
AsyncWebScraper,
|
||||
BFSScraperStrategy,
|
||||
FilterChain,
|
||||
URLPatternFilter,
|
||||
ContentTypeFilter,
|
||||
DomainFilter,
|
||||
KeywordRelevanceScorer,
|
||||
PathDepthScorer,
|
||||
FreshnessScorer,
|
||||
CompositeScorer
|
||||
)
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
async def advanced_scraper_example():
|
||||
"""
|
||||
Advanced example: Intelligent news site scraping
|
||||
- Uses all filter types
|
||||
- Implements sophisticated scoring
|
||||
- Streams results
|
||||
- Includes monitoring and logging
|
||||
"""
|
||||
# Set up logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger("advanced_scraper")
|
||||
|
||||
# Create sophisticated filter chain
|
||||
filter_chain = FilterChain([
|
||||
# Domain control
|
||||
DomainFilter(
|
||||
allowed_domains=["example.com", "blog.example.com"],
|
||||
blocked_domains=["ads.example.com", "tracker.example.com"]
|
||||
),
|
||||
# URL patterns
|
||||
URLPatternFilter([
|
||||
"*/article/*",
|
||||
"*/news/*",
|
||||
"*/blog/*",
|
||||
re.compile(r"\d{4}/\d{2}/.*") # Date-based URLs
|
||||
]),
|
||||
# Content types
|
||||
ContentTypeFilter([
|
||||
"text/html",
|
||||
"application/xhtml+xml"
|
||||
])
|
||||
])
|
||||
|
||||
# Create composite scorer
|
||||
scorer = CompositeScorer([
|
||||
# Prioritize by keywords
|
||||
KeywordRelevanceScorer(
|
||||
keywords=["news", "breaking", "update", "latest"],
|
||||
weight=1.0
|
||||
),
|
||||
# Prefer optimal URL structure
|
||||
PathDepthScorer(
|
||||
optimal_depth=3,
|
||||
weight=0.7
|
||||
),
|
||||
# Prioritize fresh content
|
||||
FreshnessScorer(weight=0.9)
|
||||
])
|
||||
|
||||
# Initialize strategy with advanced configuration
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=4,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=scorer,
|
||||
max_concurrent=5,
|
||||
min_crawl_delay=1
|
||||
)
|
||||
|
||||
# Create crawler and scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
|
||||
# Track statistics
|
||||
stats = {
|
||||
'processed': 0,
|
||||
'errors': 0,
|
||||
'total_size': 0
|
||||
}
|
||||
|
||||
try:
|
||||
# Use streaming mode
|
||||
async for result in scraper.ascrape("https://example.com/news/", stream=True):
|
||||
stats['processed'] += 1
|
||||
|
||||
if result.success:
|
||||
stats['total_size'] += len(result.html)
|
||||
logger.info(f"Processed: {result.url}")
|
||||
|
||||
# Print scoring information
|
||||
for scorer_name, score in result.scores.items():
|
||||
logger.debug(f"{scorer_name}: {score:.2f}")
|
||||
else:
|
||||
stats['errors'] += 1
|
||||
logger.error(f"Failed to process {result.url}: {result.error_message}")
|
||||
|
||||
# Log progress regularly
|
||||
if stats['processed'] % 10 == 0:
|
||||
logger.info(f"Progress: {stats['processed']} URLs processed")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Scraping error: {e}")
|
||||
|
||||
finally:
|
||||
# Print final statistics
|
||||
logger.info("Scraping completed:")
|
||||
logger.info(f"- URLs processed: {stats['processed']}")
|
||||
logger.info(f"- Errors: {stats['errors']}")
|
||||
logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
|
||||
|
||||
# Print filter statistics
|
||||
for filter_ in filter_chain.filters:
|
||||
logger.info(f"{filter_.name} stats:")
|
||||
logger.info(f"- Passed: {filter_.stats.passed_urls}")
|
||||
logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
|
||||
|
||||
# Print scorer statistics
|
||||
logger.info("Scoring statistics:")
|
||||
logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
|
||||
logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
|
||||
# Run basic example
|
||||
print("Running basic scraper example...")
|
||||
asyncio.run(basic_scraper_example())
|
||||
|
||||
print("\nRunning advanced scraper example...")
|
||||
asyncio.run(advanced_scraper_example())
|
||||
Reference in New Issue
Block a user