crawl4ai/crawl4ai/content_scraping_strategy.py
Nasrin 7cac008c10 Release/v0.7.6 (#1556)
* fix(docker-api): migrate to modern datetime library API

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* Fix examples in README.md

* feat(docker): add user-provided hooks support to Docker API

Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
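
A sketch of what a hooks-enabled request could look like, based on the fields described above; the payload shape (`urls`, `hooks`, per-hook `code`/`timeout`) is inferred from this description rather than copied from the published schema:

    # Hypothetical request against a local Crawl4AI Docker server (port per the 0.7.6 docs).
    import requests

    hook_code = '''
    async def before_goto(page, context, url, **kwargs):
        # Pre-navigation setup: e.g. attach an extra header before the page loads.
        await page.set_extra_http_headers({"X-Example": "1"})
    '''

    payload = {
        "urls": ["https://httpbin.org/html"],
        "hooks": {
            "before_goto": {"code": hook_code, "timeout": 30},  # timeout in seconds (1-120)
        },
    }

    print(requests.post("http://localhost:11235/crawl", json=payload).json())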

* fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253

  Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
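
For context, the underlying technique is the standard min-heap inversion: Python's priority queue pops the smallest item first, so pushing negated scores makes the highest-scoring URL come out first. A self-contained sketch of the idea (not the library's actual frontier code):

    import heapq

    # Higher score = more relevant; heapq is a min-heap, so push the negated score.
    frontier = []
    for url, score in [("https://a.example", 0.9), ("https://b.example", 0.2), ("https://c.example", 0.7)]:
        heapq.heappush(frontier, (-score, url))

    while frontier:
        neg_score, url = heapq.heappop(frontier)
        print(f"visit {url} (score={-neg_score})")  # a.example first, then c, then b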

* docs: Update URL seeding examples to use proper async context managers
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

* fix(crawler): Removed the incorrect reference in browser_config variable #1310

* docs: update Docker instructions to use the latest release tag

* fix(docker): Fix LLM API key handling for multi-provider support

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
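
From the library side this means no key is passed explicitly; a minimal sketch assuming the standard crawl4ai LLMConfig/LLMExtractionStrategy, where litellm resolves the key from the provider's own environment variable (provider string here is illustrative):

    # Sketch: no api_token passed anywhere; litellm looks up the provider-specific
    # environment variable for whichever provider is named.
    from crawl4ai import LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="gemini/gemini-1.5-flash"),  # illustrative provider
        instruction="Extract the page title and main topics as JSON.",
    )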

* docs: update adaptive crawler docs and cache defaults; remove deprecated examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

* fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332

* feat: Add comprehensive website to API example with frontend

This commit adds a complete web scraping API example that demonstrates how to extract structured data from any website and expose it like an API using the crawl4ai library, with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode

* fix(dependencies): add cssselect to project dependencies

Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).

Integration tested against the 0.7.4 Docker container. Reintroducing the cssselect package eliminated the errors seen in the logs and restored excluded_selector functionality.

Refs: #1405

* fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

* fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437

* feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.
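
The priority order above can be pictured as a simple lookup; an illustrative sketch of the tiering (not the actual helper added to utils.py):

    import os

    def resolve_temperature(request_value=None, provider_prefix="OPENAI", default=None):
        """Illustrative 4-tier lookup: request param > provider env > global env > default."""
        if request_value is not None:
            return float(request_value)                      # 1. explicit request parameter
        if (v := os.getenv(f"{provider_prefix}_TEMPERATURE")) is not None:
            return float(v)                                  # 2. provider env, e.g. OPENAI_TEMPERATURE
        if (v := os.getenv("LLM_TEMPERATURE")) is not None:
            return float(v)                                  # 3. global env
        return default                                       # 4. built-in default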

* feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for API internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.

* #1375: refactor(proxy): Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
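
A sketch of the new style (assuming ProxyConfig is importable from the package root, as in current releases):

    from crawl4ai import BrowserConfig, ProxyConfig

    proxy = ProxyConfig.from_string("http://user:pass@proxy.example.com:8080")
    browser_config = BrowserConfig(proxy_config=proxy)  # passing proxy="..." now emits a deprecation warning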

* Remove deprecated test for 'proxy' parameter in BrowserConfig and update .gitignore to include test_scripts directory.

* feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
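
A minimal sketch of enabling the flag (assuming it lives on CrawlerRunConfig, which is where the scraping strategy reads it from):

    import asyncio
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    async def main():
        config = CrawlerRunConfig(preserve_https_for_internal_links=True)
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com", config=config)
            # Internal links keep the https:// scheme even if the server redirected to http://
            print(result.links["internal"][:3])

    asyncio.run(main())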

* feat: update documentation for preserve_https_for_internal_links. ref #1410

* fix: drop Python 3.9 support and require Python >=3.10.
The library no longer supports Python 3.9, so all references to it have been removed.
The following changes were made:
- pyproject.toml: set requires-python to ">=3.10"; remove 3.9 classifier
- setup.py: set python_requires to ">=3.10"; remove 3.9 classifier
- docs: update Python version mentions
  - deploy/docker/c4ai-doc-context.md: options -> 3.10, 3.11, 3.12, 3.13

* issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class

* fix(auth): fixed Docker JWT authentication. ref #1442

* remove: delete unused yoyo snapshot subproject

* fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989

* Commit without API

* fix: update option labels in request builder for clarity

* fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291

  - Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)

* refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning

* feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481

* #1505 fix(api): update config handling to only set base config if not provided by user

* fix(docker-deployment): replace console.log with print for metadata extraction

* Release v0.7.5: The Update

- Updated version to 0.7.5
- Added comprehensive demo and release notes
- Updated documentation

* refactor(release): remove memory management section for cleaner documentation. ref #1443

* feat(docs): add brand book and page copy functionality

- Add comprehensive brand book with color system, typography, components
- Add page copy dropdown with markdown copy/view functionality
- Update mkdocs.yml with new assets and branding navigation
- Use terminal-style ASCII icons and condensed menu design

* Update gitignore add local scripts folder

* fix: remove this import as it causes python to treat "json" as a variable in the except block

* fix: always return a list, even if we catch an exception

* feat(marketplace): Add Crawl4AI marketplace with secure configuration

- Implement marketplace frontend and admin dashboard
- Add FastAPI backend with environment-based configuration
- Use .env file for secrets management
- Include data generation scripts
- Add proper CORS configuration
- Remove hardcoded password from admin login
- Update gitignore for security

* fix(marketplace): Update URLs to use /marketplace path and relative API endpoints

- Change API_BASE to relative '/api' for production
- Move marketplace to /marketplace instead of /marketplace/frontend
- Update MkDocs navigation
- Fix logo path in marketplace index

* fix(docs): hide copy menu on non-markdown pages

* feat(marketplace): add sponsor logo uploads

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

* feat(docs): add chatgpt quick link to page actions

* fix(marketplace): align admin api with backend endpoints

* fix(marketplace): isolate api under marketplace prefix

* fix(marketplace): resolve app detail page routing and styling issues

- Fixed JavaScript errors from missing HTML elements (install-code, usage-code, integration-code)
- Added missing CSS classes for tabs, overview layout, sidebar, and integration content
- Fixed tab navigation to display horizontally in single line
- Added proper padding to tab content sections (removed from container, added to content)
- Fixed tab selector from .nav-tab to .tab-btn to match HTML structure
- Added sidebar styling with stats grid and metadata display
- Improved responsive design with mobile-friendly tab scrolling
- Fixed code block positioning for copy buttons
- Removed margin from first headings to prevent extra spacing
- Added null checks for DOM elements in JavaScript to prevent errors

These changes resolve the routing issue where clicking on apps caused page redirects,
and fix the broken layout where CSS was not properly applied to the app detail page.

* fix(marketplace): prevent hero image overflow and secondary card stretching

- Fixed hero image to 200px height with min/max constraints
- Added object-fit: cover to hero-image img elements
- Changed secondary-featured align-items from stretch to flex-start
- Fixed secondary-card height to 118px (no flex: 1 stretching)
- Updated responsive grid layouts for wider screens
- Added flex: 1 to hero-content for better content distribution

These changes ensure a rigid, predictable layout that prevents:
1. Large images from pushing text content down
2. Single secondary cards from stretching to fill entire height

* feat: Add hooks utility for function-based hooks with Docker client integration. ref #1377

   Add hooks_to_string() utility function that converts Python function objects
   to string representations for the Docker API, enabling developers to write hooks
   as regular Python functions instead of strings.

   Core Changes:
   - New hooks_to_string() utility in crawl4ai/utils.py using inspect.getsource()
   - Docker client now accepts both function objects and strings for hooks
   - Automatic detection and conversion in Crawl4aiDockerClient._prepare_request()
   - New hooks and hooks_timeout parameters in client.crawl() method

   Documentation:
   - Docker client examples with function-based hooks (docs/examples/docker_client_hooks_example.py)
   - Updated main Docker deployment guide with comprehensive hooks section
   - Added unit tests for hooks utility (tests/docker/test_hooks_utility.py)
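
A sketch of the function-based style (client import path and hook signature follow the project's Docker examples; the hooks/hooks_timeout kwargs are the ones described above):

    import asyncio
    from crawl4ai.docker_client import Crawl4aiDockerClient

    async def before_goto(page, context, url, **kwargs):
        # Runs server-side just before navigation.
        await page.set_extra_http_headers({"X-Demo": "1"})

    async def main():
        async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
            results = await client.crawl(
                ["https://httpbin.org/html"],
                hooks={"before_goto": before_goto},  # converted with hooks_to_string() under the hood
                hooks_timeout=30,
            )
            print(results)

    asyncio.run(main())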

* fix(docs): clarify Docker Hooks System with function-based API in README

* docs: Add demonstration files for v0.7.5 release, showcasing the new Docker Hooks System and all other features.

* docs: Update 0.7.5 video walkthrough

* docs: add complete SDK reference documentation

Add comprehensive single-page SDK reference combining:
- Installation & Setup
- Quick Start
- Core API (AsyncWebCrawler, arun, arun_many, CrawlResult)
- Configuration (BrowserConfig, CrawlerConfig, Parameters)
- Crawling Patterns
- Content Processing (Markdown, Fit Markdown, Selection, Interaction, Link & Media)
- Extraction Strategies (LLM and No-LLM)
- Advanced Features (Session Management, Hooks & Auth)

Generated using scripts/generate_sdk_docs.py in ultra-dense mode
optimized for AI assistant consumption.

Stats: 23K words, 185 code blocks, 220KB

* feat: add AI assistant skill package for Crawl4AI

- Create comprehensive skill package for AI coding assistants
- Include complete SDK reference (23K words, v0.7.4)
- Add three extraction scripts (basic, batch, pipeline)
- Implement version tracking in skill and scripts
- Add prominent download section on homepage
- Place skill in docs/assets for web distribution

The skill enables AI assistants like Claude, Cursor, and Windsurf
to effectively use Crawl4AI with optimized workflows for markdown
generation and data extraction.

* fix: remove non-existent wiki link and clarify skill usage instructions

* fix: update Crawl4AI skill with corrected parameters and examples

- Fixed CrawlerConfig → CrawlerRunConfig throughout
- Fixed parameter names (timeout → page_timeout, store_html removed)
- Fixed schema format (selector → baseSelector)
- Corrected proxy configuration (in BrowserConfig, not CrawlerRunConfig)
- Fixed fit_markdown usage with content filters
- Added comprehensive references to docs/examples/ directory
- Created safe packaging script to avoid root directory pollution
- All scripts tested and verified working

* fix: thoroughly verify and fix all Crawl4AI skill examples

- Cross-checked every section against actual docs
- Fixed BM25ContentFilter parameters (user_query, bm25_threshold)
- Removed incorrect wait_for selector from basic example
- Added comprehensive test suite (4 test files)
- All examples now tested and verified working
- Tests validate: basic crawling, markdown generation, data extraction, advanced patterns
- Package size: 76.6 KB (includes tests for future validation)

* feat(ci): split release pipeline and add Docker caching

- Split release.yml into PyPI/GitHub release and Docker workflows
- Add GitHub Actions cache for Docker builds (10-15x faster rebuilds)
- Implement dual-trigger for docker-release.yml (auto + manual)
- Add comprehensive workflow documentation in .github/workflows/docs/
- Backup original workflow as release.yml.backup

* feat: add webhook notifications for crawl job completion

Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication
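
A sketch of a job submission using those fields (payload shape inferred from this description):

    import requests

    job = {
        "urls": ["https://example.com"],
        "webhook_config": {
            "webhook_url": "https://my-app.example.com/crawl-done",
            "webhook_data_in_payload": False,                     # fetch results separately
            "webhook_headers": {"Authorization": "Bearer <token>"},
        },
    }

    resp = requests.post("http://localhost:11235/crawl/job", json=job)
    print(resp.json())  # the response carries a task id to correlate with the later webhook call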

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook documentation to Docker README

Added comprehensive webhook section to README.md including:
- Overview of asynchronous job queue with webhooks
- Benefits and use cases
- Quick start examples
- Webhook authentication
- Global webhook configuration
- Job status polling alternative

Updated table of contents and summary to include webhook feature.
Maintains consistent tone and style with rest of README.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add webhook example for Docker deployment

Added docker_webhook_example.py demonstrating:
- Submitting crawl jobs with webhook configuration
- Flask-based webhook receiver implementation
- Three usage patterns:
  1. Webhook notification only (fetch data separately)
  2. Webhook with full data in payload
  3. Traditional polling approach for comparison

Includes comprehensive comments explaining:
- Webhook payload structure
- Authentication headers setup
- Error handling
- Production deployment tips

Example is fully functional and ready to run with Flask installed.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add webhook implementation validation tests

Added comprehensive test suite to validate webhook implementation:
- Module import verification
- WebhookDeliveryService initialization
- Pydantic model validation (WebhookConfig)
- Payload construction logic
- Exponential backoff calculation
- API integration checks

All tests pass (6/6), confirming implementation is correct.

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* test: add comprehensive webhook feature test script

Added end-to-end test script that automates webhook feature testing:

Script Features (test_webhook_feature.sh):
- Automatic branch switching and dependency installation
- Redis and server startup/shutdown management
- Webhook receiver implementation
- Integration test for webhook notifications
- Comprehensive cleanup and error handling
- Returns to original branch after completion

Test Flow:
1. Fetch and checkout webhook feature branch
2. Activate venv and install dependencies
3. Start Redis and Crawl4AI server
4. Submit crawl job with webhook config
5. Verify webhook delivery and payload
6. Clean up all processes and return to original branch

Documentation:
- WEBHOOK_TEST_README.md with usage instructions
- Troubleshooting guide
- Exit codes and safety features

Usage: ./tests/test_webhook_feature.sh

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: properly serialize Pydantic HttpUrl in webhook config

Use model_dump(mode='json') instead of deprecated dict() method to ensure
Pydantic special types (HttpUrl, UUID, etc.) are properly serialized to
JSON-compatible native Python types.

This fixes webhook delivery failures caused by HttpUrl objects remaining
as Pydantic types in the webhook_config dict, which caused JSON
serialization errors and httpx request failures.

Also update mcp requirement to >=1.18.0 for compatibility.
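
For illustration, the difference between the two serialization paths on a model like WebhookConfig (a minimal Pydantic v2 reproduction, not the project's actual schema file):

    from pydantic import BaseModel, HttpUrl

    class WebhookConfig(BaseModel):
        webhook_url: HttpUrl

    cfg = WebhookConfig(webhook_url="https://example.com/hook")
    print(type(cfg.model_dump()["webhook_url"]))             # pydantic Url object - breaks json.dumps/httpx
    print(type(cfg.model_dump(mode="json")["webhook_url"]))  # plain str - safe to send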

* feat: add webhook support for /llm/job endpoint

Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)

* fix: remove duplicate comma in webhook_config parameter

* fix: update Crawl4AI Docker container port from 11234 to 11235

* Release v0.7.6: The 0.7.6 Update

- Updated version to 0.7.6
- Added comprehensive demo and release notes
- Updated all documentation
- Updated the version in the Dockerfile to 0.7.6

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Nezar Ali <abu5sohaib@gmail.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: James T. Wood <jamesthomaswood@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: nafeqq-1306 <nafiquee@yahoo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Martin Sjöborg <martin.sjoborg@quartr.se>
Co-authored-by: Martin Sjöborg <martin@sjoborg.org>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-22 20:41:06 +08:00

901 lines
34 KiB
Python

import re
from itertools import chain
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional
from bs4 import BeautifulSoup
import asyncio
import requests
from .config import (
    MIN_WORD_THRESHOLD,
    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
    IMAGE_SCORE_THRESHOLD,
    ONLY_TEXT_ELIGIBLE_TAGS,
    IMPORTANT_ATTRS,
    SOCIAL_MEDIA_DOMAINS,
)
from bs4 import NavigableString, Comment
from bs4 import PageElement, Tag
from urllib.parse import urljoin
from requests.exceptions import InvalidSchema
from .utils import (
    extract_metadata,
    normalize_url,
    is_external_url,
    get_base_domain,
    extract_metadata_using_lxml,
    extract_page_context,
    calculate_link_intrinsic_score,
)
from lxml import etree
from lxml import html as lhtml
from typing import List
from .models import ScrapingResult, MediaItem, Link, Media, Links
import copy
# Pre-compile regular expressions for Open Graph and Twitter metadata
OG_REGEX = re.compile(r"^og:")
TWITTER_REGEX = re.compile(r"^twitter:")
DIMENSION_REGEX = re.compile(r"(\d+)(\D*)")
# Function to parse srcset
def parse_srcset(s: str) -> List[Dict]:
    if not s:
        return []
    variants = []
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue
        parts = part.split()
        if len(parts) >= 1:
            url = parts[0]
            width = (
                parts[1].rstrip("w").split('.')[0]
                if len(parts) > 1 and parts[1].endswith("w")
                else None
            )
            variants.append({"url": url, "width": width})
    return variants
# Function to parse image height/width value and units
def parse_dimension(dimension):
    if dimension:
        # match = re.match(r"(\d+)(\D*)", dimension)
        match = DIMENSION_REGEX.match(dimension)
        if match:
            number = int(match.group(1))
            unit = match.group(2) or "px"  # Default unit is 'px' if not specified
            return number, unit
    return None, None
# Fetch image file metadata to extract size and extension
def fetch_image_file_size(img, base_url):
    # If src is relative path construct full URL, if not it may be CDN URL
    img_url = urljoin(base_url, img.get("src"))
    try:
        response = requests.head(img_url)
        if response.status_code == 200:
            return response.headers.get("Content-Length", None)
        else:
            print(f"Failed to retrieve file size for {img_url}")
            return None
    except InvalidSchema:
        return None
    finally:
        return

class ContentScrapingStrategy(ABC):
    @abstractmethod
    def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        pass

    @abstractmethod
    async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        pass

class LXMLWebScrapingStrategy(ContentScrapingStrategy):
    """
    LXML-based implementation for fast web content scraping.

    This is the primary scraping strategy in Crawl4AI, providing high-performance
    HTML parsing and content extraction using the lxml library.

    Note: WebScrapingStrategy is now an alias for this class to maintain
    backward compatibility.
    """

    def __init__(self, logger=None):
        self.logger = logger
        self.DIMENSION_REGEX = re.compile(r"(\d+)(\D*)")
        self.BASE64_PATTERN = re.compile(r'data:image/[^;]+;base64,([^"]+)')

    def _log(self, level, message, tag="SCRAPE", **kwargs):
        """Helper method to safely use logger."""
        if self.logger:
            log_method = getattr(self.logger, level)
            log_method(message=message, tag=tag, **kwargs)

    def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        """
        Main entry point for content scraping.

        Args:
            url (str): The URL of the page to scrape.
            html (str): The HTML content of the page.
            **kwargs: Additional keyword arguments.

        Returns:
            ScrapingResult: A structured result containing the scraped content.
        """
        actual_url = kwargs.get("redirected_url", url)
        raw_result = self._scrap(actual_url, html, **kwargs)
        if raw_result is None:
            return ScrapingResult(
                cleaned_html="",
                success=False,
                media=Media(),
                links=Links(),
                metadata={},
            )

        # Convert media items
        media = Media(
            images=[
                MediaItem(**img)
                for img in raw_result.get("media", {}).get("images", [])
                if img
            ],
            videos=[
                MediaItem(**vid)
                for vid in raw_result.get("media", {}).get("videos", [])
                if vid
            ],
            audios=[
                MediaItem(**aud)
                for aud in raw_result.get("media", {}).get("audios", [])
                if aud
            ],
            tables=raw_result.get("media", {}).get("tables", [])
        )

        # Convert links
        links = Links(
            internal=[
                Link(**link)
                for link in raw_result.get("links", {}).get("internal", [])
                if link
            ],
            external=[
                Link(**link)
                for link in raw_result.get("links", {}).get("external", [])
                if link
            ],
        )

        return ScrapingResult(
            cleaned_html=raw_result.get("cleaned_html", ""),
            success=raw_result.get("success", False),
            media=media,
            links=links,
            metadata=raw_result.get("metadata", {}),
        )

    async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        """
        Main entry point for asynchronous content scraping.

        Args:
            url (str): The URL of the page to scrape.
            html (str): The HTML content of the page.
            **kwargs: Additional keyword arguments.

        Returns:
            ScrapingResult: A structured result containing the scraped content.
        """
        return await asyncio.to_thread(self.scrap, url, html, **kwargs)

    def process_element(self, url, element: lhtml.HtmlElement, **kwargs) -> Dict[str, Any]:
        """
        Process an HTML element.

        How it works:
        1. Check if the element is an image, video, or audio.
        2. Extract the element's attributes and content.
        3. Process the element based on its type.
        4. Return the processed element information.

        Args:
            url (str): The URL of the page containing the element.
            element (lhtml.HtmlElement): The HTML element to process.
            **kwargs: Additional keyword arguments.

        Returns:
            dict: A dictionary containing the processed element information.
        """
        media = {"images": [], "videos": [], "audios": [], "tables": []}
        internal_links_dict = {}
        external_links_dict = {}
        self._process_element(
            url, element, media, internal_links_dict, external_links_dict, **kwargs
        )
        return {
            "media": media,
            "internal_links_dict": internal_links_dict,
            "external_links_dict": external_links_dict,
        }

    def _process_element(
        self,
        url: str,
        element: lhtml.HtmlElement,
        media: Dict[str, List],
        internal_links_dict: Dict[str, Any],
        external_links_dict: Dict[str, Any],
        page_context: dict = None,
        **kwargs,
    ) -> bool:
        base_domain = kwargs.get("base_domain", get_base_domain(url))
        exclude_domains = set(kwargs.get("exclude_domains", []))

        # Process links
        try:
            base_element = element.xpath("//head/base[@href]")
            if base_element:
                base_href = base_element[0].get("href", "").strip()
                if base_href:
                    url = base_href
        except Exception as e:
            self._log("error", f"Error extracting base URL: {str(e)}", "SCRAPE")
            pass

        for link in element.xpath(".//a[@href]"):
            href = link.get("href", "").strip()
            if not href:
                continue
            try:
                normalized_href = normalize_url(
                    href, url,
                    preserve_https=kwargs.get('preserve_https_for_internal_links', False),
                    original_scheme=kwargs.get('original_scheme')
                )
                link_data = {
                    "href": normalized_href,
                    "text": link.text_content().strip(),
                    "title": link.get("title", "").strip(),
                    "base_domain": base_domain,
                }

                # Add intrinsic scoring if enabled
                if kwargs.get("score_links", False) and page_context is not None:
                    try:
                        intrinsic_score = calculate_link_intrinsic_score(
                            link_text=link_data["text"],
                            url=normalized_href,
                            title_attr=link_data["title"],
                            class_attr=link.get("class", ""),
                            rel_attr=link.get("rel", ""),
                            page_context=page_context
                        )
                        link_data["intrinsic_score"] = intrinsic_score
                    except Exception:
                        # Fail gracefully - assign default score
                        link_data["intrinsic_score"] = 0
                else:
                    # Scoring disabled - assign a neutral default score (all links get equal priority)
                    link_data["intrinsic_score"] = 0

                is_external = is_external_url(normalized_href, base_domain)
                if is_external:
                    link_base_domain = get_base_domain(normalized_href)
                    link_data["base_domain"] = link_base_domain
                    if (
                        kwargs.get("exclude_external_links", False)
                        or link_base_domain in exclude_domains
                    ):
                        link.getparent().remove(link)
                        continue
                    if normalized_href not in external_links_dict:
                        external_links_dict[normalized_href] = link_data
                else:
                    if normalized_href not in internal_links_dict:
                        internal_links_dict[normalized_href] = link_data
            except Exception as e:
                self._log("error", f"Error processing link: {str(e)}", "SCRAPE")
                continue

        # Process images
        images = element.xpath(".//img")
        total_images = len(images)
        for idx, img in enumerate(images):
            src = img.get("src") or ""
            img_domain = get_base_domain(src)

            # Decide if we need to exclude this image
            # 1) If its domain is in exclude_domains, remove.
            # 2) Or if exclude_external_images=True and it's an external domain, remove.
            if (img_domain in exclude_domains) or (
                kwargs.get("exclude_external_images", False)
                and is_external_url(src, base_domain)
            ):
                parent = img.getparent()
                if parent is not None:
                    parent.remove(img)
                continue

            # Otherwise, process the image as usual.
            try:
                processed_images = self.process_image(
                    img, url, idx, total_images, **kwargs
                )
                if processed_images:
                    media["images"].extend(processed_images)
            except Exception as e:
                self._log("error", f"Error processing image: {str(e)}", "SCRAPE")

        # Process videos and audios
        for media_type in ["video", "audio"]:
            for elem in element.xpath(f".//{media_type}"):
                media_info = {
                    "src": elem.get("src"),
                    "alt": elem.get("alt"),
                    "type": media_type,
                    "description": self.find_closest_parent_with_useful_text(
                        elem, **kwargs
                    ),
                }
                media[f"{media_type}s"].append(media_info)
                # Process source tags within media elements
                for source in elem.xpath(".//source"):
                    if src := source.get("src"):
                        media[f"{media_type}s"].append({**media_info, "src": src})

        # Clean up unwanted elements
        if kwargs.get("remove_forms", False):
            for form in element.xpath(".//form"):
                form.getparent().remove(form)

        if excluded_tags := kwargs.get("excluded_tags", []):
            for tag in excluded_tags:
                for elem in element.xpath(f".//{tag}"):
                    elem.getparent().remove(elem)

        if excluded_selector := kwargs.get("excluded_selector", ""):
            try:
                for elem in element.cssselect(excluded_selector):
                    elem.getparent().remove(elem)
            except Exception:
                pass  # Invalid selector

        return True

    def find_closest_parent_with_useful_text(
        self, element: lhtml.HtmlElement, **kwargs
    ) -> Optional[str]:
        image_description_min_word_threshold = kwargs.get(
            "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
        )
        current = element
        while current is not None:
            if (
                current.text
                and len(current.text_content().split())
                >= image_description_min_word_threshold
            ):
                return current.text_content().strip()
            current = current.getparent()
        return None

    def flatten_nested_elements(self, element: lhtml.HtmlElement) -> lhtml.HtmlElement:
        """Flatten nested elements of the same type in LXML tree"""
        if len(element) == 1 and element.tag == element[0].tag:
            return self.flatten_nested_elements(element[0])
        for child in element:
            child_idx = element.index(child)
            flattened_child = self.flatten_nested_elements(child)
            if flattened_child is not child:  # Only replace if actually flattened
                element[child_idx] = flattened_child
        return element

    def process_image(
        self, img: lhtml.HtmlElement, url: str, index: int, total_images: int, **kwargs
    ) -> Optional[List[Dict]]:
        # Quick validation checks
        style = img.get("style", "")
        alt = img.get("alt", "")
        src = img.get("src", "")
        data_src = img.get("data-src", "")
        srcset = img.get("srcset", "")
        data_srcset = img.get("data-srcset", "")

        if "display:none" in style:
            return None

        parent = img.getparent()
        if parent.tag in ["button", "input"]:
            return None

        parent_classes = parent.get("class", "").split()
        if any(
            "button" in cls or "icon" in cls or "logo" in cls for cls in parent_classes
        ):
            return None
        # If "button", "icon", or "logo" appears in src or alt, the image is likely an icon
        if (src and any(c in src for c in ["button", "icon", "logo"])) or (
            alt and any(c in alt for c in ["button", "icon", "logo"])
        ):
            return None

        # Score calculation
        score = 0
        if (width := img.get("width")) and width.isdigit():
            score += 1 if int(width) > 150 else 0
        if (height := img.get("height")) and height.isdigit():
            score += 1 if int(height) > 150 else 0
        if alt:
            score += 1
        score += index / total_images < 0.5

        # Check formats in all possible sources
        image_formats = {"jpg", "jpeg", "png", "webp", "avif", "gif"}
        detected_format = None
        for url in [src, data_src, srcset, data_srcset]:
            if url:
                format_matches = [fmt for fmt in image_formats if fmt in url.lower()]
                if format_matches:
                    detected_format = format_matches[0]
                    score += 1
                    break

        if srcset or data_srcset:
            score += 1
        if picture := img.xpath("./ancestor::picture[1]"):
            score += 1

        if score <= kwargs.get("image_score_threshold", IMAGE_SCORE_THRESHOLD):
            return None

        # Process image variants
        unique_urls = set()
        image_variants = []
        base_info = {
            "alt": alt,
            "desc": self.find_closest_parent_with_useful_text(img, **kwargs),
            "score": score,
            "type": "image",
            "group_id": index,
            "format": detected_format,
        }

        def add_variant(src: str, width: Optional[str] = None):
            if src and not src.startswith("data:") and src not in unique_urls:
                unique_urls.add(src)
                variant = {**base_info, "src": src}
                if width:
                    variant["width"] = width
                image_variants.append(variant)

        # Add variants from different sources
        add_variant(src)
        add_variant(data_src)
        for srcset_attr in [srcset, data_srcset]:
            if srcset_attr:
                for source in parse_srcset(srcset_attr):
                    add_variant(source["url"], source["width"])

        # Handle picture element
        if picture:
            for source in picture[0].xpath(".//source[@srcset]"):
                if source_srcset := source.get("srcset"):
                    for src_data in parse_srcset(source_srcset):
                        add_variant(src_data["url"], src_data["width"])

        # Check framework-specific attributes
        for attr, value in img.attrib.items():
            if (
                attr.startswith("data-")
                and ("src" in attr or "srcset" in attr)
                and "http" in value
            ):
                add_variant(value)

        return image_variants if image_variants else None

    def remove_empty_elements_fast(self, root, word_count_threshold=5):
        """
        Remove elements that fall below the desired word threshold in a single pass from the bottom up.
        Skips non-element nodes like HtmlComment and bypasses certain tags that are allowed to have no content.
        """
        bypass_tags = {
            "a",
            "img",
            "br",
            "hr",
            "input",
            "meta",
            "link",
            "source",
            "track",
            "wbr",
            "tr",
            "td",
            "th",
        }
        for el in reversed(list(root.iterdescendants())):
            if not isinstance(el, lhtml.HtmlElement):
                continue
            if el.tag in bypass_tags:
                continue
            text_content = (el.text_content() or "").strip()
            if (
                len(text_content.split()) < word_count_threshold
                and not el.getchildren()
            ):
                parent = el.getparent()
                if parent is not None:
                    parent.remove(el)
        return root

    def remove_unwanted_attributes_fast(
        self, root: lhtml.HtmlElement, important_attrs=None, keep_data_attributes=False
    ) -> lhtml.HtmlElement:
        """
        Removes all attributes from each element (including root) except those in `important_attrs`.
        If `keep_data_attributes=True`, also retain any attribute starting with 'data-'.
        Returns the same root element, mutated in-place, for fluent usage.
        """
        if important_attrs is None:
            important_attrs = set(IMPORTANT_ATTRS)

        # If you want to handle the root as well, use 'include_self=True'
        # so you don't miss attributes on the top-level element.
        # Manually include the root, then all its descendants
        for el in chain((root,), root.iterdescendants()):
            # We only remove attributes on HtmlElement nodes, skip comments or text nodes
            if not isinstance(el, lhtml.HtmlElement):
                continue
            old_attribs = dict(el.attrib)
            new_attribs = {}
            for attr_name, attr_val in old_attribs.items():
                # If it's an important attribute, keep it
                if attr_name in important_attrs:
                    new_attribs[attr_name] = attr_val
                # Or if keep_data_attributes is True and it's a 'data-*' attribute
                elif keep_data_attributes and attr_name.startswith("data-"):
                    new_attribs[attr_name] = attr_val
            # Clear old attributes and set the filtered set
            el.attrib.clear()
            el.attrib.update(new_attribs)
        return root

    def _scrap(
        self,
        url: str,
        html: str,
        word_count_threshold: int = MIN_WORD_THRESHOLD,
        css_selector: str = None,
        target_elements: List[str] = None,
        **kwargs,
    ) -> Dict[str, Any]:
        if not html:
            return None
        success = True
        try:
            doc = lhtml.document_fromstring(html)
            # Match BeautifulSoup's behavior of using body or full doc
            # body = doc.xpath('//body')[0] if doc.xpath('//body') else doc
            body = doc
            base_domain = get_base_domain(url)

            # Extract page context for link scoring (if enabled) - do this BEFORE any removals
            page_context = None
            if kwargs.get("score_links", False):
                try:
                    # Extract title
                    title_elements = doc.xpath('//title')
                    page_title = title_elements[0].text_content() if title_elements else ""
                    # Extract headlines
                    headlines = []
                    for tag in ['h1', 'h2', 'h3']:
                        elements = doc.xpath(f'//{tag}')
                        for el in elements:
                            text = el.text_content().strip()
                            if text:
                                headlines.append(text)
                    headlines_text = ' '.join(headlines)
                    # Extract meta description
                    meta_desc_elements = doc.xpath('//meta[@name="description"]/@content')
                    meta_description = meta_desc_elements[0] if meta_desc_elements else ""
                    # Create page context
                    page_context = extract_page_context(page_title, headlines_text, meta_description, url)
                except Exception:
                    page_context = {}  # Fail gracefully

            # Early removal of all images if exclude_all_images is set
            # This is more efficient in lxml as we remove elements before any processing
            if kwargs.get("exclude_all_images", False):
                for img in body.xpath('//img'):
                    if img.getparent() is not None:
                        img.getparent().remove(img)

            # Add comment removal
            if kwargs.get("remove_comments", False):
                comments = body.xpath("//comment()")
                for comment in comments:
                    comment.getparent().remove(comment)

            # Handle tag-based removal first
            excluded_tags = set(kwargs.get("excluded_tags", []) or [])
            if excluded_tags:
                for tag in excluded_tags:
                    for element in body.xpath(f".//{tag}"):
                        if element.getparent() is not None:
                            element.getparent().remove(element)

            # Handle CSS selector-based exclusion
            excluded_selector = kwargs.get("excluded_selector", "")
            if excluded_selector:
                try:
                    for element in body.cssselect(excluded_selector):
                        if element.getparent() is not None:
                            element.getparent().remove(element)
                except Exception as e:
                    self._log(
                        "error", f"Error with excluded CSS selector: {str(e)}", "SCRAPE"
                    )

            # Extract metadata before any content filtering
            try:
                meta = extract_metadata_using_lxml(
                    "", doc
                )  # Using same function as BeautifulSoup version
            except Exception as e:
                self._log("error", f"Error extracting metadata: {str(e)}", "SCRAPE")
                meta = {}

            content_element = None
            if target_elements:
                try:
                    for_content_targeted_element = []
                    for target_element in target_elements:
                        for_content_targeted_element.extend(body.cssselect(target_element))
                    content_element = lhtml.Element("div")
                    content_element.extend(copy.deepcopy(for_content_targeted_element))
                except Exception as e:
                    self._log("error", f"Error with target element detection: {str(e)}", "SCRAPE")
                    return None
            else:
                content_element = body

            # Remove script and style tags
            for tag in ["script", "style", "link", "meta", "noscript"]:
                for element in body.xpath(f".//{tag}"):
                    if element.getparent() is not None:
                        element.getparent().remove(element)

            # Handle social media and domain exclusions
            kwargs["exclude_domains"] = set(kwargs.get("exclude_domains", []))
            if kwargs.get("exclude_social_media_links", False):
                kwargs["exclude_social_media_domains"] = set(
                    kwargs.get("exclude_social_media_domains", [])
                    + SOCIAL_MEDIA_DOMAINS
                )
                kwargs["exclude_domains"].update(kwargs["exclude_social_media_domains"])

            # Process forms if needed
            if kwargs.get("remove_forms", False):
                for form in body.xpath(".//form"):
                    if form.getparent() is not None:
                        form.getparent().remove(form)

            # Process content
            media = {"images": [], "videos": [], "audios": [], "tables": []}
            internal_links_dict = {}
            external_links_dict = {}
            self._process_element(
                url,
                body,
                media,
                internal_links_dict,
                external_links_dict,
                page_context=page_context,
                base_domain=base_domain,
                **kwargs,
            )

            # Extract tables using the table extraction strategy if provided
            if 'table' not in excluded_tags:
                table_extraction = kwargs.get('table_extraction')
                if table_extraction:
                    # Pass logger to the strategy if it doesn't have one
                    if not table_extraction.logger:
                        table_extraction.logger = self.logger
                    # Extract tables using the strategy
                    extracted_tables = table_extraction.extract_tables(body, **kwargs)
                    media["tables"].extend(extracted_tables)

            # Handle only_text option
            if kwargs.get("only_text", False):
                for tag in ONLY_TEXT_ELIGIBLE_TAGS:
                    for element in body.xpath(f".//{tag}"):
                        if element.text:
                            new_text = lhtml.Element("span")
                            new_text.text = element.text_content()
                            if element.getparent() is not None:
                                element.getparent().replace(element, new_text)

            # Clean base64 images
            for img in body.xpath(".//img[@src]"):
                src = img.get("src", "")
                if self.BASE64_PATTERN.match(src):
                    img.set("src", self.BASE64_PATTERN.sub("", src))

            # Remove empty elements
            self.remove_empty_elements_fast(body, 1)

            # Remove unneeded attributes
            self.remove_unwanted_attributes_fast(
                body, keep_data_attributes=kwargs.get("keep_data_attributes", False)
            )

            # Generate output HTML
            cleaned_html = lhtml.tostring(
                # body,
                content_element,
                encoding="unicode",
                pretty_print=True,
                method="html",
                with_tail=False,
            ).strip()

            # Create links dictionary in the format expected by LinkPreview
            links = {
                "internal": list(internal_links_dict.values()),
                "external": list(external_links_dict.values()),
            }

            # Extract head content for links if configured
            link_preview_config = kwargs.get("link_preview_config")
            if link_preview_config is not None:
                try:
                    import asyncio
                    from .link_preview import LinkPreview
                    from .models import Links, Link

                    verbose = link_preview_config.verbose
                    if verbose:
                        self._log("info", "Starting link head extraction for {internal} internal and {external} external links",
                                  params={"internal": len(links["internal"]), "external": len(links["external"])}, tag="LINK_EXTRACT")

                    # Convert dict links to Link objects
                    internal_links = [Link(**link_data) for link_data in links["internal"]]
                    external_links = [Link(**link_data) for link_data in links["external"]]
                    links_obj = Links(internal=internal_links, external=external_links)

                    # Create a config object for LinkPreview
                    class TempCrawlerRunConfig:
                        def __init__(self, link_config, score_links):
                            self.link_preview_config = link_config
                            self.score_links = score_links

                    config = TempCrawlerRunConfig(link_preview_config, kwargs.get("score_links", False))

                    # Extract head content (run async operation in sync context)
                    async def extract_links():
                        async with LinkPreview(self.logger) as extractor:
                            return await extractor.extract_link_heads(links_obj, config)

                    # Run the async operation
                    try:
                        # Check if we're already in an async context
                        loop = asyncio.get_running_loop()
                        # If we're in an async context, we need to run in a thread
                        import concurrent.futures
                        with concurrent.futures.ThreadPoolExecutor() as executor:
                            future = executor.submit(asyncio.run, extract_links())
                            updated_links = future.result()
                    except RuntimeError:
                        # No running loop, we can use asyncio.run directly
                        updated_links = asyncio.run(extract_links())

                    # Convert back to dict format
                    links["internal"] = [link.dict() for link in updated_links.internal]
                    links["external"] = [link.dict() for link in updated_links.external]

                    if verbose:
                        successful_internal = len([l for l in updated_links.internal if l.head_extraction_status == "valid"])
                        successful_external = len([l for l in updated_links.external if l.head_extraction_status == "valid"])
                        self._log("info", "Link head extraction completed: {internal_success}/{internal_total} internal, {external_success}/{external_total} external",
                                  params={
                                      "internal_success": successful_internal,
                                      "internal_total": len(updated_links.internal),
                                      "external_success": successful_external,
                                      "external_total": len(updated_links.external)
                                  }, tag="LINK_EXTRACT")
                    else:
                        self._log("info", "Link head extraction completed successfully", tag="LINK_EXTRACT")
                except Exception as e:
                    self._log("error", f"Error during link head extraction: {str(e)}", tag="LINK_EXTRACT")
                    # Continue with original links if head extraction fails

            return {
                "cleaned_html": cleaned_html,
                "success": success,
                "media": media,
                "links": links,
                "metadata": meta,
            }
        except Exception as e:
            self._log("error", f"Error processing HTML: {str(e)}", "SCRAPE")
            # Create error message in case of failure
            error_body = lhtml.Element("div")
            # Use etree.SubElement rather than lhtml.SubElement
            error_div = etree.SubElement(error_body, "div", id="crawl4ai_error_message")
            error_div.text = f"""
            Crawl4AI Error: This page is not fully supported.

            Error Message: {str(e)}

            Possible reasons:
            1. The page may have restrictions that prevent crawling.
            2. The page might not be fully loaded.

            Suggestions:
            - Try calling the crawl function with these parameters:
            magic=True,
            - Set headless=False to visualize what's happening on the page.

            If the issue persists, please check the page's structure and any potential anti-crawling measures.
            """
            cleaned_html = lhtml.tostring(
                error_body, encoding="unicode", pretty_print=True
            )
            return {
                "cleaned_html": cleaned_html,
                "success": False,
                "media": {
                    "images": [],
                    "videos": [],
                    "audios": [],
                    "tables": []
                },
                "links": {"internal": [], "external": []},
                "metadata": {},
            }


# Backward compatibility alias
WebScrapingStrategy = LXMLWebScrapingStrategy
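
For orientation, a minimal way to exercise the strategy above directly; inside the crawler it is invoked automatically with the fetched HTML, so standalone use like this is only for inspection:

    from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

    html = (
        "<html><head><title>Demo</title></head>"
        "<body><p>Hello world, a short demo page with enough words to survive pruning.</p>"
        "<a href='/about'>About us</a></body></html>"
    )
    result = LXMLWebScrapingStrategy().scrap("https://example.com", html)
    print(result.success, [link.href for link in result.links.internal])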