Compare commits
39 Commits
implement-
...
fix/relati
| Author | SHA1 | Date | |
|---|---|---|---|
| | 813b1f5534 | | |
| | 0482c1eafc | | |
| | 1eacea1d2d | | |
| | bc6d8147d2 | | |
| | 487839640f | | |
| | 6772134a3a | | |
| | ae67d66b81 | | |
| | af28e84a21 | | |
| | 5e7fcb17e1 | | |
| | 6e728096fa | | |
| | 2de200c1ba | | |
| | 9749e2832d | | |
| | 70f473b84d | | |
| | bdacf61ca9 | | |
| | f566c5a376 | | |
| | 4e1c4bd24e | | |
| | 2ad3fb5fc8 | | |
| | cce3390a2d | | |
| | 4fe2d01361 | | |
| | 159207b86f | | |
| | 38f3ea42a7 | | |
| | 102352eac4 | | |
| | f2da460bb9 | | |
| | b1dff5a4d3 | | |
| | 40ab287c90 | | |
| | c09a57644f | | |
| | 90af453506 | | |
| | 8bb0e68cce | | |
| | 95051020f4 | | |
| | 69961cf40b | | |
| | ef174a4c7a | | |
| | f4206d6ba1 | | |
| | 9447054a65 | | |
| | dad7c51481 | | |
| | f4a432829e | | |
| | ecbe5ffb84 | | |
| | 7a8190ecb6 | | |
| | 8e3c411a3e | | |
| | 1e1c887a2f | | |
10
CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Added
+- **🔒 HTTPS Preservation for Internal Links**: New `preserve_https_for_internal_links` configuration flag
+  - Maintains HTTPS scheme for internal links even when servers redirect to HTTP
+  - Prevents security downgrades during deep crawling
+  - Useful for security-conscious crawling and sites supporting both protocols
+  - Fully backward compatible with opt-in flag (default: `False`)
+  - Fixes issue #1410 where HTTPS URLs were being downgraded to HTTP
+
 ## [0.7.3] - 2025-08-09
 
 ### Added
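The changelog entry describes the flag only in prose, so a small standalone sketch of the behavior may help. It mirrors the condition that appears in the `normalize_url` hunk of this comparison (same host, resolved scheme `http`, not a protocol-relative `href`). The function name `resolve_internal_link` is hypothetical and is not part of Crawl4AI.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_internal_link(
    href: str,
    base_url: str,
    *,
    preserve_https: bool = False,
    original_scheme: Optional[str] = None,
) -> str:
    """Resolve href against base_url, optionally keeping HTTPS for same-host links.

    Hypothetical helper that illustrates the opt-in flag; not the library's API.
    """
    full_url = urljoin(base_url, href.strip())
    if preserve_https and original_scheme == "https":
        parsed_full = urlparse(full_url)
        parsed_base = urlparse(base_url)
        same_host = parsed_full.netloc == parsed_base.netloc
        # Protocol-relative links (//host/path) keep following the base scheme.
        protocol_relative = href.strip().startswith("//")
        if parsed_full.scheme == "http" and same_host and not protocol_relative:
            full_url = "https://" + full_url[len("http://"):]
    return full_url


# The crawl started on HTTPS, but the server redirected the page to HTTP.
print(resolve_internal_link(
    "/docs", "http://example.com/page",
    preserve_https=True, original_scheme="https",
))
```

With the flag left at its default, the same call returns the `http://` URL unchanged, which is the backward-compatible behavior the entry mentions.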
40
README.md
@@ -304,9 +304,9 @@ The new Docker implementation includes:
 ### Getting Started
 
 ```bash
-# Pull and run the latest release candidate
-docker pull unclecode/crawl4ai:0.7.0
-docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
+# Pull and run the latest release
+docker pull unclecode/crawl4ai:latest
+docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
 
 # Visit the playground at http://localhost:11235/playground
 ```
@@ -373,7 +373,7 @@ async def main():
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://docs.micronaut.io/4.7.6/guide/",
|
||||
url="https://docs.micronaut.io/4.9.9/guide/",
|
||||
config=run_config
|
||||
)
|
||||
print(len(result.markdown.raw_markdown))
|
||||
@@ -425,7 +425,7 @@ async def main():
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
@@ -919,36 +919,6 @@ We envision a future where AI is powered by real human knowledge, ensuring data
|
||||
For more details, see our [full mission statement](./MISSION.md).
|
||||
</details>
|
||||
|
||||
## 🌟 Current Sponsors
|
||||
|
||||
### 🏢 Enterprise Sponsors & Partners
|
||||
|
||||
Our enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.
|
||||
|
||||
| Company | About | Sponsorship Tier |
|
||||
|------|------|----------------------------|
|
||||
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥈 Silver |
|
||||
| <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="DataSync" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
|
||||
| <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 5–18, offering both online and on-campus education. | 🥇 Gold |
|
||||
| <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based Aleph Null is Asia’s leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
|
||||
|
||||
### 🧑🤝 Individual Sponsors
|
||||
|
||||
A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving!
|
||||
|
||||
<p align="left">
|
||||
<a href="https://github.com/hafezparast"><img src="https://avatars.githubusercontent.com/u/14273305?s=60&v=4" style="border-radius:50%;" width="64px;"/></a>
|
||||
<a href="https://github.com/ntohidi"><img src="https://avatars.githubusercontent.com/u/17140097?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/Sjoeborg"><img src="https://avatars.githubusercontent.com/u/17451310?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/romek-rozen"><img src="https://avatars.githubusercontent.com/u/30595969?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/Kourosh-Kiyani"><img src="https://avatars.githubusercontent.com/u/34105600?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/Etherdrake"><img src="https://avatars.githubusercontent.com/u/67021215?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/shaman247"><img src="https://avatars.githubusercontent.com/u/211010067?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
<a href="https://github.com/work-flow-manager"><img src="https://avatars.githubusercontent.com/u/217665461?s=60&v=4" style="border-radius:50%;"width="64px;"/></a>
|
||||
</p>
|
||||
|
||||
> Want to join them? [Sponsor Crawl4AI →](https://github.com/sponsors/unclecode)
|
||||
|
||||
## Star History
|
||||
|
||||
[](https://star-history.com/#unclecode/crawl4ai&Date)
|
||||
|
||||
@@ -97,13 +97,16 @@ def to_serializable_dict(obj: Any, ignore_default_value : bool = False) -> Dict:
|
||||
if value != param.default and not ignore_default_value:
|
||||
current_values[name] = to_serializable_dict(value)
|
||||
|
||||
if hasattr(obj, '__slots__'):
|
||||
for slot in obj.__slots__:
|
||||
if slot.startswith('_'): # Handle private slots
|
||||
attr_name = slot[1:] # Remove leading '_'
|
||||
value = getattr(obj, slot, None)
|
||||
if value is not None:
|
||||
current_values[attr_name] = to_serializable_dict(value)
|
||||
# Don't serialize private __slots__ - they're internal implementation details
|
||||
# not constructor parameters. This was causing URLPatternFilter to fail
|
||||
# because _simple_suffixes was being serialized as 'simple_suffixes'
|
||||
# if hasattr(obj, '__slots__'):
|
||||
# for slot in obj.__slots__:
|
||||
# if slot.startswith('_'): # Handle private slots
|
||||
# attr_name = slot[1:] # Remove leading '_'
|
||||
# value = getattr(obj, slot, None)
|
||||
# if value is not None:
|
||||
# current_values[attr_name] = to_serializable_dict(value)
|
||||
|
||||
|
||||
|
||||
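The comment in this hunk says that private `__slots__` are derived state rather than constructor parameters. The sketch below shows why serializing only constructor parameters round-trips cleanly, while a stripped private name such as `simple_suffixes` would be rejected. `SuffixFilter` and `constructor_params` are hypothetical stand-ins, not the real `URLPatternFilter` or `to_serializable_dict`.

```python
import inspect


class SuffixFilter:
    """Hypothetical filter with one public parameter and one derived private slot."""

    __slots__ = ("patterns", "_simple_suffixes")

    def __init__(self, patterns):
        self.patterns = list(patterns)
        # Derived state, rebuilt from `patterns`; it is not a constructor argument.
        self._simple_suffixes = {p[1:] for p in self.patterns if p.startswith("*.")}


def constructor_params(obj) -> dict:
    """Collect only the attributes whose names appear in the constructor signature."""
    signature = inspect.signature(type(obj).__init__)
    names = [name for name in signature.parameters if name != "self"]
    return {name: getattr(obj, name) for name in names if hasattr(obj, name)}


original = SuffixFilter(["*.pdf", "*.jpg"])
params = constructor_params(original)
rebuilt = SuffixFilter(**params)  # works: only `patterns` is passed back

# Passing the derived slot under its stripped name fails, as the hunk comment describes.
try:
    SuffixFilter(patterns=["*.pdf"], simple_suffixes={".pdf"})
except TypeError as exc:
    print("rejected:", exc)
```

This also suggests why the `URLPatternFilter` hunk stores `patterns`, `use_glob`, and `reverse` as public slots: the serializer then has the original arguments available to read back.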
@@ -831,12 +834,6 @@ class HTTPCrawlerConfig:
|
||||
return HTTPCrawlerConfig.from_kwargs(config)
|
||||
|
||||
class CrawlerRunConfig():
|
||||
_UNWANTED_PROPS = {
|
||||
'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
|
||||
'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
|
||||
'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
|
||||
'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
|
||||
}
|
||||
|
||||
"""
|
||||
Configuration class for controlling how the crawler runs each crawl operation.
|
||||
@@ -1043,6 +1040,12 @@ class CrawlerRunConfig():
|
||||
|
||||
url: str = None # This is not a compulsory parameter
|
||||
"""
|
||||
_UNWANTED_PROPS = {
|
||||
'disable_cache' : 'Instead, use cache_mode=CacheMode.DISABLED',
|
||||
'bypass_cache' : 'Instead, use cache_mode=CacheMode.BYPASS',
|
||||
'no_cache_read' : 'Instead, use cache_mode=CacheMode.WRITE_ONLY',
|
||||
'no_cache_write' : 'Instead, use cache_mode=CacheMode.READ_ONLY',
|
||||
}
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@@ -1121,6 +1124,7 @@ class CrawlerRunConfig():
|
||||
exclude_domains: list = None,
|
||||
exclude_internal_links: bool = False,
|
||||
score_links: bool = False,
|
||||
preserve_https_for_internal_links: bool = False,
|
||||
# Debugging and Logging Parameters
|
||||
verbose: bool = True,
|
||||
log_console: bool = False,
|
||||
@@ -1244,6 +1248,7 @@ class CrawlerRunConfig():
|
||||
self.exclude_domains = exclude_domains or []
|
||||
self.exclude_internal_links = exclude_internal_links
|
||||
self.score_links = score_links
|
||||
self.preserve_https_for_internal_links = preserve_https_for_internal_links
|
||||
|
||||
# Debugging and Logging Parameters
|
||||
self.verbose = verbose
|
||||
@@ -1517,6 +1522,7 @@ class CrawlerRunConfig():
|
||||
exclude_domains=kwargs.get("exclude_domains", []),
|
||||
exclude_internal_links=kwargs.get("exclude_internal_links", False),
|
||||
score_links=kwargs.get("score_links", False),
|
||||
preserve_https_for_internal_links=kwargs.get("preserve_https_for_internal_links", False),
|
||||
# Debugging and Logging Parameters
|
||||
verbose=kwargs.get("verbose", True),
|
||||
log_console=kwargs.get("log_console", False),
|
||||
@@ -1623,6 +1629,7 @@ class CrawlerRunConfig():
|
||||
"exclude_domains": self.exclude_domains,
|
||||
"exclude_internal_links": self.exclude_internal_links,
|
||||
"score_links": self.score_links,
|
||||
"preserve_https_for_internal_links": self.preserve_https_for_internal_links,
|
||||
"verbose": self.verbose,
|
||||
"log_console": self.log_console,
|
||||
"capture_network_requests": self.capture_network_requests,
|
||||
|
||||
@@ -824,7 +824,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error:
|
||||
visibility_info = await self.check_visibility(page)
|
||||
|
||||
if self.browser_config.config.verbose:
|
||||
if self.browser_config.verbose:
|
||||
self.logger.debug(
|
||||
message="Body visibility info: {info}",
|
||||
tag="DEBUG",
|
||||
|
||||
@@ -1037,7 +1037,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
downloaded_files=(
|
||||
self._downloaded_files if self._downloaded_files else None
|
||||
),
|
||||
redirected_url=redirected_url,
|
||||
redirected_url=page.url, # Update to current URL in case of JavaScript navigation
|
||||
# Include captured data if enabled
|
||||
network_requests=captured_requests if config.capture_network_requests else None,
|
||||
console_messages=captured_console if config.capture_console_messages else None,
|
||||
|
||||
@@ -354,6 +354,7 @@ class AsyncWebCrawler:
|
||||
###############################################################
|
||||
# Process the HTML content, Call CrawlerStrategy.process_html #
|
||||
###############################################################
|
||||
from urllib.parse import urlparse
|
||||
crawl_result: CrawlResult = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
@@ -364,6 +365,7 @@ class AsyncWebCrawler:
|
||||
verbose=config.verbose,
|
||||
is_raw_html=True if url.startswith("raw:") else False,
|
||||
redirected_url=async_response.redirected_url,
|
||||
original_scheme=urlparse(url).scheme,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@@ -478,7 +480,7 @@ class AsyncWebCrawler:
|
||||
# Scraping Strategy Execution #
|
||||
################################
|
||||
result: ScrapingResult = scraping_strategy.scrap(
|
||||
url, html, **params)
|
||||
kwargs.get("redirected_url", url), html, **params)
|
||||
|
||||
if result is None:
|
||||
raise ValueError(
|
||||
|
||||
@@ -258,7 +258,11 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
continue
|
||||
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
normalized_href = normalize_url(
|
||||
href, url,
|
||||
preserve_https=kwargs.get('preserve_https_for_internal_links', False),
|
||||
original_scheme=kwargs.get('original_scheme')
|
||||
)
|
||||
link_data = {
|
||||
"href": normalized_href,
|
||||
"text": link.text_content().strip(),
|
||||
|
||||
@@ -47,7 +47,13 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
self.url_scorer = url_scorer
|
||||
self.include_external = include_external
|
||||
self.max_pages = max_pages
|
||||
self.logger = logger or logging.getLogger(__name__)
|
||||
# self.logger = logger or logging.getLogger(__name__)
|
||||
# Ensure logger is always a Logger instance, not a dict from serialization
|
||||
if isinstance(logger, logging.Logger):
|
||||
self.logger = logger
|
||||
else:
|
||||
# Create a new logger if logger is None, dict, or any other non-Logger type
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.stats = TraversalStats(start_time=datetime.now())
|
||||
self._cancel_event = asyncio.Event()
|
||||
self._pages_crawled = 0
|
||||
|
||||
@@ -38,7 +38,13 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
|
||||
self.include_external = include_external
|
||||
self.score_threshold = score_threshold
|
||||
self.max_pages = max_pages
|
||||
self.logger = logger or logging.getLogger(__name__)
|
||||
# self.logger = logger or logging.getLogger(__name__)
|
||||
# Ensure logger is always a Logger instance, not a dict from serialization
|
||||
if isinstance(logger, logging.Logger):
|
||||
self.logger = logger
|
||||
else:
|
||||
# Create a new logger if logger is None, dict, or any other non-Logger type
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.stats = TraversalStats(start_time=datetime.now())
|
||||
self._cancel_event = asyncio.Event()
|
||||
self._pages_crawled = 0
|
||||
|
||||
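Both strategy hunks replace `logger or logging.getLogger(__name__)` with an explicit type check. A minimal sketch of the difference follows, assuming that a deserialized config can hand the constructor a non-empty `dict` in place of a logger. The helper name `coerce_logger` is hypothetical.

```python
import logging
from typing import Any


def coerce_logger(logger: Any, name: str = "deep_crawl") -> logging.Logger:
    """Return `logger` if it is a real Logger, otherwise fall back to a named logger."""
    if isinstance(logger, logging.Logger):
        return logger
    # None, dicts left over from serialization, and any other type land here.
    return logging.getLogger(name)


from_dict = {"name": "serialized-logger"}

# The old pattern keeps the dict, because a non-empty dict is truthy.
old_style = from_dict or logging.getLogger("deep_crawl")
print(type(old_style).__name__)

# The type check always yields a usable Logger.
print(type(coerce_logger(from_dict)).__name__)
```

The `or` idiom only guards against falsy values such as `None`, so a later call like `self.logger.debug(...)` would raise `AttributeError` on the dict.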
@@ -120,6 +120,9 @@ class URLPatternFilter(URLFilter):
|
||||
"""Pattern filter balancing speed and completeness"""
|
||||
|
||||
__slots__ = (
|
||||
"patterns", # Store original patterns for serialization
|
||||
"use_glob", # Store original use_glob for serialization
|
||||
"reverse", # Store original reverse for serialization
|
||||
"_simple_suffixes",
|
||||
"_simple_prefixes",
|
||||
"_domain_patterns",
|
||||
@@ -142,6 +145,11 @@ class URLPatternFilter(URLFilter):
|
||||
reverse: bool = False,
|
||||
):
|
||||
super().__init__()
|
||||
# Store original constructor params for serialization
|
||||
self.patterns = patterns
|
||||
self.use_glob = use_glob
|
||||
self.reverse = reverse
|
||||
|
||||
self._reverse = reverse
|
||||
patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
|
||||
|
||||
|
||||
@@ -253,6 +253,16 @@ class CrawlResult(BaseModel):
|
||||
requirements change, this is where you would update the logic.
|
||||
"""
|
||||
result = super().model_dump(*args, **kwargs)
|
||||
|
||||
# Remove any property descriptors that might have been included
|
||||
# These deprecated properties should not be in the serialized output
|
||||
for key in ['fit_html', 'fit_markdown', 'markdown_v2']:
|
||||
if key in result and isinstance(result[key], property):
|
||||
# del result[key]
|
||||
# Nasrin: I decided to convert it to string instead of removing it.
|
||||
result[key] = str(result[key])
|
||||
|
||||
# Add the markdown field properly
|
||||
if self._markdown is not None:
|
||||
result["markdown"] = self._markdown.model_dump()
|
||||
return result
|
||||
|
||||
@@ -1790,6 +1790,10 @@ def perform_completion_with_backoff(
|
||||
except RateLimitError as e:
|
||||
print("Rate limit error:", str(e))
|
||||
|
||||
if attempt == max_attempts - 1:
|
||||
# Last attempt failed, raise the error.
|
||||
raise
|
||||
|
||||
# Check if we have exhausted our max attempts
|
||||
if attempt < max_attempts - 1:
|
||||
# Calculate the delay and wait
|
||||
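The added lines re-raise the rate-limit error once the final attempt fails, rather than letting it be swallowed. A self-contained sketch of that control flow is shown here. `RateLimited`, `call_with_backoff`, and the doubling delay are assumptions for illustration; the real function's delay calculation is not visible in this hunk.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with a doubling delay between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                # Last attempt failed: surface the error to the caller.
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.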
@@ -2145,8 +2149,12 @@ def normalize_url(
|
||||
*,
|
||||
drop_query_tracking=True,
|
||||
sort_query=True,
|
||||
keep_fragment=False,
|
||||
extra_drop_params=None
|
||||
keep_fragment=True,
|
||||
remove_fragments=None, # alias for keep_fragment=False
|
||||
extra_drop_params=None,
|
||||
params_to_remove=None, # alias for extra_drop_params
|
||||
preserve_https=False,
|
||||
original_scheme=None
|
||||
):
|
||||
"""
|
||||
Extended URL normalizer
|
||||
@@ -2169,25 +2177,64 @@ def normalize_url(
|
||||
Returns
|
||||
-------
|
||||
str | None
|
||||
A clean, canonical URL or None if href is empty/None.
|
||||
A clean, canonical URL or the base URL if href is empty/None.
|
||||
"""
|
||||
if not href:
|
||||
return None
|
||||
# For empty href, return the base URL (matching urljoin behavior)
|
||||
return base_url
|
||||
|
||||
# Validate base URL format
|
||||
parsed_base = urlparse(base_url)
|
||||
if not parsed_base.scheme or not parsed_base.netloc:
|
||||
raise ValueError(f"Invalid base URL format: {base_url}")
|
||||
|
||||
if parsed_base.scheme.lower() not in ["http", "https"]:
|
||||
# Handle special protocols
|
||||
raise ValueError(f"Invalid base URL format: {base_url}")
|
||||
|
||||
# Resolve relative paths first
|
||||
full_url = urljoin(base_url, href.strip())
|
||||
|
||||
# Preserve HTTPS if requested and original scheme was HTTPS
|
||||
if preserve_https and original_scheme == 'https':
|
||||
parsed_full = urlparse(full_url)
|
||||
parsed_base = urlparse(base_url)
|
||||
# Only preserve HTTPS for same-domain links (not protocol-relative URLs)
|
||||
# Protocol-relative URLs (//example.com) should follow the base URL's scheme
|
||||
if (parsed_full.scheme == 'http' and
|
||||
parsed_full.netloc == parsed_base.netloc and
|
||||
not href.strip().startswith('//')):
|
||||
full_url = full_url.replace('http://', 'https://', 1)
|
||||
|
||||
# Parse once, edit parts, then rebuild
|
||||
parsed = urlparse(full_url)
|
||||
|
||||
# ── netloc ──
|
||||
netloc = parsed.netloc.lower()
|
||||
|
||||
# Remove default ports (80 for http, 443 for https)
|
||||
if ':' in netloc:
|
||||
host, port = netloc.rsplit(':', 1)
|
||||
if (parsed.scheme == 'http' and port == '80') or (parsed.scheme == 'https' and port == '443'):
|
||||
netloc = host
|
||||
|
||||
# ── path ──
|
||||
# Strip duplicate slashes and trailing “/” (except root)
|
||||
path = quote(unquote(parsed.path))
|
||||
# Strip duplicate slashes and trailing "/" (except root)
|
||||
# IMPORTANT: Don't use quote(unquote()) as it mangles + signs in URLs
|
||||
# The path from urlparse is already properly encoded
|
||||
path = parsed.path
|
||||
if path.endswith('/') and path != '/':
|
||||
path = path.rstrip('/')
|
||||
# Only strip trailing slash if the original href didn't have a trailing slash
|
||||
# and the base_url didn't end with a slash
|
||||
base_parsed = urlparse(base_url)
|
||||
if not href.strip().endswith('/') and not base_parsed.path.endswith('/'):
|
||||
path = path.rstrip('/')
|
||||
# Add trailing slash for URLs without explicit paths (indicates directory)
|
||||
# But skip this for special protocols that don't use standard URL structure
|
||||
elif not path:
|
||||
special_protocols = {"javascript:", "mailto:", "tel:", "file:", "data:"}
|
||||
if not any(href.strip().lower().startswith(p) for p in special_protocols):
|
||||
path = '/'
|
||||
|
||||
# ── query ──
|
||||
query = parsed.query
|
||||
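The default-port step in this hunk can be tried in isolation. The sketch below reuses the same `rsplit(':', 1)` approach on the lowercased netloc. `strip_default_port` is a hypothetical name, and the helper deliberately covers only this one step of the normalizer.

```python
from urllib.parse import urlparse, urlunparse


def strip_default_port(url: str) -> str:
    """Lowercase the host and drop :80 for http or :443 for https."""
    parsed = urlparse(url)
    netloc = parsed.netloc.lower()
    if ":" in netloc:
        host, port = netloc.rsplit(":", 1)
        if (parsed.scheme, port) in {("http", "80"), ("https", "443")}:
            netloc = host
    return urlunparse(parsed._replace(netloc=netloc))


print(strip_default_port("https://Example.com:443/a?b=1"))
print(strip_default_port("http://example.com:8080/"))
```

A port is removed only when it matches the scheme's own default, so `http://example.com:443/` is left as it is.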
@@ -2202,6 +2249,8 @@ def normalize_url(
|
||||
}
|
||||
if extra_drop_params:
|
||||
default_tracking |= {p.lower() for p in extra_drop_params}
|
||||
if params_to_remove:
|
||||
default_tracking |= {p.lower() for p in params_to_remove}
|
||||
params = [(k, v) for k, v in params if k not in default_tracking]
|
||||
|
||||
if sort_query:
|
||||
@@ -2210,7 +2259,10 @@ def normalize_url(
|
||||
query = urlencode(params, doseq=True) if params else ''
|
||||
|
||||
# ── fragment ──
|
||||
fragment = parsed.fragment if keep_fragment else ''
|
||||
if remove_fragments is True:
|
||||
fragment = ''
|
||||
else:
|
||||
fragment = parsed.fragment if keep_fragment else ''
|
||||
|
||||
# Re-assemble
|
||||
normalized = urlunparse((
|
||||
@@ -2225,7 +2277,7 @@ def normalize_url(
|
||||
return normalized
|
||||
|
||||
|
||||
def normalize_url_for_deep_crawl(href, base_url):
|
||||
def normalize_url_for_deep_crawl(href, base_url, preserve_https=False, original_scheme=None):
|
||||
"""Normalize URLs to ensure consistent format"""
|
||||
from urllib.parse import urljoin, urlparse, urlunparse, parse_qs, urlencode
|
||||
|
||||
@@ -2236,6 +2288,17 @@ def normalize_url_for_deep_crawl(href, base_url):
|
||||
# Use urljoin to handle relative URLs
|
||||
full_url = urljoin(base_url, href.strip())
|
||||
|
||||
# Preserve HTTPS if requested and original scheme was HTTPS
|
||||
if preserve_https and original_scheme == 'https':
|
||||
parsed_full = urlparse(full_url)
|
||||
parsed_base = urlparse(base_url)
|
||||
# Only preserve HTTPS for same-domain links (not protocol-relative URLs)
|
||||
# Protocol-relative URLs (//example.com) should follow the base URL's scheme
|
||||
if (parsed_full.scheme == 'http' and
|
||||
parsed_full.netloc == parsed_base.netloc and
|
||||
not href.strip().startswith('//')):
|
||||
full_url = full_url.replace('http://', 'https://', 1)
|
||||
|
||||
# Parse the URL for normalization
|
||||
parsed = urlparse(full_url)
|
||||
|
||||
@@ -2273,7 +2336,7 @@ def normalize_url_for_deep_crawl(href, base_url):
|
||||
return normalized
|
||||
|
||||
@lru_cache(maxsize=10000)
|
||||
def efficient_normalize_url_for_deep_crawl(href, base_url):
|
||||
def efficient_normalize_url_for_deep_crawl(href, base_url, preserve_https=False, original_scheme=None):
|
||||
"""Efficient URL normalization with proper parsing"""
|
||||
from urllib.parse import urljoin
|
||||
|
||||
@@ -2283,6 +2346,17 @@ def efficient_normalize_url_for_deep_crawl(href, base_url):
|
||||
# Resolve relative URLs
|
||||
full_url = urljoin(base_url, href.strip())
|
||||
|
||||
# Preserve HTTPS if requested and original scheme was HTTPS
|
||||
if preserve_https and original_scheme == 'https':
|
||||
parsed_full = urlparse(full_url)
|
||||
parsed_base = urlparse(base_url)
|
||||
# Only preserve HTTPS for same-domain links (not protocol-relative URLs)
|
||||
# Protocol-relative URLs (//example.com) should follow the base URL's scheme
|
||||
if (parsed_full.scheme == 'http' and
|
||||
parsed_full.netloc == parsed_base.netloc and
|
||||
not href.strip().startswith('//')):
|
||||
full_url = full_url.replace('http://', 'https://', 1)
|
||||
|
||||
# Use proper URL parsing
|
||||
parsed = urlparse(full_url)
|
||||
|
||||
@@ -2412,9 +2486,19 @@ def is_external_url(url: str, base_domain: str) -> bool:
|
||||
if not parsed.netloc: # Relative URL
|
||||
return False
|
||||
|
||||
# Strip 'www.' from both domains for comparison
|
||||
url_domain = parsed.netloc.lower().replace("www.", "")
|
||||
base = base_domain.lower().replace("www.", "")
|
||||
# Don't strip 'www.' from domains for comparison - treat www.example.com and example.com as different
|
||||
url_domain = parsed.netloc.lower()
|
||||
base = base_domain.lower()
|
||||
|
||||
# Strip user credentials from URL domain
|
||||
if '@' in url_domain:
|
||||
url_domain = url_domain.split('@', 1)[1]
|
||||
|
||||
# Strip ports from both for comparison (any port should be considered same domain)
|
||||
if ':' in url_domain:
|
||||
url_domain = url_domain.rsplit(':', 1)[0]
|
||||
if ':' in base:
|
||||
base = base.rsplit(':', 1)[0]
|
||||
|
||||
# Check if URL domain ends with base domain
|
||||
return not url_domain.endswith(base)
|
||||
|
||||
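To see what the revised comparison accepts and rejects, here is a small sketch that follows the same steps: lowercase, drop credentials, drop the port, then apply a suffix check. The names `host_only` and `is_external` are hypothetical, and unlike the hunk this sketch also strips credentials from the base domain.

```python
from urllib.parse import urlparse


def host_only(netloc: str) -> str:
    """Lowercase a netloc and strip any user credentials and port."""
    host = netloc.lower()
    if "@" in host:
        host = host.split("@", 1)[1]
    if ":" in host:
        host = host.rsplit(":", 1)[0]
    return host


def is_external(url: str, base_domain: str) -> bool:
    """Suffix-based check modeled on the logic in the hunk (a sketch only)."""
    parsed = urlparse(url)
    if not parsed.netloc:  # relative URL
        return False
    return not host_only(parsed.netloc).endswith(host_only(base_domain))


for url, base in [
    ("/about", "example.com"),
    ("https://user:pw@example.com:8443/x", "example.com"),
    ("https://example.com/", "www.example.com"),
    ("https://www.example.com/", "example.com"),
    ("https://notexample.com/", "example.com"),
]:
    print(url, base, is_external(url, base))
```

Two properties of a plain suffix check show up here. The `www.` handling is asymmetric: `www.example.com` counts as internal to `example.com`, but not the other way round. An unrelated host such as `notexample.com` also passes as internal. Matching on an exact host or a `"." + base` suffix would avoid the second case.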
@@ -10,4 +10,23 @@ GEMINI_API_TOKEN=your_gemini_key_here
|
||||
# Optional: Override the default LLM provider
|
||||
# Examples: "openai/gpt-4", "anthropic/claude-3-opus", "deepseek/chat", etc.
|
||||
# If not set, uses the provider specified in config.yml (default: openai/gpt-4o-mini)
|
||||
# LLM_PROVIDER=anthropic/claude-3-opus
|
||||
# LLM_PROVIDER=anthropic/claude-3-opus
|
||||
|
||||
# Optional: Global LLM temperature setting (0.0-2.0)
|
||||
# Controls randomness in responses. Lower = more focused, Higher = more creative
|
||||
# LLM_TEMPERATURE=0.7
|
||||
|
||||
# Optional: Global custom API base URL
|
||||
# Use this to point to custom endpoints or proxy servers
|
||||
# LLM_BASE_URL=https://api.custom.com/v1
|
||||
|
||||
# Optional: Provider-specific temperature overrides
|
||||
# These take precedence over the global LLM_TEMPERATURE
|
||||
# OPENAI_TEMPERATURE=0.5
|
||||
# ANTHROPIC_TEMPERATURE=0.3
|
||||
# GROQ_TEMPERATURE=0.8
|
||||
|
||||
# Optional: Provider-specific base URL overrides
|
||||
# Use for provider-specific proxy endpoints
|
||||
# OPENAI_BASE_URL=https://custom-openai.company.com/v1
|
||||
# GROQ_BASE_URL=https://custom-groq.company.com/v1
|
||||
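The comments above state that provider-specific variables take precedence over the global ones. A minimal sketch of that lookup order follows. It assumes the variable prefix is the part of the provider string before the slash, uppercased; that mapping and the name `resolve_llm_setting` are illustrative guesses, not the server's actual implementation.

```python
from typing import Mapping, Optional


def resolve_llm_setting(
    provider: str, setting: str, env: Mapping[str, str]
) -> Optional[str]:
    """Prefer <PROVIDER>_<SETTING>, then fall back to LLM_<SETTING>."""
    # Assumption: "openai/gpt-4o-mini" maps to the prefix "OPENAI".
    prefix = provider.split("/", 1)[0].upper()
    specific = env.get(f"{prefix}_{setting}")
    if specific:
        return specific
    return env.get(f"LLM_{setting}") or None


env = {"LLM_TEMPERATURE": "0.7", "OPENAI_TEMPERATURE": "0.5"}
print(resolve_llm_setting("openai/gpt-4o-mini", "TEMPERATURE", env))
print(resolve_llm_setting("groq/some-model", "TEMPERATURE", env))
```

The same function covers the base URL variables by passing `"BASE_URL"` as the setting name.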
@@ -12,7 +12,6 @@
|
||||
- [Python SDK](#python-sdk)
|
||||
- [Understanding Request Schema](#understanding-request-schema)
|
||||
- [REST API Examples](#rest-api-examples)
|
||||
- [Asynchronous Jobs with Webhooks](#asynchronous-jobs-with-webhooks)
|
||||
- [Additional API Endpoints](#additional-api-endpoints)
|
||||
- [HTML Extraction Endpoint](#html-extraction-endpoint)
|
||||
- [Screenshot Endpoint](#screenshot-endpoint)
|
||||
@@ -649,146 +648,6 @@ async def test_stream_crawl(token: str = None): # Made token optional
|
||||
# asyncio.run(test_stream_crawl())
|
||||
```
|
||||
|
||||
### Asynchronous Jobs with Webhooks
|
||||
|
||||
For long-running crawls or when you want to avoid keeping connections open, use the job queue endpoints. Instead of polling for results, configure a webhook to receive notifications when jobs complete.
|
||||
|
||||
#### Why Use Jobs & Webhooks?
|
||||
|
||||
- **No Polling Required** - Get notified when crawls complete instead of constantly checking status
|
||||
- **Better Resource Usage** - Free up client connections while jobs run in the background
|
||||
- **Scalable Architecture** - Ideal for high-volume crawling with TypeScript/Node.js clients or microservices
|
||||
- **Reliable Delivery** - Automatic retry with exponential backoff (5 attempts: 1s → 2s → 4s → 8s → 16s)
|
||||
|
||||
#### How It Works
|
||||
|
||||
1. **Submit Job** → POST to `/crawl/job` with optional `webhook_config`
|
||||
2. **Get Task ID** → Receive a `task_id` immediately
|
||||
3. **Job Runs** → Crawl executes in the background
|
||||
4. **Webhook Fired** → Server POSTs completion notification to your webhook URL
|
||||
5. **Fetch Results** → If data wasn't included in webhook, GET `/crawl/job/{task_id}`
|
||||
|
||||
#### Quick Example
|
||||
|
||||
```bash
|
||||
# Submit a crawl job with webhook notification
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
||||
"webhook_data_in_payload": false
|
||||
}
|
||||
}'
|
||||
|
||||
# Response: {"task_id": "crawl_a1b2c3d4"}
|
||||
```
|
||||
|
||||
**Your webhook receives:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4",
|
||||
"task_type": "crawl",
|
||||
"status": "completed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"]
|
||||
}
|
||||
```
|
||||
|
||||
Then fetch the results:
|
||||
```bash
|
||||
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
|
||||
```
|
||||
|
||||
#### Include Data in Webhook
|
||||
|
||||
Set `webhook_data_in_payload: true` to receive the full crawl results directly in the webhook:
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
||||
"webhook_data_in_payload": true
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
**Your webhook receives the complete data:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4",
|
||||
"task_type": "crawl",
|
||||
"status": "completed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"],
|
||||
"data": {
|
||||
"markdown": "...",
|
||||
"html": "...",
|
||||
"links": {...},
|
||||
"metadata": {...}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Webhook Authentication
|
||||
|
||||
Add custom headers for authentication:
|
||||
|
||||
```json
|
||||
{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl",
|
||||
"webhook_data_in_payload": false,
|
||||
"webhook_headers": {
|
||||
"X-Webhook-Secret": "your-secret-token",
|
||||
"X-Service-ID": "crawl4ai-prod"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
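On the receiving side, the custom header can be checked before the payload is trusted. The sketch below is framework-agnostic and assumes the handler has the request headers available as a mapping. `is_authorized` is a hypothetical helper, and only the header name `X-Webhook-Secret` comes from the example above.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Compare the X-Webhook-Secret header with the expected value in constant time."""
    received = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so normalize before comparing.
        if name.lower() == "x-webhook-secret":
            received = value
    return hmac.compare_digest(received.encode(), expected_secret.encode())


incoming = {"Content-Type": "application/json", "x-webhook-secret": "your-secret-token"}
print(is_authorized(incoming, "your-secret-token"))
```

`hmac.compare_digest` avoids leaking how many leading characters matched, which a plain `==` comparison can do through timing.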
|
||||
|
||||
#### Global Default Webhook
|
||||
|
||||
Configure a default webhook URL in `config.yml` for all jobs:
|
||||
|
||||
```yaml
|
||||
webhooks:
|
||||
enabled: true
|
||||
default_url: "https://myapp.com/webhooks/default"
|
||||
data_in_payload: false
|
||||
retry:
|
||||
max_attempts: 5
|
||||
initial_delay_ms: 1000
|
||||
max_delay_ms: 32000
|
||||
timeout_ms: 30000
|
||||
```
|
||||
|
||||
Now jobs without `webhook_config` automatically use the default webhook.
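The retry values in this config imply a delay schedule that is easy to compute by hand. The sketch below assumes each delay doubles from `initial_delay_ms` and is capped at `max_delay_ms`. Whether the server waits before the first attempt or only between attempts is not stated here, so treat the list as the sequence of candidate delays only.

```python
from typing import List


def retry_delays_ms(
    max_attempts: int = 5,
    initial_delay_ms: int = 1000,
    max_delay_ms: int = 32000,
) -> List[int]:
    """Doubling delays, one per attempt, capped at max_delay_ms."""
    return [min(initial_delay_ms * 2 ** i, max_delay_ms) for i in range(max_attempts)]


print(retry_delays_ms())
```

With the defaults shown in the YAML, this gives 1000, 2000, 4000, 8000, and 16000 ms, so the 32000 ms cap only takes effect if `max_attempts` is raised above six.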
|
||||
|
||||
#### Job Status Polling (Without Webhooks)
|
||||
|
||||
If you prefer polling instead of webhooks, just omit `webhook_config`:
|
||||
|
||||
```bash
|
||||
# Submit job
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"urls": ["https://example.com"]}'
|
||||
# Response: {"task_id": "crawl_xyz"}
|
||||
|
||||
# Poll for status
|
||||
curl http://localhost:11235/crawl/job/crawl_xyz
|
||||
```
|
||||
|
||||
The response includes `status` field: `"processing"`, `"completed"`, or `"failed"`.
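A polling loop over those three status values could look like the sketch below. The endpoint path and the status strings come from the text above. The function names, the polling interval, and the poll limit are assumptions, and `fetch` is injected so the loop can run without a live server.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def fetch_job(task_id: str, base_url: str = "http://localhost:11235") -> Dict:
    """GET /crawl/job/{task_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/crawl/job/{task_id}") as resp:
        return json.load(resp)


def wait_for_job(
    task_id: str,
    fetch: Callable[[str], Dict] = fetch_job,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reports "completed" or "failed", or give up."""
    for _ in range(max_polls):
        job = fetch(task_id)
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval)
    raise TimeoutError(f"job {task_id} still processing after {max_polls} polls")
```

Calling `wait_for_job("crawl_xyz")` against a running server returns the final job document. A caller would still need to check whether the status is `"failed"` before reading results.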
|
||||
|
||||
> 💡 **Pro tip**: See [WEBHOOK_EXAMPLES.md](./WEBHOOK_EXAMPLES.md) for detailed examples including TypeScript client code, Flask webhook handlers, and failure handling.
|
||||
|
||||
---
|
||||
|
||||
## Metrics & Monitoring
|
||||
@@ -833,8 +692,7 @@ app:
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini" # Can be overridden by LLM_PROVIDER env var
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
# api_key: sk-... # If you pass the API key directly (not recommended)
|
||||
|
||||
# Redis Configuration (Used by internal Redis server managed by supervisord)
|
||||
redis:
|
||||
@@ -968,11 +826,10 @@ We're here to help you succeed with Crawl4AI! Here's how to get support:
|
||||
|
||||
In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:
|
||||
- Building and running the Docker container
|
||||
- Configuring the environment
|
||||
- Configuring the environment
|
||||
- Using the interactive playground for testing
|
||||
- Making API requests with proper typing
|
||||
- Using the Python SDK
|
||||
- Asynchronous job queues with webhook notifications
|
||||
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
|
||||
- Connecting via the Model Context Protocol (MCP)
|
||||
- Monitoring your deployment
|
||||
|
||||
@@ -1,378 +0,0 @@
|
||||
# Webhook Feature Examples
|
||||
|
||||
This document provides examples of how to use the webhook feature for crawl jobs in Crawl4AI.
|
||||
|
||||
## Overview
|
||||
|
||||
The webhook feature lets you receive a notification when a crawl job completes, so you do not need to poll for status. Failed deliveries are retried with exponential backoff to make delivery more reliable.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Global Configuration (config.yml)
|
||||
|
||||
You can configure default webhook settings in `config.yml`:
|
||||
|
||||
```yaml
|
||||
webhooks:
|
||||
enabled: true
|
||||
default_url: null # Optional: default webhook URL for all jobs
|
||||
data_in_payload: false # Optional: default behavior for including data
|
||||
retry:
|
||||
max_attempts: 5
|
||||
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
|
||||
max_delay_ms: 32000
|
||||
timeout_ms: 30000 # 30s timeout per webhook call
|
||||
headers: # Optional: default headers to include
|
||||
User-Agent: "Crawl4AI-Webhook/1.0"
|
||||
```
|
||||
|
||||
## API Usage Examples
|
||||
|
||||
### Example 1: Basic Webhook (Notification Only)
|
||||
|
||||
Send a webhook notification without including the crawl data in the payload.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
||||
"webhook_data_in_payload": false
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4"
|
||||
}
|
||||
```
|
||||
|
||||
**Webhook Payload Received:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4",
|
||||
"task_type": "crawl",
|
||||
"status": "completed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"]
|
||||
}
|
||||
```
|
||||
|
||||
Your webhook handler should then fetch the results:
|
||||
```bash
|
||||
curl http://localhost:11235/crawl/job/crawl_a1b2c3d4
|
||||
```
|
||||
|
||||
### Example 2: Webhook with Data Included
|
||||
|
||||
Include the full crawl results in the webhook payload.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
||||
"webhook_data_in_payload": true
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
**Webhook Payload Received:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4",
|
||||
"task_type": "crawl",
|
||||
"status": "completed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"],
|
||||
"data": {
|
||||
"markdown": "...",
|
||||
"html": "...",
|
||||
"links": {...},
|
||||
"metadata": {...}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Example 3: Webhook with Custom Headers
|
||||
|
||||
Include custom headers for authentication or identification.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"],
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/crawl-complete",
|
||||
"webhook_data_in_payload": false,
|
||||
"webhook_headers": {
|
||||
"X-Webhook-Secret": "my-secret-token",
|
||||
"X-Service-ID": "crawl4ai-production"
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
The webhook will be sent with these additional headers plus the default headers from config.
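
On the receiving side, a handler can compare the custom header against a shared secret before trusting the payload. The sketch below is one way to do that, assuming the `X-Webhook-Secret` header from the request above; the function name `is_authorized` is hypothetical, and the check is an illustration rather than something Crawl4AI provides.

```python
import hmac


def is_authorized(headers, expected_secret, header_name="X-Webhook-Secret"):
    """Return True if the request carries the expected shared secret.

    headers: a mapping of header names to values (most web frameworks
             expose one that handles case-insensitive lookup).
    """
    received = headers.get(header_name, "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(received.encode("utf-8"), expected_secret.encode("utf-8"))
```

A handler would typically reject the request with a 401 or 403 response when this returns `False`.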
|
||||
|
||||
### Example 4: Failure Notification
|
||||
|
||||
When a crawl job fails, a webhook is sent with error details.
|
||||
|
||||
**Webhook Payload on Failure:**
|
||||
```json
|
||||
{
|
||||
"task_id": "crawl_a1b2c3d4",
|
||||
"task_type": "crawl",
|
||||
"status": "failed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"],
|
||||
"error": "Connection timeout after 30s"
|
||||
}
|
||||
```
|
||||
|
||||
### Example 5: Using Global Default Webhook
|
||||
|
||||
If you set a `default_url` in `config.yml`, jobs submitted without a `webhook_config` will use it:
|
||||
|
||||
**config.yml:**
|
||||
```yaml
|
||||
webhooks:
|
||||
enabled: true
|
||||
default_url: "https://myapp.com/webhooks/default"
|
||||
data_in_payload: false
|
||||
```
|
||||
|
||||
**Request (no webhook_config needed):**
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/crawl/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"urls": ["https://example.com"]
|
||||
}'
|
||||
```
|
||||
|
||||
The webhook will be sent to the default URL configured in config.yml.
|
||||
|
||||
### Example 6: LLM Extraction Job with Webhook
|
||||
|
||||
Use webhooks with the LLM extraction endpoint for asynchronous processing.
|
||||
|
||||
**Request:**
|
||||
```bash
|
||||
curl -X POST http://localhost:11235/llm/job \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"url": "https://example.com/article",
|
||||
"q": "Extract the article title, author, and publication date",
|
||||
"schema": "{\"type\": \"object\", \"properties\": {\"title\": {\"type\": \"string\"}, \"author\": {\"type\": \"string\"}, \"date\": {\"type\": \"string\"}}}",
|
||||
"cache": false,
|
||||
"provider": "openai/gpt-4o-mini",
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhooks/llm-complete",
|
||||
"webhook_data_in_payload": true
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"task_id": "llm_1698765432_12345"
|
||||
}
|
||||
```
|
||||
|
||||
**Webhook Payload Received:**
|
||||
```json
|
||||
{
|
||||
"task_id": "llm_1698765432_12345",
|
||||
"task_type": "llm_extraction",
|
||||
"status": "completed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com/article"],
|
||||
"data": {
|
||||
"extracted_content": {
|
||||
"title": "Understanding Web Scraping",
|
||||
"author": "John Doe",
|
||||
"date": "2025-10-21"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Webhook Handler Example
|
||||
|
||||
Here's a simple Python Flask webhook handler that supports both crawl and LLM extraction jobs:
|
||||
|
||||
```python
|
||||
from flask import Flask, request, jsonify
|
||||
import requests
|
||||
|
||||
app = Flask(__name__)
|
||||
|
||||
@app.route('/webhooks/crawl-complete', methods=['POST'])
|
||||
def handle_crawl_webhook():
|
||||
payload = request.json
|
||||
|
||||
task_id = payload['task_id']
|
||||
task_type = payload['task_type']
|
||||
status = payload['status']
|
||||
|
||||
if status == 'completed':
|
||||
# If data not in payload, fetch it
|
||||
if 'data' not in payload:
|
||||
# Determine endpoint based on task type
|
||||
endpoint = 'crawl' if task_type == 'crawl' else 'llm'
|
||||
response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
|
||||
data = response.json()
|
||||
else:
|
||||
data = payload['data']
|
||||
|
||||
# Process based on task type
|
||||
if task_type == 'crawl':
|
||||
print(f"Processing crawl results for {task_id}")
|
||||
# Handle crawl results
|
||||
results = data.get('results', [])
|
||||
for result in results:
|
||||
print(f" - {result.get('url')}: {len(result.get('markdown', ''))} chars")
|
||||
|
||||
elif task_type == 'llm_extraction':
|
||||
print(f"Processing LLM extraction for {task_id}")
|
||||
# Handle LLM extraction
|
||||
# Note: Webhook sends 'extracted_content', API returns 'result'
|
||||
extracted = data.get('extracted_content', data.get('result', {}))
|
||||
print(f" - Extracted: {extracted}")
|
||||
|
||||
# Your business logic here...
|
||||
|
||||
elif status == 'failed':
|
||||
error = payload.get('error', 'Unknown error')
|
||||
print(f"{task_type} job {task_id} failed: {error}")
|
||||
# Handle failure...
|
||||
|
||||
return jsonify({"status": "received"}), 200
|
||||
|
||||
if __name__ == '__main__':
|
||||
app.run(port=8080)
|
||||
```
|
||||
|
||||
## Retry Logic
|
||||
|
||||
The webhook delivery service uses exponential backoff retry logic:
|
||||
|
||||
- **Attempts:** Up to 5 attempts by default
|
||||
- **Delays:** 1s → 2s → 4s → 8s → 16s
|
||||
- **Timeout:** 30 seconds per attempt
|
||||
- **Retry Conditions:**
|
||||
- Server errors (5xx status codes)
|
||||
- Network errors
|
||||
- Timeouts
|
||||
- **No Retry:**
|
||||
- Client errors (4xx status codes)
|
||||
- Successful delivery (2xx status codes)
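
The schedule and retry conditions above can be sketched as two small functions. This is an illustration under stated assumptions, not the delivery service's actual code: the function names are hypothetical, the defaults mirror the `retry` values shown in `config.yml`, and a status of `None` is used here to represent a network error or timeout.

```python
def backoff_delays_ms(max_attempts=5, initial_delay_ms=1000, max_delay_ms=32000):
    """Return one delay per attempt, doubling each time and capped at max_delay_ms."""
    return [min(initial_delay_ms * (2 ** i), max_delay_ms) for i in range(max_attempts)]


def should_retry(status_code):
    """Decide whether a delivery attempt should be retried.

    status_code: HTTP status of the webhook response, or None when the
                 attempt ended in a network error or timeout (assumption).
    """
    if status_code is None:
        return True
    # Only server errors are retried; 2xx is success and 4xx is a client error.
    return 500 <= status_code < 600
```

With the default values, `backoff_delays_ms()` yields the 1s, 2s, 4s, 8s, 16s sequence listed above.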
|
||||
|
||||
## Benefits
|
||||
|
||||
1. **No Polling Required** - Eliminates constant API calls to check job status
|
||||
2. **Real-time Notifications** - Immediate notification when jobs complete
|
||||
3. **Reliable Delivery** - Exponential backoff ensures webhooks are delivered
|
||||
4. **Flexible** - Choose between notification-only or full data delivery
|
||||
5. **Secure** - Support for custom headers for authentication
|
||||
6. **Configurable** - Global defaults or per-job configuration
|
||||
7. **Universal Support** - Works with both `/crawl/job` and `/llm/job` endpoints
|
||||
|
||||
## TypeScript Client Example
|
||||
|
||||
```typescript
|
||||
interface WebhookConfig {
|
||||
webhook_url: string;
|
||||
webhook_data_in_payload?: boolean;
|
||||
webhook_headers?: Record<string, string>;
|
||||
}
|
||||
|
||||
interface CrawlJobRequest {
|
||||
urls: string[];
|
||||
browser_config?: Record<string, any>;
|
||||
crawler_config?: Record<string, any>;
|
||||
webhook_config?: WebhookConfig;
|
||||
}
|
||||
|
||||
interface LLMJobRequest {
|
||||
url: string;
|
||||
q: string;
|
||||
schema?: string;
|
||||
cache?: boolean;
|
||||
provider?: string;
|
||||
webhook_config?: WebhookConfig;
|
||||
}
|
||||
|
||||
async function createCrawlJob(request: CrawlJobRequest) {
|
||||
const response = await fetch('http://localhost:11235/crawl/job', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(request)
|
||||
});
|
||||
|
||||
const { task_id } = await response.json();
|
||||
return task_id;
|
||||
}
|
||||
|
||||
async function createLLMJob(request: LLMJobRequest) {
|
||||
const response = await fetch('http://localhost:11235/llm/job', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify(request)
|
||||
});
|
||||
|
||||
const { task_id } = await response.json();
|
||||
return task_id;
|
||||
}
|
||||
|
||||
// Usage - Crawl Job
|
||||
const crawlTaskId = await createCrawlJob({
|
||||
urls: ['https://example.com'],
|
||||
webhook_config: {
|
||||
webhook_url: 'https://myapp.com/webhooks/crawl-complete',
|
||||
webhook_data_in_payload: false,
|
||||
webhook_headers: {
|
||||
'X-Webhook-Secret': 'my-secret'
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
// Usage - LLM Extraction Job
|
||||
const llmTaskId = await createLLMJob({
|
||||
url: 'https://example.com/article',
|
||||
q: 'Extract the main points from this article',
|
||||
provider: 'openai/gpt-4o-mini',
|
||||
webhook_config: {
|
||||
webhook_url: 'https://myapp.com/webhooks/llm-complete',
|
||||
webhook_data_in_payload: true,
|
||||
webhook_headers: {
|
||||
'X-Webhook-Secret': 'my-secret'
|
||||
}
|
||||
}
|
||||
});
|
||||
```
|
||||
|
||||
## Monitoring and Debugging
|
||||
|
||||
Webhook delivery attempts are logged at INFO level:
|
||||
- Successful deliveries
|
||||
- Retry attempts with delays
|
||||
- Final failures after max attempts
|
||||
|
||||
Check the application logs for webhook delivery status:
|
||||
```bash
|
||||
docker logs crawl4ai-container | grep -i webhook
|
||||
```
|
||||
@@ -4,7 +4,7 @@ import asyncio
|
||||
from typing import List, Tuple, Dict
|
||||
from functools import partial
|
||||
from uuid import uuid4
|
||||
from datetime import datetime
|
||||
from datetime import datetime, timezone
|
||||
from base64 import b64encode
|
||||
|
||||
import logging
|
||||
@@ -42,9 +42,10 @@ from utils import (
|
||||
should_cleanup_task,
|
||||
decode_redis_hash,
|
||||
get_llm_api_key,
|
||||
validate_llm_provider
|
||||
validate_llm_provider,
|
||||
get_llm_temperature,
|
||||
get_llm_base_url
|
||||
)
|
||||
from webhook import WebhookDeliveryService
|
||||
|
||||
import psutil, time
|
||||
|
||||
@@ -97,7 +98,9 @@ async def handle_llm_qa(
|
||||
response = perform_completion_with_backoff(
|
||||
provider=config["llm"]["provider"],
|
||||
prompt_with_variables=prompt,
|
||||
api_token=get_llm_api_key(config)
|
||||
api_token=get_llm_api_key(config), # Returns None to let litellm handle it
|
||||
temperature=get_llm_temperature(config),
|
||||
base_url=get_llm_base_url(config)
|
||||
)
|
||||
|
||||
return response.choices[0].message.content
|
||||
@@ -117,12 +120,10 @@ async def process_llm_extraction(
|
||||
schema: Optional[str] = None,
|
||||
cache: str = "0",
|
||||
provider: Optional[str] = None,
|
||||
webhook_config: Optional[Dict] = None
|
||||
temperature: Optional[float] = None,
|
||||
base_url: Optional[str] = None
|
||||
) -> None:
|
||||
"""Process LLM extraction in background."""
|
||||
# Initialize webhook service
|
||||
webhook_service = WebhookDeliveryService(config)
|
||||
|
||||
try:
|
||||
# Validate provider
|
||||
is_valid, error_msg = validate_llm_provider(config, provider)
|
||||
@@ -131,22 +132,14 @@ async def process_llm_extraction(
|
||||
"status": TaskStatus.FAILED,
|
||||
"error": error_msg
|
||||
})
|
||||
|
||||
# Send webhook notification on failure
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="llm_extraction",
|
||||
status="failed",
|
||||
urls=[url],
|
||||
webhook_config=webhook_config,
|
||||
error=error_msg
|
||||
)
|
||||
return
|
||||
api_key = get_llm_api_key(config, provider)
|
||||
api_key = get_llm_api_key(config, provider) # Returns None to let litellm handle it
|
||||
llm_strategy = LLMExtractionStrategy(
|
||||
llm_config=LLMConfig(
|
||||
provider=provider or config["llm"]["provider"],
|
||||
api_token=api_key
|
||||
api_token=api_key,
|
||||
temperature=temperature or get_llm_temperature(config, provider),
|
||||
base_url=base_url or get_llm_base_url(config, provider)
|
||||
),
|
||||
instruction=instruction,
|
||||
schema=json.loads(schema) if schema else None,
|
||||
@@ -169,40 +162,17 @@ async def process_llm_extraction(
|
||||
"status": TaskStatus.FAILED,
|
||||
"error": result.error_message
|
||||
})
|
||||
|
||||
# Send webhook notification on failure
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="llm_extraction",
|
||||
status="failed",
|
||||
urls=[url],
|
||||
webhook_config=webhook_config,
|
||||
error=result.error_message
|
||||
)
|
||||
return
|
||||
|
||||
try:
|
||||
content = json.loads(result.extracted_content)
|
||||
except json.JSONDecodeError:
|
||||
content = result.extracted_content
|
||||
|
||||
result_data = {"extracted_content": content}
|
||||
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.COMPLETED,
|
||||
"result": json.dumps(content)
|
||||
})
|
||||
|
||||
# Send webhook notification on successful completion
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="llm_extraction",
|
||||
status="completed",
|
||||
urls=[url],
|
||||
webhook_config=webhook_config,
|
||||
result=result_data
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"LLM extraction error: {str(e)}", exc_info=True)
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
@@ -210,23 +180,15 @@ async def process_llm_extraction(
|
||||
"error": str(e)
|
||||
})
|
||||
|
||||
# Send webhook notification on failure
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="llm_extraction",
|
||||
status="failed",
|
||||
urls=[url],
|
||||
webhook_config=webhook_config,
|
||||
error=str(e)
|
||||
)
|
||||
|
||||
async def handle_markdown_request(
|
||||
url: str,
|
||||
filter_type: FilterType,
|
||||
query: Optional[str] = None,
|
||||
cache: str = "0",
|
||||
config: Optional[dict] = None,
|
||||
provider: Optional[str] = None
|
||||
provider: Optional[str] = None,
|
||||
temperature: Optional[float] = None,
|
||||
base_url: Optional[str] = None
|
||||
) -> str:
|
||||
"""Handle markdown generation requests."""
|
||||
try:
|
||||
@@ -251,7 +213,9 @@ async def handle_markdown_request(
|
||||
FilterType.LLM: LLMContentFilter(
|
||||
llm_config=LLMConfig(
|
||||
provider=provider or config["llm"]["provider"],
|
||||
api_token=get_llm_api_key(config, provider),
|
||||
api_token=get_llm_api_key(config, provider), # Returns None to let litellm handle it
|
||||
temperature=temperature or get_llm_temperature(config, provider),
|
||||
base_url=base_url or get_llm_base_url(config, provider)
|
||||
),
|
||||
instruction=query or "Extract main content"
|
||||
)
|
||||
@@ -297,7 +261,8 @@ async def handle_llm_request(
|
||||
cache: str = "0",
|
||||
config: Optional[dict] = None,
|
||||
provider: Optional[str] = None,
|
||||
webhook_config: Optional[Dict] = None,
|
||||
temperature: Optional[float] = None,
|
||||
api_base_url: Optional[str] = None
|
||||
) -> JSONResponse:
|
||||
"""Handle LLM extraction requests."""
|
||||
base_url = get_base_url(request)
|
||||
@@ -329,7 +294,8 @@ async def handle_llm_request(
|
||||
base_url,
|
||||
config,
|
||||
provider,
|
||||
webhook_config
|
||||
temperature,
|
||||
api_base_url
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
@@ -375,7 +341,8 @@ async def create_new_task(
|
||||
base_url: str,
|
||||
config: dict,
|
||||
provider: Optional[str] = None,
|
||||
webhook_config: Optional[Dict] = None
|
||||
temperature: Optional[float] = None,
|
||||
api_base_url: Optional[str] = None
|
||||
) -> JSONResponse:
|
||||
"""Create and initialize a new task."""
|
||||
decoded_url = unquote(input_path)
|
||||
@@ -384,18 +351,12 @@ async def create_new_task(
|
||||
|
||||
from datetime import datetime
|
||||
task_id = f"llm_{int(datetime.now().timestamp())}_{id(background_tasks)}"
|
||||
|
||||
task_data = {
|
||||
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.PROCESSING,
|
||||
"created_at": datetime.now().isoformat(),
|
||||
"url": decoded_url
|
||||
}
|
||||
|
||||
# Store webhook config if provided
|
||||
if webhook_config:
|
||||
task_data["webhook_config"] = json.dumps(webhook_config)
|
||||
|
||||
await redis.hset(f"task:{task_id}", mapping=task_data)
|
||||
})
|
||||
|
||||
background_tasks.add_task(
|
||||
process_llm_extraction,
|
||||
@@ -407,7 +368,8 @@ async def create_new_task(
|
||||
schema,
|
||||
cache,
|
||||
provider,
|
||||
webhook_config
|
||||
temperature,
|
||||
api_base_url
|
||||
)
|
||||
|
||||
return JSONResponse({
|
||||
@@ -451,6 +413,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
|
||||
server_memory_mb = _get_memory_mb()
|
||||
result_dict = result.model_dump()
|
||||
result_dict['server_memory_mb'] = server_memory_mb
|
||||
# Ensure fit_html is JSON-serializable
|
||||
if "fit_html" in result_dict and not (result_dict["fit_html"] is None or isinstance(result_dict["fit_html"], str)):
|
||||
result_dict["fit_html"] = None
|
||||
# If PDF exists, encode it to base64
|
||||
if result_dict.get('pdf') is not None:
|
||||
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
|
||||
@@ -531,6 +496,9 @@ async def handle_crawl_request(
|
||||
processed_results = []
|
||||
for result in results:
|
||||
result_dict = result.model_dump()
|
||||
# if fit_html is not a string, set it to None to avoid serialization errors
|
||||
if "fit_html" in result_dict and not (result_dict["fit_html"] is None or isinstance(result_dict["fit_html"], str)):
|
||||
result_dict["fit_html"] = None
|
||||
# If PDF exists, encode it to base64
|
||||
if result_dict.get('pdf') is not None:
|
||||
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
|
||||
@@ -625,7 +593,6 @@ async def handle_crawl_job(
|
||||
browser_config: Dict,
|
||||
crawler_config: Dict,
|
||||
config: Dict,
|
||||
webhook_config: Optional[Dict] = None,
|
||||
) -> Dict:
|
||||
"""
|
||||
Fire-and-forget version of handle_crawl_request.
|
||||
@@ -633,24 +600,13 @@ async def handle_crawl_job(
|
||||
lets /crawl/job/{task_id} polling fetch the result.
|
||||
"""
|
||||
task_id = f"crawl_{uuid4().hex[:8]}"
|
||||
|
||||
# Store task data in Redis
|
||||
task_data = {
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.PROCESSING, # <-- keep enum values consistent
|
||||
"created_at": datetime.utcnow().isoformat(),
|
||||
"created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
|
||||
"url": json.dumps(urls), # store list as JSON string
|
||||
"result": "",
|
||||
"error": "",
|
||||
}
|
||||
|
||||
# Store webhook config if provided
|
||||
if webhook_config:
|
||||
task_data["webhook_config"] = json.dumps(webhook_config)
|
||||
|
||||
await redis.hset(f"task:{task_id}", mapping=task_data)
|
||||
|
||||
# Initialize webhook service
|
||||
webhook_service = WebhookDeliveryService(config)
|
||||
})
|
||||
|
||||
async def _runner():
|
||||
try:
|
||||
@@ -664,17 +620,6 @@ async def handle_crawl_job(
|
||||
"status": TaskStatus.COMPLETED,
|
||||
"result": json.dumps(result),
|
||||
})
|
||||
|
||||
# Send webhook notification on successful completion
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="crawl",
|
||||
status="completed",
|
||||
urls=urls,
|
||||
webhook_config=webhook_config,
|
||||
result=result
|
||||
)
|
||||
|
||||
await asyncio.sleep(5) # Give Redis time to process the update
|
||||
except Exception as exc:
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
@@ -682,15 +627,5 @@ async def handle_crawl_job(
|
||||
"error": str(exc),
|
||||
})
|
||||
|
||||
# Send webhook notification on failure
|
||||
await webhook_service.notify_job_completion(
|
||||
task_id=task_id,
|
||||
task_type="crawl",
|
||||
status="failed",
|
||||
urls=urls,
|
||||
webhook_config=webhook_config,
|
||||
error=str(exc)
|
||||
)
|
||||
|
||||
background_tasks.add_task(_runner)
|
||||
return {"task_id": task_id}
|
||||
@@ -28,25 +28,43 @@ def create_access_token(data: dict, expires_delta: Optional[timedelta] = None) -
|
||||
signing_key = get_jwk_from_secret(SECRET_KEY)
|
||||
return instance.encode(to_encode, signing_key, alg='HS256')
|
||||
|
||||
def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)) -> Dict:
|
||||
def verify_token(credentials: HTTPAuthorizationCredentials) -> Dict:
|
||||
"""Verify the JWT token from the Authorization header."""
|
||||
|
||||
if credentials is None:
|
||||
return None
|
||||
|
||||
if not credentials or not credentials.credentials:
|
||||
raise HTTPException(
|
||||
status_code=401,
|
||||
detail="No token provided",
|
||||
headers={"WWW-Authenticate": "Bearer"}
|
||||
)
|
||||
|
||||
token = credentials.credentials
|
||||
verifying_key = get_jwk_from_secret(SECRET_KEY)
|
||||
try:
|
||||
payload = instance.decode(token, verifying_key, do_time_check=True, algorithms='HS256')
|
||||
return payload
|
||||
except Exception:
|
||||
raise HTTPException(status_code=401, detail="Invalid or expired token")
|
||||
except Exception as e:
|
||||
raise HTTPException(
|
||||
status_code=401,
|
||||
detail=f"Invalid or expired token: {str(e)}",
|
||||
headers={"WWW-Authenticate": "Bearer"}
|
||||
)
|
||||
|
||||
|
||||
def get_token_dependency(config: Dict):
|
||||
"""Return the token dependency if JWT is enabled, else a function that returns None."""
|
||||
|
||||
|
||||
if config.get("security", {}).get("jwt_enabled", False):
|
||||
return verify_token
|
||||
def jwt_required(credentials: HTTPAuthorizationCredentials = Depends(security)) -> Dict:
|
||||
"""Enforce JWT authentication when enabled."""
|
||||
if credentials is None:
|
||||
raise HTTPException(
|
||||
status_code=401,
|
||||
detail="Authentication required. Please provide a valid Bearer token.",
|
||||
headers={"WWW-Authenticate": "Bearer"}
|
||||
)
|
||||
return verify_token(credentials)
|
||||
return jwt_required
|
||||
else:
|
||||
return lambda: None
|
||||
|
||||
|
||||
@@ -2241,7 +2241,7 @@ docker build -t crawl4ai
|
||||
|
||||
| Argument | Description | Default | Options |
|
||||
|----------|-------------|---------|----------|
|
||||
| PYTHON_VERSION | Python version | 3.10 | 3.8, 3.9, 3.10 |
|
||||
| PYTHON_VERSION | Python version | 3.10 | 3.10, 3.11, 3.12, 3.13 |
|
||||
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
|
||||
| ENABLE_GPU | GPU support | false | true, false |
|
||||
| APP_HOME | Install path | /app | any valid path |
|
||||
|
||||
@@ -11,8 +11,7 @@ app:
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini"
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
# api_key: sk-... # If you pass the API key directly (not recommended)
|
||||
|
||||
# Redis Configuration
|
||||
redis:
|
||||
@@ -39,8 +38,8 @@ rate_limiting:
|
||||
|
||||
# Security Configuration
|
||||
security:
|
||||
enabled: false
|
||||
jwt_enabled: false
|
||||
enabled: false
|
||||
jwt_enabled: false
|
||||
https_redirect: false
|
||||
trusted_hosts: ["*"]
|
||||
headers:
|
||||
@@ -88,17 +87,4 @@ observability:
|
||||
enabled: True
|
||||
endpoint: "/metrics"
|
||||
health_check:
|
||||
endpoint: "/health"
|
||||
|
||||
# Webhook Configuration
|
||||
webhooks:
|
||||
enabled: true
|
||||
default_url: null # Optional: default webhook URL for all jobs
|
||||
data_in_payload: false # Optional: default behavior for including data
|
||||
retry:
|
||||
max_attempts: 5
|
||||
initial_delay_ms: 1000 # 1s, 2s, 4s, 8s, 16s exponential backoff
|
||||
max_delay_ms: 32000
|
||||
timeout_ms: 30000 # 30s timeout per webhook call
|
||||
headers: # Optional: default headers to include
|
||||
User-Agent: "Crawl4AI-Webhook/1.0"
|
||||
endpoint: "/health"
|
||||
@@ -12,7 +12,6 @@ from api import (
|
||||
handle_crawl_job,
|
||||
handle_task_status,
|
||||
)
|
||||
from schemas import WebhookConfig
|
||||
|
||||
# ------------- dependency placeholders -------------
|
||||
_redis = None # will be injected from server.py
|
||||
@@ -38,14 +37,14 @@ class LlmJobPayload(BaseModel):
|
||||
schema: Optional[str] = None
|
||||
cache: bool = False
|
||||
provider: Optional[str] = None
|
||||
webhook_config: Optional[WebhookConfig] = None
|
||||
temperature: Optional[float] = None
|
||||
base_url: Optional[str] = None
|
||||
|
||||
|
||||
class CrawlJobPayload(BaseModel):
|
||||
urls: list[HttpUrl]
|
||||
browser_config: Dict = {}
|
||||
crawler_config: Dict = {}
|
||||
webhook_config: Optional[WebhookConfig] = None
|
||||
|
||||
|
||||
# ---------- LLM job ---------------------------------------------------------
|
||||
@@ -56,10 +55,6 @@ async def llm_job_enqueue(
|
||||
request: Request,
|
||||
_td: Dict = Depends(lambda: _token_dep()), # late-bound dep
|
||||
):
|
||||
webhook_config = None
|
||||
if payload.webhook_config:
|
||||
webhook_config = payload.webhook_config.model_dump(mode='json')
|
||||
|
||||
return await handle_llm_request(
|
||||
_redis,
|
||||
background_tasks,
|
||||
@@ -70,7 +65,8 @@ async def llm_job_enqueue(
|
||||
cache=payload.cache,
|
||||
config=_config,
|
||||
provider=payload.provider,
|
||||
webhook_config=webhook_config,
|
||||
temperature=payload.temperature,
|
||||
api_base_url=payload.base_url,
|
||||
)
|
||||
|
||||
|
||||
@@ -90,10 +86,6 @@ async def crawl_job_enqueue(
|
||||
background_tasks: BackgroundTasks,
|
||||
_td: Dict = Depends(lambda: _token_dep()),
|
||||
):
|
||||
webhook_config = None
|
||||
if payload.webhook_config:
|
||||
webhook_config = payload.webhook_config.model_dump(mode='json')
|
||||
|
||||
return await handle_crawl_job(
|
||||
_redis,
|
||||
background_tasks,
|
||||
@@ -101,7 +93,6 @@ async def crawl_job_enqueue(
|
||||
payload.browser_config,
|
||||
payload.crawler_config,
|
||||
config=_config,
|
||||
webhook_config=webhook_config,
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -12,6 +12,6 @@ pydantic>=2.11
|
||||
rank-bm25==0.2.2
|
||||
anyio==4.9.0
|
||||
PyJWT==2.10.1
|
||||
mcp>=1.18.0
|
||||
mcp>=1.6.0
|
||||
websockets>=15.0.1
|
||||
httpx[http2]>=0.27.2
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
from typing import List, Optional, Dict
|
||||
from enum import Enum
|
||||
from pydantic import BaseModel, Field, HttpUrl
|
||||
from pydantic import BaseModel, Field
|
||||
from utils import FilterType
|
||||
|
||||
|
||||
@@ -16,6 +16,8 @@ class MarkdownRequest(BaseModel):
|
||||
q: Optional[str] = Field(None, description="Query string used by BM25/LLM filters")
|
||||
c: Optional[str] = Field("0", description="Cache‑bust / revision counter")
|
||||
provider: Optional[str] = Field(None, description="LLM provider override (e.g., 'anthropic/claude-3-opus')")
|
||||
temperature: Optional[float] = Field(None, description="LLM temperature override (0.0-2.0)")
|
||||
base_url: Optional[str] = Field(None, description="LLM API base URL override")
|
||||
|
||||
|
||||
class RawCode(BaseModel):
|
||||
@@ -39,22 +41,4 @@ class JSEndpointRequest(BaseModel):
|
||||
scripts: List[str] = Field(
|
||||
...,
|
||||
description="List of separated JavaScript snippets to execute"
|
||||
)
|
||||
|
||||
|
||||
class WebhookConfig(BaseModel):
|
||||
"""Configuration for webhook notifications."""
|
||||
webhook_url: HttpUrl
|
||||
webhook_data_in_payload: bool = False
|
||||
webhook_headers: Optional[Dict[str, str]] = None
|
||||
|
||||
|
||||
class WebhookPayload(BaseModel):
|
||||
"""Payload sent to webhook endpoints."""
|
||||
task_id: str
|
||||
task_type: str # "crawl", "llm_extraction", etc.
|
||||
status: str # "completed" or "failed"
|
||||
timestamp: str # ISO 8601 format
|
||||
urls: List[str]
|
||||
error: Optional[str] = None
|
||||
data: Optional[Dict] = None # Included only if webhook_data_in_payload=True
|
||||
)
|
||||
@@ -241,7 +241,8 @@ async def get_markdown(
|
||||
raise HTTPException(
|
||||
400, "Invalid URL format. Must start with http://, https://, or for raw HTML (raw:, raw://)")
|
||||
markdown = await handle_markdown_request(
|
||||
body.url, body.f, body.q, body.c, config, body.provider
|
||||
body.url, body.f, body.q, body.c, config, body.provider,
|
||||
body.temperature, body.base_url
|
||||
)
|
||||
return JSONResponse({
|
||||
"url": body.url,
|
||||
@@ -266,12 +267,26 @@ async def generate_html(
|
||||
Use when you need sanitized HTML structures for building schemas or further processing.
|
||||
"""
|
||||
cfg = CrawlerRunConfig()
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
raw_html = results[0].html
|
||||
from crawl4ai.utils import preprocess_html_for_schema
|
||||
processed_html = preprocess_html_for_schema(raw_html)
|
||||
return JSONResponse({"html": processed_html, "url": body.url, "success": True})
|
||||
try:
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
# Check if the crawl was successful
|
||||
if not results[0].success:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=results[0].error_message or "Crawl failed"
|
||||
)
|
||||
|
||||
raw_html = results[0].html
|
||||
from crawl4ai.utils import preprocess_html_for_schema
|
||||
processed_html = preprocess_html_for_schema(raw_html)
|
||||
return JSONResponse({"html": processed_html, "url": body.url, "success": True})
|
||||
except Exception as e:
|
||||
# Log and raise as HTTP 500 for other exceptions
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=str(e)
|
||||
)
|
||||
|
||||
# Screenshot endpoint
|
||||
|
||||
@@ -289,18 +304,29 @@ async def generate_screenshot(
|
||||
Use when you need an image snapshot of the rendered page. It is recommended to provide an output path to save the screenshot.
|
||||
Then in result instead of the screenshot you will get a path to the saved file.
|
||||
"""
|
||||
cfg = CrawlerRunConfig(
|
||||
screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
screenshot_data = results[0].screenshot
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(base64.b64decode(screenshot_data))
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "screenshot": screenshot_data}
|
||||
try:
|
||||
cfg = CrawlerRunConfig(
|
||||
screenshot=True, screenshot_wait_for=body.screenshot_wait_for)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
if not results[0].success:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=results[0].error_message or "Crawl failed"
|
||||
)
|
||||
screenshot_data = results[0].screenshot
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(base64.b64decode(screenshot_data))
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "screenshot": screenshot_data}
|
||||
except Exception as e:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=str(e)
|
||||
)
|
||||
|
||||
# PDF endpoint
|
||||
|
||||
@@ -318,17 +344,28 @@ async def generate_pdf(
|
||||
Use when you need a printable or archivable snapshot of the page. It is recommended to provide an output path to save the PDF.
|
||||
Then in result instead of the PDF you will get a path to the saved file.
|
||||
"""
|
||||
cfg = CrawlerRunConfig(pdf=True)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
pdf_data = results[0].pdf
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(pdf_data)
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "pdf": base64.b64encode(pdf_data).decode()}
|
||||
try:
|
||||
cfg = CrawlerRunConfig(pdf=True)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
if not results[0].success:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=results[0].error_message or "Crawl failed"
|
||||
)
|
||||
pdf_data = results[0].pdf
|
||||
if body.output_path:
|
||||
abs_path = os.path.abspath(body.output_path)
|
||||
os.makedirs(os.path.dirname(abs_path), exist_ok=True)
|
||||
with open(abs_path, "wb") as f:
|
||||
f.write(pdf_data)
|
||||
return {"success": True, "path": abs_path}
|
||||
return {"success": True, "pdf": base64.b64encode(pdf_data).decode()}
|
||||
except Exception as e:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=str(e)
|
||||
)
|
||||
|
||||
|
||||
@app.post("/execute_js")
|
||||
@@ -384,12 +421,23 @@ async def execute_js(
|
||||
```
|
||||
|
||||
"""
|
||||
cfg = CrawlerRunConfig(js_code=body.scripts)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
# Return JSON-serializable dict of the first CrawlResult
|
||||
data = results[0].model_dump()
|
||||
return JSONResponse(data)
|
||||
try:
|
||||
cfg = CrawlerRunConfig(js_code=body.scripts)
|
||||
async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
|
||||
results = await crawler.arun(url=body.url, config=cfg)
|
||||
if not results[0].success:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=results[0].error_message or "Crawl failed"
|
||||
)
|
||||
# Return JSON-serializable dict of the first CrawlResult
|
||||
data = results[0].model_dump()
|
||||
return JSONResponse(data)
|
||||
except Exception as e:
|
||||
raise HTTPException(
|
||||
status_code=500,
|
||||
detail=str(e)
|
||||
)
|
||||
|
||||
|
||||
@app.get("/llm/{url:path}")
|
||||
@@ -437,13 +485,16 @@ async def crawl(
|
||||
"""
|
||||
if not crawl_request.urls:
|
||||
raise HTTPException(400, "At least one URL required")
|
||||
res = await handle_crawl_request(
|
||||
results = await handle_crawl_request(
|
||||
urls=crawl_request.urls,
|
||||
browser_config=crawl_request.browser_config,
|
||||
crawler_config=crawl_request.crawler_config,
|
||||
config=config,
|
||||
)
|
||||
return JSONResponse(res)
|
||||
# check if all of the results are not successful
|
||||
if all(not result["success"] for result in results["results"]):
|
||||
raise HTTPException(500, f"Crawl request failed: {results['results'][0]['error_message']}")
|
||||
return JSONResponse(results)
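The check added to `crawl` raises only when every result failed; a batch with at least one success is still returned as-is. A minimal sketch of that rule on plain dictionaries, assuming the `{"success": ..., "error_message": ...}` result shape visible in the hunk. The helper name `all_failed` is hypothetical.

```python
from typing import Any, Dict, List


def all_failed(results: List[Dict[str, Any]]) -> bool:
    """Return True only when every result reports success=False."""
    # all() over an empty list is True, so an empty batch counts as "all failed".
    return all(not r["success"] for r in results)


mixed = [
    {"success": True, "error_message": None},
    {"success": False, "error_message": "timeout"},
]
failed = [{"success": False, "error_message": "timeout"}]

print(all_failed(mixed))   # False
print(all_failed(failed))  # True
```

Because `all()` is true for an empty sequence, code that then indexes the first result (as the error message in the hunk does) would raise `IndexError` if the results list were ever empty; the earlier "At least one URL required" guard is what makes that unlikely here.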
|
||||
|
||||
|
||||
@app.post("/crawl/stream")
|
||||
|
||||
@@ -71,7 +71,7 @@ def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:
|
||||
|
||||
|
||||
|
||||
def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> str:
|
||||
def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> Optional[str]:
|
||||
"""Get the appropriate API key based on the LLM provider.
|
||||
|
||||
Args:
|
||||
@@ -79,19 +79,14 @@ def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> str:
|
||||
provider: Optional provider override (e.g., "openai/gpt-4")
|
||||
|
||||
Returns:
|
||||
The API key for the provider, or empty string if not found
|
||||
The API key if directly configured, otherwise None to let litellm handle it
|
||||
"""
|
||||
|
||||
# Use provided provider or fall back to config
|
||||
if not provider:
|
||||
provider = config["llm"]["provider"]
|
||||
|
||||
# Check if direct API key is configured
|
||||
# Check if direct API key is configured (for backward compatibility)
|
||||
if "api_key" in config["llm"]:
|
||||
return config["llm"]["api_key"]
|
||||
|
||||
# Fall back to the configured api_key_env if no match
|
||||
return os.environ.get(config["llm"].get("api_key_env", ""), "")
|
||||
# Return None - litellm will automatically find the right environment variable
|
||||
return None
|
||||
|
||||
|
||||
def validate_llm_provider(config: Dict, provider: Optional[str] = None) -> tuple[bool, str]:
|
||||
@@ -104,19 +99,78 @@ def validate_llm_provider(config: Dict, provider: Optional[str] = None) -> tuple
|
||||
Returns:
|
||||
Tuple of (is_valid, error_message)
|
||||
"""
|
||||
# Use provided provider or fall back to config
|
||||
if not provider:
|
||||
provider = config["llm"]["provider"]
|
||||
|
||||
# Get the API key for this provider
|
||||
api_key = get_llm_api_key(config, provider)
|
||||
|
||||
if not api_key:
|
||||
return False, f"No API key found for provider '{provider}'. Please set the appropriate environment variable."
|
||||
# If a direct API key is configured, validation passes
|
||||
if "api_key" in config["llm"]:
|
||||
return True, ""
|
||||
|
||||
# Otherwise, trust that litellm will find the appropriate environment variable
|
||||
# We can't easily validate this without reimplementing litellm's logic
|
||||
return True, ""
|
||||
|
||||
|
||||
def get_llm_temperature(config: Dict, provider: Optional[str] = None) -> Optional[float]:
|
||||
"""Get temperature setting based on the LLM provider.
|
||||
|
||||
Priority order:
|
||||
1. Provider-specific environment variable (e.g., OPENAI_TEMPERATURE)
|
||||
2. Global LLM_TEMPERATURE environment variable
|
||||
3. None (to use litellm/provider defaults)
|
||||
|
||||
Args:
|
||||
config: The application configuration dictionary
|
||||
provider: Optional provider override (e.g., "openai/gpt-4")
|
||||
|
||||
Returns:
|
||||
The temperature setting if configured, otherwise None
|
||||
"""
|
||||
# Check provider-specific temperature first
|
||||
if provider:
|
||||
provider_name = provider.split('/')[0].upper()
|
||||
provider_temp = os.environ.get(f"{provider_name}_TEMPERATURE")
|
||||
if provider_temp:
|
||||
try:
|
||||
return float(provider_temp)
|
||||
except ValueError:
|
||||
logging.warning(f"Invalid temperature value for {provider_name}: {provider_temp}")
|
||||
|
||||
# Check global LLM_TEMPERATURE
|
||||
global_temp = os.environ.get("LLM_TEMPERATURE")
|
||||
if global_temp:
|
||||
try:
|
||||
return float(global_temp)
|
||||
except ValueError:
|
||||
logging.warning(f"Invalid global temperature value: {global_temp}")
|
||||
|
||||
# Return None to use litellm/provider defaults
|
||||
return None
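The priority order in the docstring of `get_llm_temperature` can be exercised without touching the real environment. The sketch below restates the same lookup against an explicit mapping; `resolve_temperature` and its `env` parameter are hypothetical names introduced for illustration, while the variable naming scheme (`<PROVIDER>_TEMPERATURE`, then `LLM_TEMPERATURE`) is taken from the function above.

```python
import logging
from typing import Mapping, Optional


def resolve_temperature(env: Mapping[str, str], provider: Optional[str] = None) -> Optional[float]:
    """Provider-specific variable first, then LLM_TEMPERATURE, else None."""
    candidates = []
    if provider:
        # "openai/gpt-4" -> "OPENAI_TEMPERATURE"
        candidates.append(f"{provider.split('/')[0].upper()}_TEMPERATURE")
    candidates.append("LLM_TEMPERATURE")

    for name in candidates:
        raw = env.get(name)
        if not raw:
            continue
        try:
            return float(raw)
        except ValueError:
            # An unparsable value is logged and the next candidate is tried
            logging.warning("Invalid temperature value in %s: %r", name, raw)
    return None


env = {"OPENAI_TEMPERATURE": "0.2", "LLM_TEMPERATURE": "0.7"}
print(resolve_temperature(env, "openai/gpt-4"))      # 0.2
print(resolve_temperature(env, "anthropic/claude"))  # 0.7
print(resolve_temperature({}, "openai/gpt-4"))       # None
```

As in the original function, an invalid provider-specific value does not abort the lookup; it falls through to the global variable.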
|
||||
|
||||
|
||||
def get_llm_base_url(config: Dict, provider: Optional[str] = None) -> Optional[str]:
|
||||
"""Get base URL setting based on the LLM provider.
|
||||
|
||||
Priority order:
|
||||
1. Provider-specific environment variable (e.g., OPENAI_BASE_URL)
|
||||
2. Global LLM_BASE_URL environment variable
|
||||
3. None (to use default endpoints)
|
||||
|
||||
Args:
|
||||
config: The application configuration dictionary
|
||||
provider: Optional provider override (e.g., "openai/gpt-4")
|
||||
|
||||
Returns:
|
||||
The base URL if configured, otherwise None
|
||||
"""
|
||||
# Check provider-specific base URL first
|
||||
if provider:
|
||||
provider_name = provider.split('/')[0].upper()
|
||||
provider_url = os.environ.get(f"{provider_name}_BASE_URL")
|
||||
if provider_url:
|
||||
return provider_url
|
||||
|
||||
# Check global LLM_BASE_URL
|
||||
return os.environ.get("LLM_BASE_URL")
|
||||
|
||||
|
||||
def verify_email_domain(email: str) -> bool:
|
||||
try:
|
||||
domain = email.split('@')[1]
|
||||
|
||||
@@ -1,159 +0,0 @@
|
||||
"""
|
||||
Webhook delivery service for Crawl4AI.
|
||||
|
||||
This module provides webhook notification functionality with exponential backoff retry logic.
|
||||
"""
|
||||
import asyncio
|
||||
import httpx
|
||||
import logging
|
||||
from typing import Dict, Optional
|
||||
from datetime import datetime, timezone
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class WebhookDeliveryService:
|
||||
"""Handles webhook delivery with exponential backoff retry logic."""
|
||||
|
||||
def __init__(self, config: Dict):
|
||||
"""
|
||||
Initialize the webhook delivery service.
|
||||
|
||||
Args:
|
||||
config: Application configuration dictionary containing webhook settings
|
||||
"""
|
||||
self.config = config.get("webhooks", {})
|
||||
self.max_attempts = self.config.get("retry", {}).get("max_attempts", 5)
|
||||
self.initial_delay = self.config.get("retry", {}).get("initial_delay_ms", 1000) / 1000
|
||||
self.max_delay = self.config.get("retry", {}).get("max_delay_ms", 32000) / 1000
|
||||
self.timeout = self.config.get("retry", {}).get("timeout_ms", 30000) / 1000
|
||||
|
||||
async def send_webhook(
|
||||
self,
|
||||
webhook_url: str,
|
||||
payload: Dict,
|
||||
headers: Optional[Dict[str, str]] = None
|
||||
) -> bool:
|
||||
"""
|
||||
Send webhook with exponential backoff retry logic.
|
||||
|
||||
Args:
|
||||
webhook_url: The URL to send the webhook to
|
||||
payload: The JSON payload to send
|
||||
headers: Optional custom headers
|
||||
|
||||
Returns:
|
||||
bool: True if delivered successfully, False otherwise
|
||||
"""
|
||||
default_headers = self.config.get("headers", {})
|
||||
merged_headers = {**default_headers, **(headers or {})}
|
||||
merged_headers["Content-Type"] = "application/json"
|
||||
|
||||
async with httpx.AsyncClient(timeout=self.timeout) as client:
|
||||
for attempt in range(self.max_attempts):
|
||||
try:
|
||||
logger.info(
|
||||
f"Sending webhook (attempt {attempt + 1}/{self.max_attempts}) to {webhook_url}"
|
||||
)
|
||||
|
||||
response = await client.post(
|
||||
webhook_url,
|
||||
json=payload,
|
||||
headers=merged_headers
|
||||
)
|
||||
|
||||
# Success or client error (don't retry client errors)
|
||||
if response.status_code < 500:
|
||||
if 200 <= response.status_code < 300:
|
||||
logger.info(f"Webhook delivered successfully to {webhook_url}")
|
||||
return True
|
||||
else:
|
||||
logger.warning(
|
||||
f"Webhook rejected with status {response.status_code}: {response.text[:200]}"
|
||||
)
|
||||
return False # Client error - don't retry
|
||||
|
||||
# Server error - retry with backoff
|
||||
logger.warning(
|
||||
f"Webhook failed with status {response.status_code}, will retry"
|
||||
)
|
||||
|
||||
except httpx.TimeoutException as exc:
|
||||
logger.error(f"Webhook timeout (attempt {attempt + 1}): {exc}")
|
||||
except httpx.RequestError as exc:
|
||||
logger.error(f"Webhook request error (attempt {attempt + 1}): {exc}")
|
||||
except Exception as exc:
|
||||
logger.error(f"Webhook delivery error (attempt {attempt + 1}): {exc}")
|
||||
|
||||
# Calculate exponential backoff delay
|
||||
if attempt < self.max_attempts - 1:
|
||||
delay = min(self.initial_delay * (2 ** attempt), self.max_delay)
|
||||
logger.info(f"Retrying in {delay}s...")
|
||||
await asyncio.sleep(delay)
|
||||
|
||||
logger.error(
|
||||
f"Webhook delivery failed after {self.max_attempts} attempts to {webhook_url}"
|
||||
)
|
||||
return False
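The retry schedule used by `send_webhook` follows from the formula `min(initial_delay * 2 ** attempt, max_delay)`, with no sleep after the final attempt. A small sketch that lists the resulting delays, using the defaults from `__init__` (5 attempts, 1 s initial delay, 32 s cap); the function name `backoff_delays` is hypothetical.

```python
from typing import List


def backoff_delays(max_attempts: int = 5, initial_delay: float = 1.0, max_delay: float = 32.0) -> List[float]:
    """Seconds slept between attempts; the last attempt is not followed by a sleep."""
    return [
        min(initial_delay * (2 ** attempt), max_delay)
        for attempt in range(max_attempts - 1)
    ]


print(backoff_delays())                # [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(max_attempts=8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 32.0]
```

With the defaults, the cap is never reached: a delivery that fails every time spends 15 seconds sleeping in total, plus the per-request timeouts.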
|
||||
|
||||
async def notify_job_completion(
|
||||
self,
|
||||
task_id: str,
|
||||
task_type: str,
|
||||
status: str,
|
||||
urls: list,
|
||||
webhook_config: Optional[Dict],
|
||||
result: Optional[Dict] = None,
|
||||
error: Optional[str] = None
|
||||
):
|
||||
"""
|
||||
Notify webhook of job completion.
|
||||
|
||||
Args:
|
||||
task_id: The task identifier
|
||||
task_type: Type of task (e.g., "crawl", "llm_extraction")
|
||||
status: Task status ("completed" or "failed")
|
||||
urls: List of URLs that were crawled
|
||||
webhook_config: Webhook configuration from the job request
|
||||
result: Optional crawl result data
|
||||
error: Optional error message if failed
|
||||
"""
|
||||
# Determine webhook URL
|
||||
webhook_url = None
|
||||
data_in_payload = self.config.get("data_in_payload", False)
|
||||
custom_headers = None
|
||||
|
||||
if webhook_config:
|
||||
webhook_url = webhook_config.get("webhook_url")
|
||||
data_in_payload = webhook_config.get("webhook_data_in_payload", data_in_payload)
|
||||
custom_headers = webhook_config.get("webhook_headers")
|
||||
|
||||
if not webhook_url:
|
||||
webhook_url = self.config.get("default_url")
|
||||
|
||||
if not webhook_url:
|
||||
logger.debug("No webhook URL configured, skipping notification")
|
||||
return
|
||||
|
||||
# Check if webhooks are enabled
|
||||
if not self.config.get("enabled", True):
|
||||
logger.debug("Webhooks are disabled, skipping notification")
|
||||
return
|
||||
|
||||
# Build payload
|
||||
payload = {
|
||||
"task_id": task_id,
|
||||
"task_type": task_type,
|
||||
"status": status,
|
||||
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||
"urls": urls
|
||||
}
|
||||
|
||||
if error:
|
||||
payload["error"] = error
|
||||
|
||||
if data_in_payload and result:
|
||||
payload["data"] = result
|
||||
|
||||
# Send webhook (awaited here, so this call returns only after delivery succeeds or retries are exhausted)
|
||||
await self.send_webhook(webhook_url, payload, custom_headers)
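The payload assembled in `notify_job_completion` has five fixed fields plus two optional ones. The sketch below isolates that logic as a pure function so the conditions are easy to see; `build_payload` is a hypothetical helper, and the field names are the ones used in the method above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_payload(
    task_id: str,
    task_type: str,
    status: str,
    urls: List[str],
    result: Optional[Dict[str, Any]] = None,
    error: Optional[str] = None,
    data_in_payload: bool = False,
) -> Dict[str, Any]:
    """Build the notification body: fixed fields, then optional error and data."""
    payload: Dict[str, Any] = {
        "task_id": task_id,
        "task_type": task_type,
        "status": status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "urls": urls,
    }
    if error:
        payload["error"] = error
    # Results are attached only when requested AND a non-empty result exists
    if data_in_payload and result:
        payload["data"] = result
    return payload


notify_only = build_payload("crawl_abc123", "crawl", "completed", ["https://example.com"],
                            result={"results": []})
print(sorted(notify_only))  # ['status', 'task_id', 'task_type', 'timestamp', 'urls']
```

A receiver should therefore treat both `error` and `data` as optional keys and fall back to fetching the job result when `data` is absent.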
|
||||
@@ -1,461 +0,0 @@
|
||||
"""
|
||||
Docker Webhook Example for Crawl4AI
|
||||
|
||||
This example demonstrates how to use webhooks with the Crawl4AI job queue API.
|
||||
Instead of polling for results, webhooks notify your application when jobs complete.
|
||||
|
||||
Supports both:
|
||||
- /crawl/job - Raw crawling with markdown extraction
|
||||
- /llm/job - LLM-powered content extraction
|
||||
|
||||
Prerequisites:
|
||||
1. Crawl4AI Docker container running on localhost:11234
|
||||
2. Flask installed: pip install flask requests
|
||||
3. LLM API key configured in .llm.env (for LLM extraction examples)
|
||||
|
||||
Usage:
|
||||
1. Run this script: python docker_webhook_example.py
|
||||
2. The webhook server will start on http://localhost:8080
|
||||
3. Jobs will be submitted and webhooks will be received automatically
|
||||
"""
|
||||
|
||||
import requests
|
||||
import json
|
||||
import time
|
||||
from flask import Flask, request, jsonify
|
||||
from threading import Thread
|
||||
|
||||
# Configuration
|
||||
CRAWL4AI_BASE_URL = "http://localhost:11234"
|
||||
WEBHOOK_BASE_URL = "http://localhost:8080" # Your webhook receiver URL
|
||||
|
||||
# Initialize Flask app for webhook receiver
|
||||
app = Flask(__name__)
|
||||
|
||||
# Store received webhook data for demonstration
|
||||
received_webhooks = []
|
||||
|
||||
|
||||
@app.route('/webhooks/crawl-complete', methods=['POST'])
|
||||
def handle_crawl_webhook():
|
||||
"""
|
||||
Webhook handler that receives notifications when crawl jobs complete.
|
||||
|
||||
Payload structure:
|
||||
{
|
||||
"task_id": "crawl_abc123",
|
||||
"task_type": "crawl",
|
||||
"status": "completed" or "failed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com"],
|
||||
"error": "error message" (only if failed),
|
||||
"data": {...} (only if webhook_data_in_payload=True)
|
||||
}
|
||||
"""
|
||||
payload = request.json
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📬 Webhook received for task: {payload['task_id']}")
|
||||
print(f" Status: {payload['status']}")
|
||||
print(f" Timestamp: {payload['timestamp']}")
|
||||
print(f" URLs: {payload['urls']}")
|
||||
|
||||
if payload['status'] == 'completed':
|
||||
# If data is in payload, process it directly
|
||||
if 'data' in payload:
|
||||
print(f" ✅ Data included in webhook")
|
||||
data = payload['data']
|
||||
# Process the crawl results here
|
||||
for result in data.get('results', []):
|
||||
print(f" - Crawled: {result.get('url')}")
|
||||
print(f" - Markdown length: {len(result.get('markdown', ''))}")
|
||||
else:
|
||||
# Fetch results from API if not included
|
||||
print(f" 📥 Fetching results from API...")
|
||||
task_id = payload['task_id']
|
||||
result_response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
|
||||
if result_response.ok:
|
||||
data = result_response.json()
|
||||
print(f" ✅ Results fetched successfully")
|
||||
# Process the crawl results here
|
||||
for result in data['result'].get('results', []):
|
||||
print(f" - Crawled: {result.get('url')}")
|
||||
print(f" - Markdown length: {len(result.get('markdown', ''))}")
|
||||
|
||||
elif payload['status'] == 'failed':
|
||||
print(f" ❌ Job failed: {payload.get('error', 'Unknown error')}")
|
||||
|
||||
print(f"{'='*60}\n")
|
||||
|
||||
# Store webhook for demonstration
|
||||
received_webhooks.append(payload)
|
||||
|
||||
# Return 200 OK to acknowledge receipt
|
||||
return jsonify({"status": "received"}), 200
|
||||
|
||||
|
||||
@app.route('/webhooks/llm-complete', methods=['POST'])
|
||||
def handle_llm_webhook():
|
||||
"""
|
||||
Webhook handler that receives notifications when LLM extraction jobs complete.
|
||||
|
||||
Payload structure:
|
||||
{
|
||||
"task_id": "llm_1698765432_12345",
|
||||
"task_type": "llm_extraction",
|
||||
"status": "completed" or "failed",
|
||||
"timestamp": "2025-10-21T10:30:00.000000+00:00",
|
||||
"urls": ["https://example.com/article"],
|
||||
"error": "error message" (only if failed),
|
||||
"data": {"extracted_content": {...}} (only if webhook_data_in_payload=True)
|
||||
}
|
||||
"""
|
||||
payload = request.json
|
||||
print(f"\n{'='*60}")
|
||||
print(f"🤖 LLM Webhook received for task: {payload['task_id']}")
|
||||
print(f" Task Type: {payload['task_type']}")
|
||||
print(f" Status: {payload['status']}")
|
||||
print(f" Timestamp: {payload['timestamp']}")
|
||||
print(f" URL: {payload['urls'][0]}")
|
||||
|
||||
if payload['status'] == 'completed':
|
||||
# If data is in payload, process it directly
|
||||
if 'data' in payload:
|
||||
print(f" ✅ Data included in webhook")
|
||||
data = payload['data']
|
||||
# Webhook wraps extracted content in 'extracted_content' field
|
||||
extracted = data.get('extracted_content', {})
|
||||
print(f" - Extracted content:")
|
||||
print(f" {json.dumps(extracted, indent=8)}")
|
||||
else:
|
||||
# Fetch results from API if not included
|
||||
print(f" 📥 Fetching results from API...")
|
||||
task_id = payload['task_id']
|
||||
result_response = requests.get(f"{CRAWL4AI_BASE_URL}/llm/job/{task_id}")
|
||||
if result_response.ok:
|
||||
data = result_response.json()
|
||||
print(f" ✅ Results fetched successfully")
|
||||
# API returns unwrapped content in 'result' field
|
||||
extracted = data['result']
|
||||
print(f" - Extracted content:")
|
||||
print(f" {json.dumps(extracted, indent=8)}")
|
||||
|
||||
elif payload['status'] == 'failed':
|
||||
print(f" ❌ Job failed: {payload.get('error', 'Unknown error')}")
|
||||
|
||||
print(f"{'='*60}\n")
|
||||
|
||||
# Store webhook for demonstration
|
||||
received_webhooks.append(payload)
|
||||
|
||||
# Return 200 OK to acknowledge receipt
|
||||
return jsonify({"status": "received"}), 200
|
||||
|
||||
|
||||
def start_webhook_server():
|
||||
"""Start the Flask webhook server in a separate thread"""
|
||||
app.run(host='0.0.0.0', port=8080, debug=False, use_reloader=False)
|
||||
|
||||
|
||||
def submit_crawl_job_with_webhook(urls, webhook_url, include_data=False):
|
||||
"""
|
||||
Submit a crawl job with webhook notification.
|
||||
|
||||
Args:
|
||||
urls: List of URLs to crawl
|
||||
webhook_url: URL to receive webhook notifications
|
||||
include_data: Whether to include full results in webhook payload
|
||||
|
||||
Returns:
|
||||
task_id: The job's task identifier
|
||||
"""
|
||||
payload = {
|
||||
"urls": urls,
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {"cache_mode": "bypass"},
|
||||
"webhook_config": {
|
||||
"webhook_url": webhook_url,
|
||||
"webhook_data_in_payload": include_data,
|
||||
# Optional: Add custom headers for authentication
|
||||
# "webhook_headers": {
|
||||
# "X-Webhook-Secret": "your-secret-token"
|
||||
# }
|
||||
}
|
||||
}
|
||||
|
||||
print(f"\n🚀 Submitting crawl job...")
|
||||
print(f" URLs: {urls}")
|
||||
print(f" Webhook: {webhook_url}")
|
||||
print(f" Include data: {include_data}")
|
||||
|
||||
response = requests.post(
|
||||
f"{CRAWL4AI_BASE_URL}/crawl/job",
|
||||
json=payload,
|
||||
headers={"Content-Type": "application/json"}
|
||||
)
|
||||
|
||||
if response.ok:
|
||||
data = response.json()
|
||||
task_id = data['task_id']
|
||||
print(f" ✅ Job submitted successfully")
|
||||
print(f" Task ID: {task_id}")
|
||||
return task_id
|
||||
else:
|
||||
print(f" ❌ Failed to submit job: {response.text}")
|
||||
return None
|
||||
|
||||
|
||||
def submit_llm_job_with_webhook(url, query, webhook_url, include_data=False, schema=None, provider=None):
|
||||
"""
|
||||
Submit an LLM extraction job with webhook notification.
|
||||
|
||||
Args:
|
||||
url: URL to extract content from
|
||||
query: Instruction for the LLM (e.g., "Extract article title and author")
|
||||
webhook_url: URL to receive webhook notifications
|
||||
include_data: Whether to include full results in webhook payload
|
||||
schema: Optional JSON schema for structured extraction
|
||||
provider: Optional LLM provider (e.g., "openai/gpt-4o-mini")
|
||||
|
||||
Returns:
|
||||
task_id: The job's task identifier
|
||||
"""
|
||||
payload = {
|
||||
"url": url,
|
||||
"q": query,
|
||||
"cache": False,
|
||||
"webhook_config": {
|
||||
"webhook_url": webhook_url,
|
||||
"webhook_data_in_payload": include_data,
|
||||
# Optional: Add custom headers for authentication
|
||||
# "webhook_headers": {
|
||||
# "X-Webhook-Secret": "your-secret-token"
|
||||
# }
|
||||
}
|
||||
}
|
||||
|
||||
if schema:
|
||||
payload["schema"] = schema
|
||||
|
||||
if provider:
|
||||
payload["provider"] = provider
|
||||
|
||||
print(f"\n🤖 Submitting LLM extraction job...")
|
||||
print(f" URL: {url}")
|
||||
print(f" Query: {query}")
|
||||
print(f" Webhook: {webhook_url}")
|
||||
print(f" Include data: {include_data}")
|
||||
if provider:
|
||||
print(f" Provider: {provider}")
|
||||
|
||||
response = requests.post(
|
||||
f"{CRAWL4AI_BASE_URL}/llm/job",
|
||||
json=payload,
|
||||
headers={"Content-Type": "application/json"}
|
||||
)
|
||||
|
||||
if response.ok:
|
||||
data = response.json()
|
||||
task_id = data['task_id']
|
||||
print(f" ✅ Job submitted successfully")
|
||||
print(f" Task ID: {task_id}")
|
||||
return task_id
|
||||
else:
|
||||
print(f" ❌ Failed to submit job: {response.text}")
|
||||
return None
|
||||
|
||||
|
||||
def submit_job_without_webhook(urls):
|
||||
"""
|
||||
Submit a job without webhook (traditional polling approach).
|
||||
|
||||
Args:
|
||||
urls: List of URLs to crawl
|
||||
|
||||
Returns:
|
||||
task_id: The job's task identifier
|
||||
"""
|
||||
payload = {
|
||||
"urls": urls,
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {"cache_mode": "bypass"}
|
||||
}
|
||||
|
||||
print(f"\n🚀 Submitting crawl job (without webhook)...")
|
||||
print(f" URLs: {urls}")
|
||||
|
||||
response = requests.post(
|
||||
f"{CRAWL4AI_BASE_URL}/crawl/job",
|
||||
json=payload
|
||||
)
|
||||
|
||||
if response.ok:
|
||||
data = response.json()
|
||||
task_id = data['task_id']
|
||||
print(f" ✅ Job submitted successfully")
|
||||
print(f" Task ID: {task_id}")
|
||||
return task_id
|
||||
else:
|
||||
print(f" ❌ Failed to submit job: {response.text}")
|
||||
return None
|
||||
|
||||
|
||||
def poll_job_status(task_id, timeout=60):
|
||||
"""
|
||||
Poll for job status (used when webhook is not configured).
|
||||
|
||||
Args:
|
||||
task_id: The job's task identifier
|
||||
timeout: Maximum time to wait in seconds
|
||||
"""
|
||||
print(f"\n⏳ Polling for job status...")
|
||||
start_time = time.time()
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
|
||||
|
||||
if response.ok:
|
||||
data = response.json()
|
||||
status = data.get('status', 'unknown')
|
||||
|
||||
if status == 'completed':
|
||||
print(f" ✅ Job completed!")
|
||||
return data
|
||||
elif status == 'failed':
|
||||
print(f" ❌ Job failed: {data.get('error', 'Unknown error')}")
|
||||
return data
|
||||
else:
|
||||
print(f" ⏳ Status: {status}, waiting...")
|
||||
time.sleep(2)
|
||||
else:
|
||||
print(f" ❌ Failed to get status: {response.text}")
|
||||
return None
|
||||
|
||||
print(f" ⏰ Timeout reached")
|
||||
return None
|
||||
|
||||
|
||||
def main():
|
||||
"""Run the webhook demonstration"""
|
||||
|
||||
# Check if Crawl4AI is running
|
||||
try:
|
||||
health = requests.get(f"{CRAWL4AI_BASE_URL}/health", timeout=5)
|
||||
print(f"✅ Crawl4AI is running: {health.json()}")
|
||||
except requests.RequestException:
|
||||
print(f"❌ Cannot connect to Crawl4AI at {CRAWL4AI_BASE_URL}")
|
||||
print(" Please make sure Docker container is running:")
|
||||
print(" docker run -d -p 11234:11234 --name crawl4ai unclecode/crawl4ai:latest")
|
||||
return
|
||||
|
||||
# Start webhook server in background thread
|
||||
print(f"\n🌐 Starting webhook server at {WEBHOOK_BASE_URL}...")
|
||||
webhook_thread = Thread(target=start_webhook_server, daemon=True)
|
||||
webhook_thread.start()
|
||||
time.sleep(2) # Give server time to start
|
||||
|
||||
# Example 1: Job with webhook (notification only, fetch data separately)
|
||||
print(f"\n{'='*60}")
|
||||
print("Example 1: Webhook Notification Only")
|
||||
print(f"{'='*60}")
|
||||
task_id_1 = submit_crawl_job_with_webhook(
|
||||
urls=["https://example.com"],
|
||||
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/crawl-complete",
|
||||
include_data=False
|
||||
)
|
||||
|
||||
# Example 2: Job with webhook (data included in payload)
|
||||
time.sleep(5) # Wait a bit between requests
|
||||
print(f"\n{'='*60}")
|
||||
print("Example 2: Webhook with Full Data")
|
||||
print(f"{'='*60}")
|
||||
task_id_2 = submit_crawl_job_with_webhook(
|
||||
urls=["https://www.python.org"],
|
||||
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/crawl-complete",
|
||||
include_data=True
|
||||
)
|
||||
|
||||
# Example 3: LLM extraction with webhook (notification only)
|
||||
time.sleep(5) # Wait a bit between requests
|
||||
print(f"\n{'='*60}")
|
||||
print("Example 3: LLM Extraction with Webhook (Notification Only)")
|
||||
print(f"{'='*60}")
|
||||
task_id_3 = submit_llm_job_with_webhook(
|
||||
url="https://www.example.com",
|
||||
query="Extract the main heading and description from this page.",
|
||||
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/llm-complete",
|
||||
include_data=False,
|
||||
provider="openai/gpt-4o-mini"
|
||||
)
|
||||
|
||||
# Example 4: LLM extraction with webhook (data included + schema)
|
||||
time.sleep(5) # Wait a bit between requests
|
||||
print(f"\n{'='*60}")
|
||||
print("Example 4: LLM Extraction with Schema and Full Data")
|
||||
print(f"{'='*60}")
|
||||
|
||||
# Define a schema for structured extraction
|
||||
schema = json.dumps({
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"title": {"type": "string", "description": "Page title"},
|
||||
"description": {"type": "string", "description": "Page description"}
|
||||
},
|
||||
"required": ["title"]
|
||||
})
|
||||
|
||||
task_id_4 = submit_llm_job_with_webhook(
|
||||
url="https://www.python.org",
|
||||
query="Extract the title and description of this website",
|
||||
webhook_url=f"{WEBHOOK_BASE_URL}/webhooks/llm-complete",
|
||||
include_data=True,
|
||||
schema=schema,
|
||||
provider="openai/gpt-4o-mini"
|
||||
)
|
||||
|
||||
# Example 5: Traditional polling (no webhook)
|
||||
time.sleep(5) # Wait a bit between requests
|
||||
print(f"\n{'='*60}")
|
||||
print("Example 5: Traditional Polling (No Webhook)")
|
||||
print(f"{'='*60}")
|
||||
task_id_5 = submit_job_without_webhook(
|
||||
urls=["https://github.com"]
|
||||
)
|
||||
if task_id_5:
|
||||
result = poll_job_status(task_id_5)
|
||||
if result and result.get('status') == 'completed':
|
||||
print(f" ✅ Results retrieved via polling")
|
||||
|
||||
# Wait for webhooks to arrive
|
||||
print(f"\n⏳ Waiting for webhooks to be received...")
|
||||
time.sleep(30) # Give jobs time to complete and webhooks to arrive (longer for LLM)
|
||||
|
||||
# Summary
|
||||
print(f"\n{'='*60}")
|
||||
print("Summary")
|
||||
print(f"{'='*60}")
|
||||
print(f"Total webhooks received: {len(received_webhooks)}")
|
||||
|
||||
crawl_webhooks = [w for w in received_webhooks if w['task_type'] == 'crawl']
|
||||
llm_webhooks = [w for w in received_webhooks if w['task_type'] == 'llm_extraction']
|
||||
|
||||
print(f"\n📊 Breakdown:")
|
||||
print(f" - Crawl webhooks: {len(crawl_webhooks)}")
|
||||
print(f" - LLM extraction webhooks: {len(llm_webhooks)}")
|
||||
|
||||
print(f"\n📋 Details:")
|
||||
for i, webhook in enumerate(received_webhooks, 1):
|
||||
task_type = webhook['task_type']
|
||||
icon = "🕷️" if task_type == "crawl" else "🤖"
|
||||
print(f"{i}. {icon} Task {webhook['task_id']}: {webhook['status']} ({task_type})")
|
||||
|
||||
print(f"\n✅ Demo completed!")
|
||||
print(f"\n💡 Pro tips:")
|
||||
print(f" - In production, your webhook URL should be publicly accessible")
|
||||
print(f" (e.g., https://myapp.com/webhooks) or use ngrok for testing")
|
||||
print(f" - Both /crawl/job and /llm/job support the same webhook configuration")
|
||||
print(f" - Use webhook_data_in_payload=true to get results directly in the webhook")
|
||||
print(f" - LLM jobs may take longer, adjust timeouts accordingly")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
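The example leaves the `webhook_headers` block commented out. If a shared secret were sent in an `X-Webhook-Secret` header as that comment suggests, the receiver could check it before trusting the payload. The sketch below is one possible check using only the standard library; `is_authorized` is a hypothetical helper, and nothing in this changeset states that the server signs or validates webhooks in any other way.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Compare the received secret to the expected one in constant time."""
    # Assumption: the sender echoes the configured secret in this header verbatim.
    # The lookup is case-sensitive on a plain dict; framework header objects
    # are usually case-insensitive.
    received = headers.get("X-Webhook-Secret", "")
    return hmac.compare_digest(received.encode(), expected_secret.encode())


print(is_authorized({"X-Webhook-Secret": "your-secret-token"}, "your-secret-token"))  # True
print(is_authorized({}, "your-secret-token"))                                         # False
```

In a Flask handler such as `handle_crawl_webhook`, this could be called with `request.headers` at the top of the function, returning a 401 response when it yields `False`.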
|
||||
221
docs/examples/website-to-api/.gitignore
vendored
Normal file
@@ -0,0 +1,221 @@
|
||||
# Byte-compiled / optimized / DLL files
|
||||
__pycache__/
|
||||
*.py[codz]
|
||||
*$py.class
|
||||
|
||||
# C extensions
|
||||
*.so
|
||||
|
||||
# Distribution / packaging
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
share/python-wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
MANIFEST
|
||||
|
||||
# PyInstaller
|
||||
# Usually these files are written by a python script from a template
|
||||
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
||||
*.manifest
|
||||
*.spec
|
||||
|
||||
# Installer logs
|
||||
pip-log.txt
|
||||
pip-delete-this-directory.txt
|
||||
|
||||
# Unit test / coverage reports
|
||||
htmlcov/
|
||||
.tox/
|
||||
.nox/
|
||||
.coverage
|
||||
.coverage.*
|
||||
.cache
|
||||
nosetests.xml
|
||||
coverage.xml
|
||||
*.cover
|
||||
*.py.cover
|
||||
.hypothesis/
|
||||
.pytest_cache/
|
||||
cover/
|
||||
|
||||
# Translations
|
||||
*.mo
|
||||
*.pot
|
||||
|
||||
# Django stuff:
|
||||
*.log
|
||||
local_settings.py
|
||||
db.sqlite3
|
||||
db.sqlite3-journal
|
||||
|
||||
# Flask stuff:
|
||||
instance/
|
||||
.webassets-cache
|
||||
|
||||
# Scrapy stuff:
|
||||
.scrapy
|
||||
|
||||
# Sphinx documentation
|
||||
docs/_build/
|
||||
|
||||
# PyBuilder
|
||||
.pybuilder/
|
||||
target/
|
||||
|
||||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# IPython
|
||||
profile_default/
|
||||
ipython_config.py
|
||||
|
||||
# pyenv
|
||||
# For a library or package, you might want to ignore these files since the code is
|
||||
# intended to run in multiple environments; otherwise, check them in:
|
||||
# .python-version
|
||||
|
||||
# pipenv
|
||||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
||||
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
||||
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
||||
# install all needed dependencies.
|
||||
#Pipfile.lock
|
||||
|
||||
# UV
|
||||
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
|
||||
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
||||
# commonly ignored for libraries.
|
||||
#uv.lock
|
||||
|
||||
# poetry
|
||||
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
|
||||
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
||||
# commonly ignored for libraries.
|
||||
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
|
||||
#poetry.lock
|
||||
#poetry.toml
|
||||
|
||||
# pdm
|
||||
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
|
||||
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
|
||||
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
|
||||
#pdm.lock
|
||||
#pdm.toml
|
||||
.pdm-python
|
||||
.pdm-build/
|
||||
|
||||
# pixi
|
||||
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
|
||||
#pixi.lock
|
||||
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
|
||||
# in the .venv directory. It is recommended not to include this directory in version control.
|
||||
.pixi
|
||||
|
||||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
|
||||
__pypackages__/
|
||||
|
||||
# Celery stuff
|
||||
celerybeat-schedule
|
||||
celerybeat.pid
|
||||
|
||||
# Redis
|
||||
*.rdb
|
||||
*.aof
|
||||
*.pid
|
||||
|
||||
# RabbitMQ
|
||||
mnesia/
|
||||
rabbitmq/
|
||||
rabbitmq-data/
|
||||
|
||||
# ActiveMQ
|
||||
activemq-data/
|
||||
|
||||
# SageMath parsed files
|
||||
*.sage.py
|
||||
|
||||
# Environments
|
||||
.env
|
||||
.envrc
|
||||
.venv
|
||||
env/
|
||||
venv/
|
||||
ENV/
|
||||
env.bak/
|
||||
venv.bak/
|
||||
|
||||
# Spyder project settings
|
||||
.spyderproject
|
||||
.spyproject
|
||||
|
||||
# Rope project settings
|
||||
.ropeproject
|
||||
|
||||
# mkdocs documentation
|
||||
/site
|
||||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
.dmypy.json
|
||||
dmypy.json
|
||||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
|
||||
# pytype static type analyzer
|
||||
.pytype/
|
||||
|
||||
# Cython debug symbols
|
||||
cython_debug/
|
||||
|
||||
# PyCharm
|
||||
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
|
||||
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
|
||||
# and can be added to the global gitignore or merged into this file. For a more nuclear
|
||||
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
|
||||
#.idea/
|
||||
|
||||
# Abstra
|
||||
# Abstra is an AI-powered process automation framework.
|
||||
# Ignore directories containing user credentials, local state, and settings.
|
||||
# Learn more at https://abstra.io/docs
|
||||
.abstra/
|
||||
|
||||
# Visual Studio Code
|
||||
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
|
||||
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
|
||||
# and can be added to the global gitignore or merged into this file. However, if you prefer,
|
||||
# you could uncomment the following to ignore the entire vscode folder
|
||||
# .vscode/
|
||||
|
||||
# Ruff stuff:
|
||||
.ruff_cache/
|
||||
|
||||
# PyPI configuration file
|
||||
.pypirc
|
||||
|
||||
# Marimo
|
||||
marimo/_static/
|
||||
marimo/_lsp/
|
||||
__marimo__/
|
||||
|
||||
# Streamlit
|
||||
.streamlit/secrets.toml
|
||||
|
||||
#directories
|
||||
models
|
||||
schemas
|
||||
saved_requests
|
||||
252
docs/examples/website-to-api/README.md
Normal file
@@ -0,0 +1,252 @@
|
||||
# Web Scraper API with Custom Model Support
|
||||
|
||||
A powerful web scraping API that converts any website into structured data using AI. Features a beautiful minimalist frontend interface and support for custom LLM models!
|
||||
|
||||
## Features
|
||||
|
||||
- **AI-Powered Scraping**: Provide a URL and plain English query to extract structured data
|
||||
- **Beautiful Frontend**: Modern minimalist black-and-white interface with smooth UX
|
||||
- **Custom Model Support**: Use any LLM provider (OpenAI, Gemini, Anthropic, etc.) with your own API keys
|
||||
- **Model Management**: Save, list, and manage multiple model configurations via web interface
|
||||
- **Dual Scraping Approaches**: Choose between Schema-based (faster) or LLM-based (more flexible) extraction
|
||||
- **API Request History**: Automatic saving and display of all API requests with cURL commands
|
||||
- **Schema Caching**: Intelligent caching of generated schemas for faster subsequent requests
|
||||
- **Duplicate Prevention**: Avoids saving duplicate requests (same URL + query)
|
||||
- **RESTful API**: Easy-to-use HTTP endpoints for all operations
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Install Dependencies
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### 2. Start the API Server
|
||||
|
||||
```bash
|
||||
python app.py
|
||||
```
|
||||
|
||||
The server will start on `http://localhost:8000` with a beautiful web interface!
|
||||
|
||||
### 3. Using the Web Interface
|
||||
|
||||
Once the server is running, open your browser and go to `http://localhost:8000` to access the modern web interface!
|
||||
|
||||
#### Pages:
|
||||
- **Scrape Data**: Enter URLs and queries to extract structured data
|
||||
- **Models**: Manage your AI model configurations (add, list, delete)
|
||||
- **API Requests**: View history of all scraping requests with cURL commands
|
||||
|
||||
#### Features:
|
||||
- **Minimalist Design**: Clean black-and-white theme inspired by modern web apps
|
||||
- **Real-time Results**: See extracted data in formatted JSON
|
||||
- **Copy to Clipboard**: Easy copying of results
|
||||
- **Toast Notifications**: User-friendly feedback
|
||||
- **Dual Scraping Modes**: Choose between Schema-based and LLM-based approaches
|
||||
|
||||
## Model Management
|
||||
|
||||
### Adding Models via Web Interface
|
||||
|
||||
1. Go to the **Models** page
|
||||
2. Enter your model details:
   - **Model Name**: A name for this configuration (e.g., `my-gemini`)
   - **Provider**: LLM provider (e.g., `gemini/gemini-2.5-flash`, `openai/gpt-4o`)
   - **API Token**: Your API key for the provider
|
||||
3. Click "Add Model"
|
||||
|
||||
### API Usage for Model Management
|
||||
|
||||
#### Save a Model Configuration
|
||||
|
||||
```bash
|
||||
curl -X POST "http://localhost:8000/models" \
|
||||
-H "Content-Type: application/json" \
|
||||
  -d '{
    "model_name": "my-gemini",
    "provider": "gemini/gemini-2.5-flash",
    "api_token": "your-api-key-here"
  }'
|
||||
```
|
||||
|
||||
#### List Saved Models
|
||||
|
||||
```bash
|
||||
curl -X GET "http://localhost:8000/models"
|
||||
```
|
||||
|
||||
#### Delete a Model Configuration
|
||||
|
||||
```bash
|
||||
curl -X DELETE "http://localhost:8000/models/my-gemini"
|
||||
```
|
||||
|
||||
## Scraping Approaches
|
||||
|
||||
### 1. Schema-based Scraping (Faster)
|
||||
- Generates CSS selectors for targeted extraction
|
||||
- Caches schemas for repeated requests
|
||||
- Faster execution for structured websites
|
||||
|
||||
### 2. LLM-based Scraping (More Flexible)
|
||||
- Direct LLM extraction without schema generation
|
||||
- More flexible for complex or dynamic content
|
||||
- Better for unstructured data extraction
|
||||
|
||||
## Supported LLM Providers
|
||||
|
||||
The API supports any LLM provider that crawl4ai supports, including:
|
||||
|
||||
- **Google Gemini**: `gemini/gemini-2.5-flash`, `gemini/gemini-pro`
|
||||
- **OpenAI**: `openai/gpt-4`, `openai/gpt-3.5-turbo`
|
||||
- **Anthropic**: `anthropic/claude-3-opus`, `anthropic/claude-3-sonnet`
|
||||
- **And more...**
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### Core Endpoints
|
||||
|
||||
- `POST /scrape` - Schema-based scraping
|
||||
- `POST /scrape-with-llm` - LLM-based scraping
|
||||
- `GET /schemas` - List cached schemas
|
||||
- `POST /clear-cache` - Clear schema cache
|
||||
- `GET /health` - Health check
|
||||
|
||||
### Model Management Endpoints
|
||||
|
||||
- `GET /models` - List saved model configurations
|
||||
- `POST /models` - Save a new model configuration
|
||||
- `DELETE /models/{model_name}` - Delete a model configuration
|
||||
|
||||
### API Request History
|
||||
|
||||
- `GET /saved-requests` - List all saved API requests
|
||||
- `DELETE /saved-requests/{request_id}` - Delete a saved request
|
||||
|
||||
## Request/Response Examples
|
||||
|
||||
### Scrape Request
|
||||
|
||||
```json
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"query": "Extract the product name, price, and description",
|
||||
"model_name": "my-custom-model"
|
||||
}
|
||||
```
|
||||
|
||||
### Scrape Response
|
||||
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"url": "https://example.com",
|
||||
"query": "Extract the product name, price, and description",
|
||||
"extracted_data": {
|
||||
"product_name": "Example Product",
|
||||
"price": "$99.99",
|
||||
"description": "This is an example product description"
|
||||
},
|
||||
"schema_used": { ... },
|
||||
"timestamp": "2024-01-01T12:00:00Z"
|
||||
}
|
||||
```
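
The request and response shapes above can be exercised from Python with only the standard library. This is a minimal sketch, not part of the example project: it assumes the server is running at `http://localhost:8000` as in Quick Start, and the helper names `build_scrape_request` and `scrape` are hypothetical.

```python
import json
import urllib.request

# Assumption: the default address printed by the startup script.
BASE_URL = "http://localhost:8000"


def build_scrape_request(url, query, model_name=None, use_llm=False):
    """Build a POST request for /scrape or /scrape-with-llm (hypothetical helper)."""
    endpoint = "/scrape-with-llm" if use_llm else "/scrape"
    payload = {"url": url, "query": query, "model_name": model_name}
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, query, model_name=None, use_llm=False, timeout=120):
    """Send the request and return the decoded JSON response body."""
    request = build_scrape_request(url, query, model_name, use_llm)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `scrape("https://example.com", "Extract the product name, price, and description", model_name="my-custom-model")` should return a dictionary shaped like the Scrape Response above. On failure, `urllib.request.urlopen` raises `urllib.error.HTTPError`.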
|
||||
|
||||
### Model Configuration Request
|
||||
|
||||
```json
|
||||
{
  "model_name": "my-gemini",
  "provider": "gemini/gemini-2.5-flash",
  "api_token": "your-api-key-here"
}
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
Run the test script to verify the model management functionality:
|
||||
|
||||
```bash
|
||||
python test_models.py
|
||||
```
|
||||
|
||||
## File Structure
|
||||
|
||||
```
|
||||
website-to-api/
├── app.py                 # Startup script (checks frontend files, starts the server)
├── api_server.py          # FastAPI server with all endpoints
|
||||
├── web_scraper_lib.py # Core scraping library
|
||||
├── test_models.py # Test script for model management
|
||||
├── requirements.txt # Dependencies
|
||||
├── static/ # Frontend files
|
||||
│ ├── index.html # Main HTML interface
|
||||
│ ├── styles.css # CSS styles (minimalist theme)
|
||||
│ └── script.js # JavaScript functionality
|
||||
├── schemas/ # Cached schemas
|
||||
├── models/ # Saved model configurations
|
||||
├── saved_requests/ # API request history
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Using the Library Directly
|
||||
|
||||
```python
|
||||
import asyncio

from web_scraper_lib import WebScraperAgent


async def main():
    # Initialize agent
    agent = WebScraperAgent()

    # Save a model configuration
    agent.save_model_config(
        model_name="my-model",
        provider="openai/gpt-4",
        api_token="your-api-key"
    )

    # Schema-based scraping
    result = await agent.scrape_data(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )
    print(result["extracted_data"])

    # LLM-based scraping
    result = await agent.scrape_data_with_llm(
        url="https://example.com",
        query="Extract product information",
        model_name="my-model"
    )
    print(result["extracted_data"])


# `await` is only valid inside a coroutine, so run the calls through asyncio.run()
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Schema Caching
|
||||
|
||||
The system automatically caches generated schemas based on URL and query combinations:
|
||||
|
||||
- **First request**: Generates schema using AI
|
||||
- **Subsequent requests**: Uses cached schema for faster extraction
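
One way to key such a cache is to hash the URL and query together and store each schema as a JSON file. The sketch below only illustrates the idea under that assumption. It is not the code in `web_scraper_lib.py`: the function names, the SHA-256 key, and the `generate` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def schema_cache_key(url, query):
    """Derive a stable file-name-safe key from the URL + query pair."""
    digest = hashlib.sha256(f"{url}\n{query}".encode("utf-8")).hexdigest()
    return digest[:16]


def load_or_generate_schema(url, query, generate, cache_dir="schemas"):
    """Return the cached schema if present; otherwise generate and cache it.

    `generate` is a hypothetical callable (url, query) -> dict standing in
    for the AI-driven schema generation step.
    """
    path = Path(cache_dir) / f"{schema_cache_key(url, query)}.json"
    if path.exists():
        # Subsequent requests: reuse the stored schema
        return json.loads(path.read_text(encoding="utf-8"))
    # First request: generate, then persist for next time
    schema = generate(url, query)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

Because the key depends on both values, changing either the URL or the query triggers a fresh generation.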
|
||||
|
||||
### API Request History
|
||||
|
||||
All API requests are automatically saved with:
|
||||
- Request details (URL, query, model used)
|
||||
- Response data
|
||||
- Timestamp
|
||||
- cURL command for re-execution
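
A cURL command can be rebuilt from the stored fields of a saved request. The following is a rough sketch that assumes each saved request is a dictionary with `endpoint`, `method`, `headers`, and `body` keys. The helper name `to_curl` and the default base URL are assumptions for illustration.

```python
import json
import shlex


def to_curl(saved_request, base_url="http://localhost:8000"):
    """Render a saved request as a shell-safe cURL command (hypothetical helper)."""
    parts = [
        "curl",
        "-X",
        saved_request["method"],
        shlex.quote(base_url + saved_request["endpoint"]),
    ]
    for name, value in saved_request.get("headers", {}).items():
        parts += ["-H", shlex.quote(f"{name}: {value}")]
    if saved_request.get("body") is not None:
        # shlex.quote protects quotes and spaces inside the JSON payload
        parts += ["-d", shlex.quote(json.dumps(saved_request["body"]))]
    return " ".join(parts)
```

Quoting every argument with `shlex.quote` keeps the command safe to paste into a POSIX shell even when the query contains quotes or spaces.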
|
||||
|
||||
### Duplicate Prevention
|
||||
|
||||
The system prevents saving duplicate requests:
|
||||
- Requests with the same endpoint, URL, and query are saved only once
|
||||
- Returns existing request ID for duplicates
|
||||
- Keeps the API request history clean
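
The duplicate check amounts to a linear scan over the saved requests. This sketch mirrors that idea with plain dictionaries. The function name `find_duplicate` is hypothetical, and the record layout (`id`, `endpoint`, `body`) is assumed to match the saved request files.

```python
from typing import Any, Dict, List, Optional


def find_duplicate(
    saved_requests: List[Dict[str, Any]],
    endpoint: str,
    url: str,
    query: str,
) -> Optional[str]:
    """Return the ID of an existing request with the same endpoint, URL, and query."""
    for saved in saved_requests:
        body = saved.get("body", {})
        if (
            saved.get("endpoint") == endpoint
            and body.get("url") == url
            and body.get("query") == query
        ):
            return saved.get("id")
    # No match: the caller should save a new request
    return None
```

A scan like this is linear in the number of saved requests, which is usually fine for a demo-sized history.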
|
||||
|
||||
## Error Handling
|
||||
|
||||
The API provides detailed error messages for common issues:
|
||||
|
||||
- Invalid URLs
|
||||
- Missing model configurations
|
||||
- API key errors
|
||||
- Network timeouts
|
||||
- Parsing errors
|
||||
363
docs/examples/website-to-api/api_server.py
Normal file
@@ -0,0 +1,363 @@
|
||||
from fastapi import FastAPI, HTTPException
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from fastapi.responses import FileResponse
|
||||
from pydantic import BaseModel, HttpUrl
|
||||
from typing import Dict, Any, Optional, Union, List
|
||||
import uvicorn
|
||||
import asyncio
|
||||
import os
|
||||
import json
|
||||
from datetime import datetime
|
||||
from web_scraper_lib import WebScraperAgent, scrape_website
|
||||
|
||||
app = FastAPI(
|
||||
title="Web Scraper API",
|
||||
description="Convert any website into a structured data API. Provide a URL and tell AI what data you need in plain English.",
|
||||
version="1.0.0"
|
||||
)
|
||||
|
||||
# Mount static files
|
||||
if os.path.exists("static"):
|
||||
app.mount("/static", StaticFiles(directory="static"), name="static")
|
||||
|
||||
# Mount assets directory
|
||||
if os.path.exists("assets"):
|
||||
app.mount("/assets", StaticFiles(directory="assets"), name="assets")
|
||||
|
||||
# Initialize the scraper agent
|
||||
scraper_agent = WebScraperAgent()
|
||||
|
||||
# Create directory for saved API requests
|
||||
os.makedirs("saved_requests", exist_ok=True)
|
||||
|
||||
class ScrapeRequest(BaseModel):
|
||||
url: HttpUrl
|
||||
query: str
|
||||
model_name: Optional[str] = None
|
||||
|
||||
class ModelConfigRequest(BaseModel):
|
||||
model_name: str
|
||||
provider: str
|
||||
api_token: str
|
||||
|
||||
class ScrapeResponse(BaseModel):
|
||||
success: bool
|
||||
url: str
|
||||
query: str
|
||||
extracted_data: Union[Dict[str, Any], list]
|
||||
schema_used: Optional[Dict[str, Any]] = None
|
||||
timestamp: Optional[str] = None
|
||||
error: Optional[str] = None
|
||||
|
||||
class SavedApiRequest(BaseModel):
|
||||
id: str
|
||||
endpoint: str
|
||||
method: str
|
||||
headers: Dict[str, str]
|
||||
body: Dict[str, Any]
|
||||
timestamp: str
|
||||
response: Optional[Dict[str, Any]] = None
|
||||
|
||||
def save_api_request(endpoint: str, method: str, headers: Dict[str, str], body: Dict[str, Any], response: Optional[Dict[str, Any]] = None) -> str:
|
||||
"""Save an API request to a JSON file."""
|
||||
|
||||
# Check for duplicate requests (same URL and query)
|
||||
if endpoint in ["/scrape", "/scrape-with-llm"] and "url" in body and "query" in body:
|
||||
existing_requests = get_saved_requests()
|
||||
for existing_request in existing_requests:
|
||||
if (existing_request.endpoint == endpoint and
|
||||
existing_request.body.get("url") == body["url"] and
|
||||
existing_request.body.get("query") == body["query"]):
|
||||
print(f"Duplicate request found for URL: {body['url']} and query: {body['query']}")
|
||||
return existing_request.id # Return existing request ID instead of creating new one
|
||||
|
||||
request_id = datetime.now().strftime("%Y%m%d_%H%M%S_%f")[:-3]
|
||||
|
||||
saved_request = SavedApiRequest(
|
||||
id=request_id,
|
||||
endpoint=endpoint,
|
||||
method=method,
|
||||
headers=headers,
|
||||
body=body,
|
||||
timestamp=datetime.now().isoformat(),
|
||||
response=response
|
||||
)
|
||||
|
||||
file_path = os.path.join("saved_requests", f"{request_id}.json")
|
||||
with open(file_path, "w") as f:
|
||||
json.dump(saved_request.dict(), f, indent=2)
|
||||
|
||||
return request_id
|
||||
|
||||
def get_saved_requests() -> List[SavedApiRequest]:
|
||||
"""Get all saved API requests."""
|
||||
requests = []
|
||||
if os.path.exists("saved_requests"):
|
||||
for filename in os.listdir("saved_requests"):
|
||||
if filename.endswith('.json'):
|
||||
file_path = os.path.join("saved_requests", filename)
|
||||
try:
|
||||
with open(file_path, "r") as f:
|
||||
data = json.load(f)
|
||||
requests.append(SavedApiRequest(**data))
|
||||
except Exception as e:
|
||||
print(f"Error loading saved request {filename}: {e}")
|
||||
|
||||
# Sort by timestamp (newest first)
|
||||
requests.sort(key=lambda x: x.timestamp, reverse=True)
|
||||
return requests
|
||||
|
||||
@app.get("/")
|
||||
async def root():
|
||||
"""Serve the frontend interface."""
|
||||
if os.path.exists("static/index.html"):
|
||||
return FileResponse("static/index.html")
|
||||
else:
|
||||
        return {
            "message": "Web Scraper API",
            "description": "Convert any website into structured data with AI",
            "endpoints": {
                "/scrape": "POST - Scrape data from a website",
                "/scrape-with-llm": "POST - Scrape data from a website using an LLM",
                "/schemas": "GET - List cached schemas",
                "/clear-cache": "POST - Clear schema cache",
                # A dict cannot hold two "/models" keys, so both methods share one entry
                "/models": "GET - List saved model configurations; POST - Save a new model configuration",
                "/models/{model_name}": "DELETE - Delete a model configuration",
                "/saved-requests": "GET - List saved API requests"
            }
        }
|
||||
|
||||
@app.post("/scrape", response_model=ScrapeResponse)
|
||||
async def scrape_website_endpoint(request: ScrapeRequest):
|
||||
"""
|
||||
Scrape structured data from any website.
|
||||
|
||||
This endpoint:
|
||||
1. Takes a URL and plain English query
|
||||
2. Generates a custom scraper using AI
|
||||
3. Returns structured data
|
||||
"""
|
||||
try:
|
||||
# Save the API request
|
||||
headers = {"Content-Type": "application/json"}
|
||||
body = {
|
||||
"url": str(request.url),
|
||||
"query": request.query,
|
||||
"model_name": request.model_name
|
||||
}
|
||||
|
||||
result = await scraper_agent.scrape_data(
|
||||
url=str(request.url),
|
||||
query=request.query,
|
||||
model_name=request.model_name
|
||||
)
|
||||
|
||||
response_data = ScrapeResponse(
|
||||
success=True,
|
||||
url=result["url"],
|
||||
query=result["query"],
|
||||
extracted_data=result["extracted_data"],
|
||||
schema_used=result["schema_used"],
|
||||
timestamp=result["timestamp"]
|
||||
)
|
||||
|
||||
# Save the request with response
|
||||
save_api_request(
|
||||
endpoint="/scrape",
|
||||
method="POST",
|
||||
headers=headers,
|
||||
body=body,
|
||||
response=response_data.dict()
|
||||
)
|
||||
|
||||
return response_data
|
||||
|
||||
except Exception as e:
|
||||
# Save the failed request
|
||||
headers = {"Content-Type": "application/json"}
|
||||
body = {
|
||||
"url": str(request.url),
|
||||
"query": request.query,
|
||||
"model_name": request.model_name
|
||||
}
|
||||
|
||||
save_api_request(
|
||||
endpoint="/scrape",
|
||||
method="POST",
|
||||
headers=headers,
|
||||
body=body,
|
||||
response={"error": str(e)}
|
||||
)
|
||||
|
||||
raise HTTPException(status_code=500, detail=f"Scraping failed: {str(e)}")
|
||||
|
||||
@app.post("/scrape-with-llm", response_model=ScrapeResponse)
|
||||
async def scrape_website_endpoint_with_llm(request: ScrapeRequest):
|
||||
"""
|
||||
Scrape structured data from any website using a custom LLM model.
|
||||
"""
|
||||
try:
|
||||
# Save the API request
|
||||
headers = {"Content-Type": "application/json"}
|
||||
body = {
|
||||
"url": str(request.url),
|
||||
"query": request.query,
|
||||
"model_name": request.model_name
|
||||
}
|
||||
|
||||
result = await scraper_agent.scrape_data_with_llm(
|
||||
url=str(request.url),
|
||||
query=request.query,
|
||||
model_name=request.model_name
|
||||
)
|
||||
|
||||
response_data = ScrapeResponse(
|
||||
success=True,
|
||||
url=result["url"],
|
||||
query=result["query"],
|
||||
extracted_data=result["extracted_data"],
|
||||
timestamp=result["timestamp"]
|
||||
)
|
||||
|
||||
# Save the request with response
|
||||
save_api_request(
|
||||
endpoint="/scrape-with-llm",
|
||||
method="POST",
|
||||
headers=headers,
|
||||
body=body,
|
||||
response=response_data.dict()
|
||||
)
|
||||
|
||||
return response_data
|
||||
|
||||
except Exception as e:
|
||||
# Save the failed request
|
||||
headers = {"Content-Type": "application/json"}
|
||||
body = {
|
||||
"url": str(request.url),
|
||||
"query": request.query,
|
||||
"model_name": request.model_name
|
||||
}
|
||||
|
||||
save_api_request(
|
||||
endpoint="/scrape-with-llm",
|
||||
method="POST",
|
||||
headers=headers,
|
||||
body=body,
|
||||
response={"error": str(e)}
|
||||
)
|
||||
|
||||
raise HTTPException(status_code=500, detail=f"Scraping failed: {str(e)}")
|
||||
|
||||
@app.get("/saved-requests")
|
||||
async def list_saved_requests():
|
||||
"""List all saved API requests."""
|
||||
try:
|
||||
requests = get_saved_requests()
|
||||
return {
|
||||
"success": True,
|
||||
"requests": [req.dict() for req in requests],
|
||||
"count": len(requests)
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to list saved requests: {str(e)}")
|
||||
|
||||
@app.delete("/saved-requests/{request_id}")
async def delete_saved_request(request_id: str):
    """Delete a saved API request."""
    try:
        file_path = os.path.join("saved_requests", f"{request_id}.json")
        if os.path.exists(file_path):
            os.remove(file_path)
            return {
                "success": True,
                "message": f"Saved request '{request_id}' deleted successfully"
            }
        else:
            raise HTTPException(status_code=404, detail=f"Saved request '{request_id}' not found")
    except HTTPException:
        # Re-raise as-is so the 404 above is not converted into a 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to delete saved request: {str(e)}")
|
||||
|
||||
@app.get("/schemas")
|
||||
async def list_cached_schemas():
|
||||
"""List all cached schemas."""
|
||||
try:
|
||||
schemas = await scraper_agent.get_cached_schemas()
|
||||
return {
|
||||
"success": True,
|
||||
"cached_schemas": schemas,
|
||||
"count": len(schemas)
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to list schemas: {str(e)}")
|
||||
|
||||
@app.post("/clear-cache")
|
||||
async def clear_schema_cache():
|
||||
"""Clear all cached schemas."""
|
||||
try:
|
||||
scraper_agent.clear_cache()
|
||||
return {
|
||||
"success": True,
|
||||
"message": "Schema cache cleared successfully"
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to clear cache: {str(e)}")
|
||||
|
||||
@app.get("/models")
|
||||
async def list_models():
|
||||
"""List all saved model configurations."""
|
||||
try:
|
||||
models = scraper_agent.list_saved_models()
|
||||
return {
|
||||
"success": True,
|
||||
"models": models,
|
||||
"count": len(models)
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to list models: {str(e)}")
|
||||
|
||||
@app.post("/models")
|
||||
async def save_model_config(request: ModelConfigRequest):
|
||||
"""Save a new model configuration."""
|
||||
try:
|
||||
success = scraper_agent.save_model_config(
|
||||
model_name=request.model_name,
|
||||
provider=request.provider,
|
||||
api_token=request.api_token
|
||||
)
|
||||
|
||||
if success:
|
||||
return {
|
||||
"success": True,
|
||||
"message": f"Model configuration '{request.model_name}' saved successfully"
|
||||
}
|
||||
else:
|
||||
raise HTTPException(status_code=500, detail="Failed to save model configuration")
|
||||
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to save model: {str(e)}")
|
||||
|
||||
@app.delete("/models/{model_name}")
async def delete_model_config(model_name: str):
    """Delete a model configuration."""
    try:
        success = scraper_agent.delete_model_config(model_name)

        if success:
            return {
                "success": True,
                "message": f"Model configuration '{model_name}' deleted successfully"
            }
        else:
            raise HTTPException(status_code=404, detail=f"Model configuration '{model_name}' not found")

    except HTTPException:
        # Re-raise as-is so the 404 above is not converted into a 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to delete model: {str(e)}")
|
||||
|
||||
@app.get("/health")
|
||||
async def health_check():
|
||||
"""Health check endpoint."""
|
||||
return {"status": "healthy", "service": "web-scraper-api"}
|
||||
|
||||
if __name__ == "__main__":
|
||||
uvicorn.run(app, host="0.0.0.0", port=8000)
|
||||
49
docs/examples/website-to-api/app.py
Normal file
@@ -0,0 +1,49 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Startup script for the Web Scraper API with frontend interface.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import uvicorn
|
||||
from pathlib import Path
|
||||
|
||||
def main():
|
||||
# Check if static directory exists
|
||||
static_dir = Path("static")
|
||||
if not static_dir.exists():
|
||||
print("❌ Static directory not found!")
|
||||
print("Please make sure the 'static' directory exists with the frontend files.")
|
||||
sys.exit(1)
|
||||
|
||||
# Check if required frontend files exist
|
||||
required_files = ["index.html", "styles.css", "script.js"]
|
||||
missing_files = []
|
||||
|
||||
for file in required_files:
|
||||
if not (static_dir / file).exists():
|
||||
missing_files.append(file)
|
||||
|
||||
if missing_files:
|
||||
print(f"❌ Missing frontend files: {', '.join(missing_files)}")
|
||||
print("Please make sure all frontend files are present in the static directory.")
|
||||
sys.exit(1)
|
||||
|
||||
print("🚀 Starting Web Scraper API with Frontend Interface")
|
||||
print("=" * 50)
|
||||
print("📁 Static files found and ready to serve")
|
||||
print("🌐 Frontend will be available at: http://localhost:8000")
|
||||
print("🔌 API endpoints available at: http://localhost:8000/docs")
|
||||
print("=" * 50)
|
||||
|
||||
# Start the server
|
||||
uvicorn.run(
|
||||
"api_server:app",
|
||||
host="0.0.0.0",
|
||||
port=8000,
|
||||
reload=True,
|
||||
log_level="info"
|
||||
)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
BIN
docs/examples/website-to-api/assets/crawl4ai_logo.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 5.8 KiB |
5
docs/examples/website-to-api/requirements.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
crawl4ai
|
||||
fastapi
|
||||
uvicorn
|
||||
pydantic
|
||||
litellm
|
||||
201
docs/examples/website-to-api/static/index.html
Normal file
@@ -0,0 +1,201 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Web2API Example</title>
|
||||
<link rel="stylesheet" href="/static/styles.css">
|
||||
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
|
||||
</head>
|
||||
<body>
|
||||
<!-- Header -->
|
||||
<header class="header">
|
||||
<div class="header-content">
|
||||
<div class="logo">
|
||||
<img src="/assets/crawl4ai_logo.jpg" alt="Crawl4AI Logo" class="logo-image">
|
||||
<span>Web2API Example</span>
|
||||
</div>
|
||||
<nav class="nav-links">
|
||||
<a href="#" class="nav-link active" data-page="scrape">Scrape</a>
|
||||
<a href="#" class="nav-link" data-page="models">Models</a>
|
||||
<a href="#" class="nav-link" data-page="requests">API Requests</a>
|
||||
</nav>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<!-- Main Content -->
|
||||
<main class="main-content">
|
||||
<!-- Scrape Page -->
|
||||
<div id="scrape-page" class="page active">
|
||||
<div class="hero-section">
|
||||
<h1 class="hero-title">Turn Any Website Into An API</h1>
|
||||
<p class="hero-subtitle">This example shows how to turn any website into an API using Crawl4AI.</p>
|
||||
</div>
|
||||
|
||||
<!-- Workflow Demonstration -->
|
||||
<div class="workflow-demo">
|
||||
<div class="workflow-step">
|
||||
<h3 class="step-title">1. Your Request</h3>
|
||||
<div class="request-box">
|
||||
<div class="input-group">
|
||||
<label>URL:</label>
|
||||
<input type="url" id="url" name="url" placeholder="https://example-bookstore.com/new-releases" required>
|
||||
</div>
|
||||
<div class="input-group">
|
||||
<label>QUERY:</label>
|
||||
<textarea id="query" name="query" placeholder="Extract all the book titles, their authors, and the biography of the author" required></textarea>
|
||||
</div>
|
||||
<div class="form-options">
|
||||
<div class="option-group">
|
||||
<label for="scraping-approach">Approach:</label>
|
||||
<select id="scraping-approach" name="scraping_approach">
|
||||
<option value="llm">LLM-based (More Flexible)</option>
|
||||
<option value="schema">Schema-based (Uses LLM once!)</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="option-group">
|
||||
<label for="model-select">Model:</label>
|
||||
<select id="model-select" name="model_name" required>
|
||||
<option value="">Select a Model</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
<button type="submit" id="extract-btn" class="extract-btn">
|
||||
<i class="fas fa-magic"></i>
|
||||
Extract Data
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="workflow-arrow">→</div>
|
||||
|
||||
<div class="workflow-step">
|
||||
<h3 class="step-title">2. Your Instant API & Data</h3>
|
||||
<div class="response-container">
|
||||
<div class="api-request-box">
|
||||
<label>API Request (cURL):</label>
|
||||
<pre id="curl-example">curl -X POST http://localhost:8000/scrape -H "Content-Type: application/json" -d '{"url": "...", "query": "..."}'
|
||||
|
||||
# Or for LLM-based approach:
|
||||
curl -X POST http://localhost:8000/scrape-with-llm -H "Content-Type: application/json" -d '{"url": "...", "query": "..."}'</pre>
|
||||
</div>
|
||||
<div class="json-response-box">
|
||||
<label>JSON Response:</label>
|
||||
<pre id="json-output">{
|
||||
"success": true,
|
||||
"extracted_data": [
|
||||
{
|
||||
"title": "Example Book",
|
||||
"author": "John Doe",
|
||||
"description": "A great book..."
|
||||
}
|
||||
]
|
||||
}</pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Results Section -->
|
||||
<div id="results-section" class="results-section" style="display: none;">
|
||||
<div class="results-header">
|
||||
<h2>Extracted Data</h2>
|
||||
<button id="copy-json" class="copy-btn">
|
||||
<i class="fas fa-copy"></i>
|
||||
Copy JSON
|
||||
</button>
|
||||
</div>
|
||||
<div class="results-content">
|
||||
<div class="result-info">
|
||||
<div class="info-item">
|
||||
<span class="label">URL:</span>
|
||||
<span id="result-url" class="value"></span>
|
||||
</div>
|
||||
<div class="info-item">
|
||||
<span class="label">Query:</span>
|
||||
<span id="result-query" class="value"></span>
|
||||
</div>
|
||||
<div class="info-item">
|
||||
<span class="label">Model Used:</span>
|
||||
<span id="result-model" class="value"></span>
|
||||
</div>
|
||||
</div>
|
||||
<div class="json-display">
|
||||
<pre id="actual-json-output"></pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Loading State -->
|
||||
<div id="loading" class="loading" style="display: none;">
|
||||
<div class="spinner"></div>
|
||||
<p>AI is analyzing the website and extracting data...</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Models Page -->
|
||||
<div id="models-page" class="page">
|
||||
<div class="models-header">
|
||||
<h1>Model Configuration</h1>
|
||||
<p>Add and manage your AI model configurations</p>
|
||||
</div>
|
||||
|
||||
<div class="models-container">
|
||||
<!-- Add New Model Form -->
|
||||
<div class="model-form-section">
|
||||
<h3>Add New Model</h3>
|
||||
<form id="model-form" class="model-form">
|
||||
<div class="form-row">
|
||||
<div class="input-group">
|
||||
<label for="model-name">Model Name:</label>
|
||||
<input type="text" id="model-name" name="model_name" placeholder="my-gemini" required>
|
||||
</div>
|
||||
<div class="input-group">
|
||||
<label for="provider">Provider:</label>
|
||||
<input type="text" id="provider" name="provider" placeholder="gemini/gemini-2.5-flash" required>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="input-group">
|
||||
<label for="api-token">API Token:</label>
|
||||
<input type="password" id="api-token" name="api_token" placeholder="Enter your API token" required>
|
||||
</div>
|
||||
|
||||
<button type="submit" class="save-btn">
|
||||
<i class="fas fa-save"></i>
|
||||
Save Model
|
||||
</button>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<!-- Saved Models List -->
|
||||
<div class="saved-models-section">
|
||||
<h3>Saved Models</h3>
|
||||
<div id="models-list" class="models-list">
|
||||
<!-- Models will be loaded here -->
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- API Requests Page -->
|
||||
<div id="requests-page" class="page">
|
||||
<div class="requests-header">
|
||||
<h1>Saved API Requests</h1>
|
||||
<p>View and manage your previous API requests</p>
|
||||
</div>
|
||||
|
||||
<div class="requests-container">
|
||||
<div class="requests-list" id="requests-list">
|
||||
<!-- Saved requests will be loaded here -->
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</main>
|
||||
|
||||
<!-- Toast Notifications -->
|
||||
<div id="toast-container" class="toast-container"></div>
|
||||
|
||||
<script src="/static/script.js"></script>
|
||||
</body>
|
||||
</html>
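The form above posts to one of two endpoints depending on the selected approach. As a rough sketch of that request shape, the helper below assembles the `fetch()` arguments without sending anything. `buildScrapeRequest` is a hypothetical name introduced here for illustration; the endpoint paths and body fields (`url`, `query`, `headless`, `model_name`) are taken from the example's own script, and the server may accept or require other fields.

```javascript
// Hypothetical helper (not part of the example's script.js): builds the
// arguments for fetch() for the two scrape endpoints used by this page.
const SKETCH_API_BASE_URL = 'http://localhost:8000'; // same base URL the page uses

function buildScrapeRequest({ url, query, modelName, approach = 'schema' }) {
    if (!url || !query) {
        throw new Error('Both url and query are required');
    }
    // 'llm' selects /scrape-with-llm; any other value falls back to /scrape,
    // mirroring the ternary in the page's click handler.
    const endpoint = approach === 'llm' ? '/scrape-with-llm' : '/scrape';
    return {
        endpoint: `${SKETCH_API_BASE_URL}${endpoint}`,
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            // JSON.stringify handles quoting, so user input cannot break the payload
            body: JSON.stringify({
                url,
                query,
                headless: true,
                model_name: modelName ?? null,
            }),
        },
    };
}
```

A caller would then run `const { endpoint, options } = buildScrapeRequest({...}); await fetch(endpoint, options);`. Keeping the request assembly in a pure function makes it possible to check the payload without a running server.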
|
||||
401
docs/examples/website-to-api/static/script.js
Normal file
@@ -0,0 +1,401 @@
|
||||
// API Configuration
|
||||
const API_BASE_URL = 'http://localhost:8000';
|
||||
|
||||
// DOM Elements
|
||||
const navLinks = document.querySelectorAll('.nav-link');
|
||||
const pages = document.querySelectorAll('.page');
|
||||
const scrapeForm = document.getElementById('scrape-form');
|
||||
const modelForm = document.getElementById('model-form');
|
||||
const modelSelect = document.getElementById('model-select');
|
||||
const modelsList = document.getElementById('models-list');
|
||||
const resultsSection = document.getElementById('results-section');
|
||||
const loadingSection = document.getElementById('loading');
|
||||
const copyJsonBtn = document.getElementById('copy-json');
|
||||
|
||||
// Navigation
|
||||
navLinks.forEach(link => {
|
||||
link.addEventListener('click', (e) => {
|
||||
e.preventDefault();
|
||||
const targetPage = link.dataset.page;
|
||||
|
||||
// Update active nav link
|
||||
navLinks.forEach(l => l.classList.remove('active'));
|
||||
link.classList.add('active');
|
||||
|
||||
// Show target page
|
||||
pages.forEach(page => page.classList.remove('active'));
|
||||
document.getElementById(`${targetPage}-page`).classList.add('active');
|
||||
|
||||
// Load data for the page
|
||||
if (targetPage === 'models') {
|
||||
loadModels();
|
||||
} else if (targetPage === 'requests') {
|
||||
loadSavedRequests();
|
||||
}
|
||||
});
|
||||
});
|
||||
|
||||
// Scrape Form Handler
|
||||
document.getElementById('extract-btn').addEventListener('click', async (e) => {
|
||||
e.preventDefault();
|
||||
|
||||
// Scroll to results section immediately when button is clicked
|
||||
document.getElementById('results-section').scrollIntoView({
|
||||
behavior: 'smooth',
|
||||
block: 'start'
|
||||
});
|
||||
|
||||
const url = document.getElementById('url').value;
|
||||
const query = document.getElementById('query').value;
|
||||
const headless = true; // Always use headless mode
|
||||
const model_name = document.getElementById('model-select').value || null;
|
||||
const scraping_approach = document.getElementById('scraping-approach').value;
|
||||
|
||||
if (!url || !query) {
|
||||
showToast('Please fill in both URL and query fields', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
if (!model_name) {
|
||||
showToast('Please select a model from the dropdown or add one from the Models page', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
const data = {
|
||||
url: url,
|
||||
query: query,
|
||||
headless: headless,
|
||||
model_name: model_name
|
||||
};
|
||||
|
||||
// Show loading state
|
||||
showLoading(true);
|
||||
hideResults();
|
||||
|
||||
try {
|
||||
// Choose endpoint based on scraping approach
|
||||
const endpoint = scraping_approach === 'llm' ? '/scrape-with-llm' : '/scrape';
|
||||
|
||||
const response = await fetch(`${API_BASE_URL}${endpoint}`, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify(data)
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
displayResults(result);
|
||||
showToast(`Data extracted successfully using ${scraping_approach === 'llm' ? 'LLM-based' : 'Schema-based'} approach!`, 'success');
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to extract data');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Scraping error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
} finally {
|
||||
showLoading(false);
|
||||
}
|
||||
});
|
||||
|
||||
// Model Form Handler
|
||||
modelForm.addEventListener('submit', async (e) => {
|
||||
e.preventDefault();
|
||||
|
||||
const formData = new FormData(modelForm);
|
||||
const data = {
|
||||
model_name: formData.get('model_name'),
|
||||
provider: formData.get('provider'),
|
||||
api_token: formData.get('api_token')
|
||||
};
|
||||
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/models`, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify(data)
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
showToast('Model saved successfully!', 'success');
|
||||
modelForm.reset();
|
||||
loadModels();
|
||||
loadModelSelect();
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to save model');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Model save error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
}
|
||||
});
|
||||
|
||||
// Copy JSON Button
|
||||
copyJsonBtn.addEventListener('click', () => {
|
||||
const actualJsonOutput = document.getElementById('actual-json-output');
|
||||
const textToCopy = actualJsonOutput.textContent;
|
||||
|
||||
navigator.clipboard.writeText(textToCopy).then(() => {
|
||||
showToast('JSON copied to clipboard!', 'success');
|
||||
}).catch(() => {
|
||||
showToast('Failed to copy JSON', 'error');
|
||||
});
|
||||
});
|
||||
|
||||
// Load Models
|
||||
async function loadModels() {
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/models`);
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
displayModels(result.models);
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to load models');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Load models error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// Display Models
|
||||
function displayModels(models) {
if (models.length === 0) {
modelsList.innerHTML = '<p style="text-align: center; color: #7f8c8d; padding: 2rem;">No models saved yet. Add your first model above!</p>';
return;
}

// Build each card from a static template, then fill in the model name as text
// and attach the click handler in code, so a name containing quotes or markup
// cannot break the HTML or the delete handler.
modelsList.replaceChildren(...models.map(model => {
const card = document.createElement('div');
card.className = 'model-card';
card.innerHTML = `
<div class="model-info">
<div class="model-name"></div>
<div class="model-provider">Model Configuration</div>
</div>
<div class="model-actions">
<button class="btn btn-danger">
<i class="fas fa-trash"></i>
Delete
</button>
</div>
`;
card.querySelector('.model-name').textContent = model;
card.querySelector('.btn-danger').addEventListener('click', () => deleteModel(model));
return card;
}));
}
|
||||
|
||||
// Delete Model
|
||||
async function deleteModel(modelName) {
|
||||
if (!confirm(`Are you sure you want to delete the model "${modelName}"?`)) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/models/${encodeURIComponent(modelName)}`, {
|
||||
method: 'DELETE'
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
showToast('Model deleted successfully!', 'success');
|
||||
loadModels();
|
||||
loadModelSelect();
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to delete model');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Delete model error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// Load Model Select Options
|
||||
async function loadModelSelect() {
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/models`);
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
// Clear existing options
|
||||
modelSelect.innerHTML = '<option value="">Select a Model</option>';
|
||||
|
||||
// Add model options
|
||||
result.models.forEach(model => {
|
||||
const option = document.createElement('option');
|
||||
option.value = model;
|
||||
option.textContent = model;
|
||||
modelSelect.appendChild(option);
|
||||
});
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Load model select error:', error);
|
||||
}
|
||||
}
|
||||
|
||||
// Display Results
|
||||
function displayResults(result) {
|
||||
// Update result info
|
||||
document.getElementById('result-url').textContent = result.url;
|
||||
document.getElementById('result-query').textContent = result.query;
|
||||
document.getElementById('result-model').textContent = result.model_name || 'Default Model';
|
||||
|
||||
// Display JSON in the actual results section
|
||||
const actualJsonOutput = document.getElementById('actual-json-output');
|
||||
actualJsonOutput.textContent = JSON.stringify(result.extracted_data, null, 2);
|
||||
|
||||
// Don't update the sample JSON in the workflow demo - keep it as example
|
||||
|
||||
// Update the cURL example based on the approach used
|
||||
const scraping_approach = document.getElementById('scraping-approach').value;
|
||||
const endpoint = scraping_approach === 'llm' ? '/scrape-with-llm' : '/scrape';
|
||||
const curlExample = document.getElementById('curl-example');
|
||||
curlExample.textContent = `curl -X POST ${API_BASE_URL}${endpoint} -H "Content-Type: application/json" -d '${JSON.stringify({ url: result.url, query: result.query })}'`;
|
||||
|
||||
// Show results section
|
||||
resultsSection.style.display = 'block';
|
||||
resultsSection.scrollIntoView({ behavior: 'smooth' });
|
||||
}
|
||||
|
||||
// Show/Hide Loading
|
||||
function showLoading(show) {
|
||||
loadingSection.style.display = show ? 'block' : 'none';
|
||||
}
|
||||
|
||||
// Hide Results
|
||||
function hideResults() {
|
||||
resultsSection.style.display = 'none';
|
||||
}
|
||||
|
||||
// Toast Notifications
|
||||
function showToast(message, type = 'info') {
|
||||
const toastContainer = document.getElementById('toast-container');
|
||||
const toast = document.createElement('div');
|
||||
toast.className = `toast ${type}`;
|
||||
|
||||
const icon = type === 'success' ? 'fas fa-check-circle' :
|
||||
type === 'error' ? 'fas fa-exclamation-circle' :
|
||||
'fas fa-info-circle';
|
||||
|
||||
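// Build the toast from DOM nodes so the message is inserted as text, not parsed as HTML
const iconEl = document.createElement('i');
iconEl.className = icon;
const textEl = document.createElement('span');
textEl.textContent = message;
toast.append(iconEl, textEl);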
|
||||
|
||||
toastContainer.appendChild(toast);
|
||||
|
||||
// Auto remove after 5 seconds
|
||||
setTimeout(() => {
|
||||
toast.remove();
|
||||
}, 5000);
|
||||
}
|
||||
|
||||
// Load Saved Requests
|
||||
async function loadSavedRequests() {
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/saved-requests`);
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
displaySavedRequests(result.requests);
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to load saved requests');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Load saved requests error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// Display Saved Requests
|
||||
function displaySavedRequests(requests) {
|
||||
const requestsList = document.getElementById('requests-list');
|
||||
|
||||
if (requests.length === 0) {
|
||||
requestsList.innerHTML = '<p style="text-align: center; color: #CCCCCC; padding: 2rem;">No saved API requests yet. Make your first request from the Scrape page!</p>';
|
||||
return;
|
||||
}
|
||||
|
||||
requestsList.innerHTML = requests.map(request => {
|
||||
const url = request.body.url;
|
||||
const query = request.body.query;
|
||||
const model = request.body.model_name || 'Default Model';
|
||||
const endpoint = request.endpoint;
|
||||
|
||||
// Create curl command
|
||||
const curlCommand = `curl -X POST http://localhost:8000${endpoint} \\
|
||||
-H "Content-Type: application/json" \\
|
||||
-d '{
|
||||
"url": "${url}",
|
||||
"query": "${query}",
|
||||
"model_name": "${model}"
|
||||
}'`;
|
||||
|
||||
return `
|
||||
<div class="request-card">
|
||||
<div class="request-header">
|
||||
<div class="request-info">
|
||||
<div class="request-url">${url}</div>
|
||||
<div class="request-query">${query}</div>
|
||||
</div>
|
||||
<div class="request-actions">
|
||||
<button class="btn-danger" onclick="deleteSavedRequest('${request.id}')">
|
||||
<i class="fas fa-trash"></i>
|
||||
Delete
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="request-curl">
|
||||
<h4>cURL Command:</h4>
|
||||
<pre>${curlCommand}</pre>
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
}).join('');
|
||||
}
|
||||
|
||||
// Delete Saved Request
|
||||
async function deleteSavedRequest(requestId) {
|
||||
if (!confirm('Are you sure you want to delete this saved request?')) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
const response = await fetch(`${API_BASE_URL}/saved-requests/${encodeURIComponent(requestId)}`, {
|
||||
method: 'DELETE'
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (response.ok) {
|
||||
showToast('Saved request deleted successfully!', 'success');
|
||||
loadSavedRequests();
|
||||
} else {
|
||||
throw new Error(result.detail || 'Failed to delete saved request');
|
||||
}
|
||||
} catch (error) {
|
||||
console.error('Delete saved request error:', error);
|
||||
showToast(`Error: ${error.message}`, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// Initialize
|
||||
document.addEventListener('DOMContentLoaded', () => {
|
||||
loadModelSelect();
|
||||
|
||||
// Check if API is available
|
||||
fetch(`${API_BASE_URL}/health`)
|
||||
.then(response => {
|
||||
if (!response.ok) {
|
||||
showToast('Warning: API server might not be running', 'error');
|
||||
}
|
||||
})
|
||||
.catch(() => {
|
||||
showToast('Warning: Cannot connect to API server. Make sure it\'s running on localhost:8000', 'error');
|
||||
});
|
||||
});
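Several functions in this script interpolate values such as URLs, queries, and model names into `innerHTML` templates. One common way to make that safe is to escape the values first. The sketch below is an illustration only: `escapeHtml` and `renderModelCard` are hypothetical names that do not appear in the example, and the `data-model` attribute is an assumed alternative to the inline `onclick` handler.

```javascript
// Hypothetical helper (not part of the original script): escapes the five
// characters that are significant in HTML text and attribute values.
function escapeHtml(value) {
    return String(value)
        .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Hypothetical card renderer: the model name is escaped once and reused in
// both the visible text and a data attribute, instead of being spliced into
// an inline onclick string.
function renderModelCard(model) {
    const safe = escapeHtml(model);
    return `<div class="model-card">` +
        `<div class="model-name">${safe}</div>` +
        `<button class="btn btn-danger" data-model="${safe}">Delete</button>` +
        `</div>`;
}
```

With this shape, a single delegated click listener on the list container could read `event.target.dataset.model` and call `deleteModel` with it, so no user-controlled text ever ends up inside a JavaScript string in the markup.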
|
||||
765
docs/examples/website-to-api/static/styles.css
Normal file
765
docs/examples/website-to-api/static/styles.css
Normal file
@@ -0,0 +1,765 @@
|
||||
/* Reset and Base Styles */
|
||||
* {
|
||||
margin: 0;
|
||||
padding: 0;
|
||||
box-sizing: border-box;
|
||||
}
|
||||
|
||||
body {
|
||||
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
||||
background: #000000;
|
||||
color: #FFFFFF;
|
||||
line-height: 1.6;
|
||||
font-size: 16px;
|
||||
}
|
||||
|
||||
/* Header */
|
||||
.header {
|
||||
border-bottom: 1px solid #333;
|
||||
padding: 1rem 0;
|
||||
background: #000000;
|
||||
position: sticky;
|
||||
top: 0;
|
||||
z-index: 100;
|
||||
}
|
||||
|
||||
.header-content {
|
||||
max-width: 1200px;
|
||||
margin: 0 auto;
|
||||
padding: 0 2rem;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
}
|
||||
|
||||
.logo {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
font-size: 1.5rem;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
}
|
||||
|
||||
.logo-image {
|
||||
width: 40px;
|
||||
height: 40px;
|
||||
border-radius: 4px;
|
||||
object-fit: contain;
|
||||
}
|
||||
|
||||
.nav-links {
|
||||
display: flex;
|
||||
gap: 2rem;
|
||||
}
|
||||
|
||||
.nav-link {
|
||||
color: #CCCCCC;
|
||||
text-decoration: none;
|
||||
font-weight: 500;
|
||||
transition: color 0.2s ease;
|
||||
}
|
||||
|
||||
.nav-link:hover,
|
||||
.nav-link.active {
|
||||
color: #FFFFFF;
|
||||
}
|
||||
|
||||
/* Main Content */
|
||||
.main-content {
|
||||
max-width: 1200px;
|
||||
margin: 0 auto;
|
||||
padding: 2rem;
|
||||
}
|
||||
|
||||
.page {
|
||||
display: none;
|
||||
}
|
||||
|
||||
.page.active {
|
||||
display: block;
|
||||
}
|
||||
|
||||
/* Hero Section */
|
||||
.hero-section {
|
||||
text-align: center;
|
||||
margin-bottom: 4rem;
|
||||
padding: 2rem 0;
|
||||
}
|
||||
|
||||
.hero-title {
|
||||
font-size: 3rem;
|
||||
font-weight: 700;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 1rem;
|
||||
line-height: 1.2;
|
||||
}
|
||||
|
||||
.hero-subtitle {
|
||||
font-size: 1.25rem;
|
||||
color: #CCCCCC;
|
||||
max-width: 600px;
|
||||
margin: 0 auto;
|
||||
}
|
||||
|
||||
/* Workflow Demo */
|
||||
.workflow-demo {
|
||||
display: grid;
|
||||
grid-template-columns: 1fr auto 1fr;
|
||||
gap: 2rem;
|
||||
align-items: start;
|
||||
margin-bottom: 4rem;
|
||||
}
|
||||
|
||||
.workflow-step {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 1rem;
|
||||
}
|
||||
|
||||
.step-title {
|
||||
font-size: 1.25rem;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
text-align: center;
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
|
||||
.workflow-arrow {
|
||||
font-size: 2rem;
|
||||
font-weight: 700;
|
||||
color: #09b5a5;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
margin-top: 20rem;
|
||||
}
|
||||
|
||||
/* Request Box */
|
||||
.request-box {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
padding: 2rem;
|
||||
background: #111111;
|
||||
}
|
||||
|
||||
.input-group {
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.input-group label {
|
||||
display: block;
|
||||
font-family: 'Courier New', monospace;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 0.5rem;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.input-group input,
|
||||
.input-group textarea,
|
||||
.input-group select {
|
||||
width: 100%;
|
||||
padding: 0.75rem;
|
||||
border: 1px solid #333;
|
||||
border-radius: 4px;
|
||||
font-family: 'Courier New', monospace;
|
||||
font-size: 0.9rem;
|
||||
background: #1A1A1A;
|
||||
color: #FFFFFF;
|
||||
transition: border-color 0.2s ease;
|
||||
}
|
||||
|
||||
.input-group input:focus,
|
||||
.input-group textarea:focus,
|
||||
.input-group select:focus {
|
||||
outline: none;
|
||||
border-color: #09b5a5;
|
||||
}
|
||||
|
||||
.input-group textarea {
|
||||
min-height: 80px;
|
||||
resize: vertical;
|
||||
}
|
||||
|
||||
.form-options {
|
||||
display: grid;
|
||||
grid-template-columns: 1fr 1fr;
|
||||
gap: 1rem;
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.option-group {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.option-group label {
|
||||
font-family: 'Courier New', monospace;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.option-group input[type="checkbox"] {
|
||||
width: auto;
|
||||
margin-right: 0.5rem;
|
||||
}
|
||||
|
||||
.extract-btn {
|
||||
width: 100%;
|
||||
padding: 1rem;
|
||||
background: #09b5a5;
|
||||
color: #000000;
|
||||
border: none;
|
||||
border-radius: 4px;
|
||||
font-size: 1rem;
|
||||
font-weight: 600;
|
||||
cursor: pointer;
|
||||
transition: background-color 0.2s ease;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.extract-btn:hover {
|
||||
background: #078f83;
|
||||
}
|
||||
|
||||
/* Dropdown specific styling */
|
||||
select,
|
||||
.input-group select,
|
||||
.option-group select {
|
||||
cursor: pointer !important;
|
||||
appearance: none !important;
|
||||
-webkit-appearance: none !important;
|
||||
-moz-appearance: none !important;
|
||||
-ms-appearance: none !important;
|
||||
background-image: url("data:image/svg+xml;charset=UTF-8,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 24 24' fill='none' stroke='%23FFFFFF' stroke-width='2' stroke-linecap='round' stroke-linejoin='round'%3e%3cpolyline points='6,9 12,15 18,9'%3e%3c/polyline%3e%3c/svg%3e") !important;
|
||||
background-repeat: no-repeat !important;
|
||||
background-position: right 0.75rem center !important;
|
||||
background-size: 1rem !important;
|
||||
padding-right: 2.5rem !important;
|
||||
border: 1px solid #333 !important;
|
||||
border-radius: 4px !important;
|
||||
font-family: 'Courier New', monospace !important;
|
||||
font-size: 0.9rem !important;
|
||||
background-color: #1A1A1A !important;
|
||||
color: #FFFFFF !important;
|
||||
}
|
||||
|
||||
select:hover,
|
||||
.input-group select:hover,
|
||||
.option-group select:hover {
|
||||
border-color: #09b5a5 !important;
|
||||
}
|
||||
|
||||
select:focus,
|
||||
.input-group select:focus,
|
||||
.option-group select:focus {
|
||||
outline: none !important;
|
||||
border-color: #09b5a5 !important;
|
||||
}
|
||||
|
||||
select option,
|
||||
.input-group select option,
|
||||
.option-group select option {
|
||||
background: #1A1A1A !important;
|
||||
color: #FFFFFF !important;
|
||||
padding: 0.5rem !important;
|
||||
}
|
||||
|
||||
/* Response Container */
|
||||
.response-container {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 1rem;
|
||||
}
|
||||
|
||||
.api-request-box,
|
||||
.json-response-box {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
padding: 1.5rem;
|
||||
background: #111111;
|
||||
}
|
||||
|
||||
.api-request-box label,
|
||||
.json-response-box label {
|
||||
display: block;
|
||||
font-family: 'Courier New', monospace;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 0.5rem;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.api-request-box pre,
|
||||
.json-response-box pre {
|
||||
font-family: 'Courier New', monospace;
|
||||
font-size: 0.85rem;
|
||||
line-height: 1.5;
|
||||
color: #FFFFFF;
|
||||
background: #1A1A1A;
|
||||
padding: 1rem;
|
||||
border-radius: 4px;
|
||||
overflow-x: auto;
|
||||
white-space: pre-wrap;
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
/* Results Section */
|
||||
.results-section {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
overflow: hidden;
|
||||
margin-top: 2rem;
|
||||
background: #111111;
|
||||
}
|
||||
|
||||
.results-header {
|
||||
background: #1A1A1A;
|
||||
color: #FFFFFF;
|
||||
padding: 1rem 1.5rem;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
border-bottom: 1px solid #333;
|
||||
}
|
||||
|
||||
.results-header h2 {
|
||||
font-size: 1.25rem;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
}
|
||||
|
||||
.copy-btn {
|
||||
background: #09b5a5;
|
||||
color: #000000;
|
||||
border: none;
|
||||
padding: 0.5rem 1rem;
|
||||
border-radius: 4px;
|
||||
font-size: 0.9rem;
|
||||
font-weight: 600;
|
||||
cursor: pointer;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
transition: background-color 0.2s ease;
|
||||
}
|
||||
|
||||
.copy-btn:hover {
|
||||
background: #078f83;
|
||||
}
|
||||
|
||||
.results-content {
|
||||
padding: 1.5rem;
|
||||
}
|
||||
|
||||
.result-info {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
|
||||
gap: 1rem;
|
||||
margin-bottom: 1.5rem;
|
||||
padding: 1rem;
|
||||
background: #1A1A1A;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #333;
|
||||
}
|
||||
|
||||
.info-item {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 0.25rem;
|
||||
}
|
||||
|
||||
.info-item .label {
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.info-item .value {
|
||||
color: #CCCCCC;
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
.json-display {
|
||||
background: #1A1A1A;
|
||||
border-radius: 4px;
|
||||
overflow: hidden;
|
||||
border: 1px solid #333;
|
||||
}
|
||||
|
||||
.json-display pre {
|
||||
color: #FFFFFF;
|
||||
padding: 1.5rem;
|
||||
margin: 0;
|
||||
overflow-x: auto;
|
||||
font-family: 'Courier New', monospace;
|
||||
font-size: 0.9rem;
|
||||
line-height: 1.5;
|
||||
}
|
||||
|
||||
/* Loading State */
|
||||
.loading {
|
||||
text-align: center;
|
||||
padding: 3rem;
|
||||
}
|
||||
|
||||
.spinner {
|
||||
width: 40px;
|
||||
height: 40px;
|
||||
border: 3px solid #333;
|
||||
border-top: 3px solid #09b5a5;
|
||||
border-radius: 50%;
|
||||
animation: spin 1s linear infinite;
|
||||
margin: 0 auto 1rem;
|
||||
}
|
||||
|
||||
@keyframes spin {
|
||||
0% { transform: rotate(0deg); }
|
||||
100% { transform: rotate(360deg); }
|
||||
}
|
||||
|
||||
/* Models Page */
|
||||
.models-header {
|
||||
text-align: center;
|
||||
margin-bottom: 3rem;
|
||||
}
|
||||
|
||||
.models-header h1 {
|
||||
font-size: 2.5rem;
|
||||
font-weight: 700;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
|
||||
.models-header p {
|
||||
font-size: 1.1rem;
|
||||
color: #CCCCCC;
|
||||
}
|
||||
|
||||
/* API Requests Page */
|
||||
.requests-header {
|
||||
text-align: center;
|
||||
margin-bottom: 3rem;
|
||||
}
|
||||
|
||||
.requests-header h1 {
|
||||
font-size: 2.5rem;
|
||||
font-weight: 700;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
|
||||
.requests-header p {
|
||||
font-size: 1.1rem;
|
||||
color: #CCCCCC;
|
||||
}
|
||||
|
||||
.requests-container {
|
||||
max-width: 1200px;
|
||||
margin: 0 auto;
|
||||
}
|
||||
|
||||
.requests-list {
|
||||
display: grid;
|
||||
gap: 1.5rem;
|
||||
}
|
||||
|
||||
.request-card {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
padding: 1.5rem;
|
||||
background: #111111;
|
||||
transition: border-color 0.2s ease;
|
||||
}
|
||||
|
||||
.request-card:hover {
|
||||
border-color: #09b5a5;
|
||||
}
|
||||
|
||||
.request-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
margin-bottom: 1rem;
|
||||
padding-bottom: 1rem;
|
||||
border-bottom: 1px solid #333;
|
||||
}
|
||||
|
||||
.request-info {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.request-url {
|
||||
font-family: 'Courier New', monospace;
|
||||
font-weight: 600;
|
||||
color: #09b5a5;
|
||||
font-size: 1.1rem;
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
.request-query {
|
||||
color: #CCCCCC;
|
||||
font-size: 0.9rem;
|
||||
margin-top: 0.5rem;
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
.request-actions {
|
||||
display: flex;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.request-curl {
|
||||
background: #1A1A1A;
|
||||
border: 1px solid #333;
|
||||
border-radius: 4px;
|
||||
padding: 1rem;
|
||||
margin-top: 1rem;
|
||||
}
|
||||
|
||||
.request-curl h4 {
|
||||
color: #FFFFFF;
|
||||
font-size: 0.9rem;
|
||||
font-weight: 600;
|
||||
margin-bottom: 0.5rem;
|
||||
font-family: 'Courier New', monospace;
|
||||
}
|
||||
|
||||
.request-curl pre {
|
||||
color: #CCCCCC;
|
||||
font-size: 0.8rem;
|
||||
line-height: 1.4;
|
||||
overflow-x: auto;
|
||||
white-space: pre-wrap;
|
||||
word-break: break-all;
|
||||
background: #111111;
|
||||
padding: 0.75rem;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #333;
|
||||
}
|
||||
|
||||
.models-container {
|
||||
max-width: 800px;
|
||||
margin: 0 auto;
|
||||
}
|
||||
|
||||
.model-form-section {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
padding: 2rem;
|
||||
margin-bottom: 2rem;
|
||||
background: #111111;
|
||||
}
|
||||
|
||||
.model-form-section h3 {
|
||||
font-size: 1.25rem;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.model-form {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 1.5rem;
|
||||
}
|
||||
|
||||
.form-row {
|
||||
display: grid;
|
||||
grid-template-columns: 1fr 1fr;
|
||||
gap: 1rem;
|
||||
}
|
||||
|
||||
.save-btn {
|
||||
padding: 1rem;
|
||||
background: #09b5a5;
|
||||
color: #000000;
|
||||
border: none;
|
||||
border-radius: 4px;
|
||||
font-size: 1rem;
|
||||
font-weight: 600;
|
||||
cursor: pointer;
|
||||
transition: background-color 0.2s ease;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.save-btn:hover {
|
||||
background: #078f83;
|
||||
}
|
||||
|
||||
.saved-models-section h3 {
|
||||
font-size: 1.25rem;
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.models-list {
|
||||
display: grid;
|
||||
gap: 1rem;
|
||||
}
|
||||
|
||||
.model-card {
|
||||
border: 2px solid #333;
|
||||
border-radius: 8px;
|
||||
padding: 1.5rem;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
transition: border-color 0.2s ease;
|
||||
background: #111111;
|
||||
}
|
||||
|
||||
.model-card:hover {
|
||||
border-color: #09b5a5;
|
||||
}
|
||||
|
||||
.model-info {
|
||||
flex: 1;
|
||||
}
|
||||
|
||||
.model-name {
|
||||
font-weight: 600;
|
||||
color: #FFFFFF;
|
||||
font-size: 1.1rem;
|
||||
margin-bottom: 0.5rem;
|
||||
}
|
||||
|
||||
.model-provider {
|
||||
color: #CCCCCC;
|
||||
font-size: 0.9rem;
|
||||
}
|
||||
|
||||
.model-actions {
|
||||
display: flex;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.btn-danger {
|
||||
background: #FF4444;
|
||||
color: #FFFFFF;
|
||||
border: none;
|
||||
padding: 0.5rem 1rem;
|
||||
border-radius: 4px;
|
||||
font-size: 0.9rem;
|
||||
font-weight: 600;
|
||||
cursor: pointer;
|
||||
transition: background-color 0.2s ease;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
}
|
||||
|
||||
.btn-danger:hover {
|
||||
background: #CC3333;
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* Toast Notifications */
|
||||
.toast-container {
|
||||
position: fixed;
|
||||
top: 20px;
|
||||
right: 20px;
|
||||
z-index: 1000;
|
||||
}
|
||||
|
||||
.toast {
|
||||
background: #111111;
|
||||
border: 2px solid #333;
|
||||
border-radius: 4px;
|
||||
padding: 1rem 1.5rem;
|
||||
margin-bottom: 0.5rem;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
animation: slideIn 0.3s ease;
|
||||
max-width: 400px;
|
||||
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3);
|
||||
color: #FFFFFF;
|
||||
}
|
||||
|
||||
.toast.success {
|
||||
border-color: #09b5a5;
|
||||
background: #0A1A1A;
|
||||
}
|
||||
|
||||
.toast.error {
|
||||
border-color: #FF4444;
|
||||
background: #1A0A0A;
|
||||
}
|
||||
|
||||
.toast.info {
|
||||
border-color: #09b5a5;
|
||||
background: #0A1A1A;
|
||||
}
|
||||
|
||||
@keyframes slideIn {
|
||||
from {
|
||||
transform: translateX(100%);
|
||||
opacity: 0;
|
||||
}
|
||||
to {
|
||||
transform: translateX(0);
|
||||
opacity: 1;
|
||||
}
|
||||
}
|
||||
|
||||
/* Responsive Design */
|
||||
@media (max-width: 768px) {
|
||||
.header-content {
|
||||
padding: 0 1rem;
|
||||
}
|
||||
|
||||
.main-content {
|
||||
padding: 1rem;
|
||||
}
|
||||
|
||||
.hero-title {
|
||||
font-size: 2rem;
|
||||
}
|
||||
|
||||
.workflow-demo {
|
||||
grid-template-columns: 1fr;
|
||||
gap: 1rem;
|
||||
}
|
||||
|
||||
.workflow-arrow {
|
||||
transform: rotate(90deg);
|
||||
margin: 1rem 0;
|
||||
}
|
||||
|
||||
.form-options {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
|
||||
.form-row {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
|
||||
.result-info {
|
||||
grid-template-columns: 1fr;
|
||||
}
|
||||
|
||||
.model-card {
|
||||
flex-direction: column;
|
||||
gap: 1rem;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
.model-actions {
|
||||
width: 100%;
|
||||
justify-content: center;
|
||||
}
|
||||
}
|
||||
28
docs/examples/website-to-api/test_api.py
Normal file
@@ -0,0 +1,28 @@
|
||||
import asyncio
|
||||
from web_scraper_lib import scrape_website
|
||||
import os
|
||||
|
||||
async def test_library():
|
||||
"""Test the mini library directly."""
|
||||
print("=== Testing Mini Library ===")
|
||||
|
||||
# Test 1: Scrape with a custom model
|
||||
url = "https://marketplace.mainstreet.co.in/collections/adidas-yeezy/products/adidas-yeezy-boost-350-v2-yecheil-non-reflective"
|
||||
query = "Extract the following data: Product name, Product price, Product description, Product size. DO NOT EXTRACT ANYTHING ELSE."
|
||||
if os.path.exists("models") and os.listdir("models"):
|
||||
model_name = os.listdir("models")[0].split(".")[0]
|
||||
else:
|
||||
raise Exception("No models found in models directory")
|
||||
|
||||
print(f"Scraping: {url}")
|
||||
print(f"Query: {query}")
|
||||
|
||||
try:
|
||||
result = await scrape_website(url, query, model_name)
|
||||
print("✅ Library test successful!")
|
||||
print(f"Extracted data: {result['extracted_data']}")
|
||||
except Exception as e:
|
||||
print(f"❌ Library test failed: {e}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(test_library())
|
||||
67
docs/examples/website-to-api/test_models.py
Normal file
@@ -0,0 +1,67 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test script for the new model management functionality.
|
||||
This script demonstrates how to save and use custom model configurations.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import requests
|
||||
import json
|
||||
|
||||
# API base URL
|
||||
BASE_URL = "http://localhost:8000"
|
||||
|
||||
def test_model_management():
|
||||
"""Test the model management endpoints."""
|
||||
|
||||
print("=== Testing Model Management ===")
|
||||
|
||||
# 1. List current models
|
||||
print("\n1. Listing current models:")
|
||||
response = requests.get(f"{BASE_URL}/models")
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Response: {json.dumps(response.json(), indent=2)}")
|
||||
|
||||
|
||||
# 2. Save another model configuration (OpenAI example)
|
||||
print("\n2. Saving OpenAI model configuration:")
|
||||
openai_config = {
|
||||
"model_name": "my-openai",
|
||||
"provider": "openai",
|
||||
"api_token": "your-openai-api-key-here"
|
||||
}
|
||||
|
||||
response = requests.post(f"{BASE_URL}/models", json=openai_config)
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Response: {json.dumps(response.json(), indent=2)}")
|
||||
|
||||
# 3. List models again to see the new ones
|
||||
print("\n3. Listing models after adding new ones:")
|
||||
response = requests.get(f"{BASE_URL}/models")
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Response: {json.dumps(response.json(), indent=2)}")
|
||||
|
||||
# 4. Delete a model configuration
|
||||
print("\n4. Deleting a model configuration:")
|
||||
response = requests.delete(f"{BASE_URL}/models/my-openai")
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Response: {json.dumps(response.json(), indent=2)}")
|
||||
|
||||
# 5. Final list of models
|
||||
print("\n5. Final list of models:")
|
||||
response = requests.get(f"{BASE_URL}/models")
|
||||
print(f"Status: {response.status_code}")
|
||||
print(f"Response: {json.dumps(response.json(), indent=2)}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("Model Management Test Script")
|
||||
print("Make sure the API server is running on http://localhost:8000")
|
||||
print("=" * 50)
|
||||
|
||||
try:
|
||||
test_model_management()
|
||||
except requests.exceptions.ConnectionError:
|
||||
print("Error: Could not connect to the API server.")
|
||||
print("Make sure the server is running with: python api_server.py")
|
||||
except Exception as e:
|
||||
print(f"Error: {e}")
|
||||
397
docs/examples/website-to-api/web_scraper_lib.py
Normal file
@@ -0,0 +1,397 @@
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CacheMode,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
JsonCssExtractionStrategy,
|
||||
LLMExtractionStrategy
|
||||
)
|
||||
import os
|
||||
import json
|
||||
import hashlib
|
||||
from typing import Dict, Any, Optional, List
|
||||
from litellm import completion
|
||||
|
||||
class ModelConfig:
|
||||
"""Configuration for LLM models."""
|
||||
|
||||
def __init__(self, provider: str, api_token: str):
|
||||
self.provider = provider
|
||||
self.api_token = api_token
|
||||
|
||||
def to_dict(self) -> Dict[str, Any]:
|
||||
return {
|
||||
"provider": self.provider,
|
||||
"api_token": self.api_token
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: Dict[str, Any]) -> 'ModelConfig':
|
||||
return cls(
|
||||
provider=data["provider"],
|
||||
api_token=data["api_token"]
|
||||
)
|
||||
|
||||
class WebScraperAgent:
|
||||
"""
|
||||
A mini library that converts any website into a structured data API.
|
||||
|
||||
Features:
|
||||
1. Provide a URL and tell AI what data you need in plain English
|
||||
2. Generate: Agent reverse-engineers the site and deploys custom scraper
|
||||
3. Integrate: Use private API endpoint to get structured data
|
||||
4. Support for custom LLM models and API keys
|
||||
"""
|
||||
|
||||
def __init__(self, schemas_dir: str = "schemas", models_dir: str = "models"):
|
||||
self.schemas_dir = schemas_dir
|
||||
self.models_dir = models_dir
|
||||
os.makedirs(self.schemas_dir, exist_ok=True)
|
||||
os.makedirs(self.models_dir, exist_ok=True)
|
||||
|
||||
def _generate_schema_key(self, url: str, query: str) -> str:
|
||||
"""Generate a unique key for schema caching based on URL and query."""
|
||||
content = f"{url}:{query}"
|
||||
return hashlib.md5(content.encode()).hexdigest()
|
||||
|
||||
def save_model_config(self, model_name: str, provider: str, api_token: str) -> bool:
|
||||
"""
|
||||
Save a model configuration for later use.
|
||||
|
||||
Args:
|
||||
model_name: User-friendly name for the model
|
||||
provider: LLM provider (e.g., 'gemini', 'openai', 'anthropic')
|
||||
api_token: API token for the provider
|
||||
|
||||
Returns:
|
||||
True if saved successfully
|
||||
"""
|
||||
try:
|
||||
model_config = ModelConfig(provider, api_token)
|
||||
config_path = os.path.join(self.models_dir, f"{model_name}.json")
|
||||
|
||||
with open(config_path, "w") as f:
|
||||
json.dump(model_config.to_dict(), f, indent=2)
|
||||
|
||||
print(f"Model configuration saved: {model_name}")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"Failed to save model configuration: {e}")
|
||||
return False
|
||||
|
||||
def load_model_config(self, model_name: str) -> Optional[ModelConfig]:
|
||||
"""
|
||||
Load a saved model configuration.
|
||||
|
||||
Args:
|
||||
model_name: Name of the saved model configuration
|
||||
|
||||
Returns:
|
||||
ModelConfig object or None if not found
|
||||
"""
|
||||
try:
|
||||
config_path = os.path.join(self.models_dir, f"{model_name}.json")
|
||||
if not os.path.exists(config_path):
|
||||
return None
|
||||
|
||||
with open(config_path, "r") as f:
|
||||
data = json.load(f)
|
||||
|
||||
return ModelConfig.from_dict(data)
|
||||
except Exception as e:
|
||||
print(f"Failed to load model configuration: {e}")
|
||||
return None
|
||||
|
||||
def list_saved_models(self) -> List[str]:
|
||||
"""List all saved model configurations."""
|
||||
models = []
|
||||
for filename in os.listdir(self.models_dir):
|
||||
if filename.endswith('.json'):
|
||||
models.append(filename[:-5]) # Remove .json extension
|
||||
return models
|
||||
|
||||
def delete_model_config(self, model_name: str) -> bool:
|
||||
"""
|
||||
Delete a saved model configuration.
|
||||
|
||||
Args:
|
||||
model_name: Name of the model configuration to delete
|
||||
|
||||
Returns:
|
||||
True if deleted successfully
|
||||
"""
|
||||
try:
|
||||
config_path = os.path.join(self.models_dir, f"{model_name}.json")
|
||||
if os.path.exists(config_path):
|
||||
os.remove(config_path)
|
||||
print(f"Model configuration deleted: {model_name}")
|
||||
return True
|
||||
return False
|
||||
except Exception as e:
|
||||
print(f"Failed to delete model configuration: {e}")
|
||||
return False
|
||||
|
||||
async def _load_or_generate_schema(self, url: str, query: str, session_id: str = "schema_generator", model_name: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Load the schema from cache if it exists; otherwise generate it using AI.
|
||||
This is the "Generate" step - our agent reverse-engineers the site.
|
||||
|
||||
Args:
|
||||
url: URL to scrape
|
||||
query: Query for data extraction
|
||||
session_id: Session identifier
|
||||
model_name: Name of saved model configuration to use
|
||||
"""
|
||||
schema_key = self._generate_schema_key(url, query)
|
||||
schema_path = os.path.join(self.schemas_dir, f"{schema_key}.json")
|
||||
|
||||
if os.path.exists(schema_path):
|
||||
print(f"Schema found in cache for {url}")
|
||||
with open(schema_path, "r") as f:
|
||||
return json.load(f)
|
||||
|
||||
print(f"Generating new schema for {url}")
|
||||
print(f"Query: {query}")
|
||||
query += """
|
||||
IMPORTANT:
|
||||
GENERATE THE SCHEMA WITH ONLY THE FIELDS MENTIONED IN THE QUERY. MAKE SURE THE NUMBER OF FIELDS IN THE SCHEMA MATCHES THE NUMBER OF FIELDS IN THE QUERY.
|
||||
"""
|
||||
|
||||
# Step 1: Fetch the page HTML
|
||||
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
session_id=session_id,
|
||||
simulate_user=True,
|
||||
remove_overlay_elements=True,
|
||||
delay_before_return_html=5,
|
||||
)
|
||||
)
|
||||
html = result.fit_html
|
||||
|
||||
# Step 2: Generate schema using AI with custom model if specified
|
||||
print("AI is analyzing the page structure...")
|
||||
|
||||
# Use custom model configuration if provided
|
||||
if model_name:
|
||||
model_config = self.load_model_config(model_name)
|
||||
if model_config:
|
||||
llm_config = LLMConfig(
|
||||
provider=model_config.provider,
|
||||
api_token=model_config.api_token
|
||||
)
|
||||
print(f"Using custom model: {model_name}")
|
||||
else:
|
||||
raise ValueError(f"Model configuration '{model_name}' not found. Please add it from the Models page.")
|
||||
else:
|
||||
# Require a model to be specified
|
||||
raise ValueError("No model specified. Please select a model from the dropdown or add one from the Models page.")
|
||||
|
||||
schema = JsonCssExtractionStrategy.generate_schema(
|
||||
html=html,
|
||||
llm_config=llm_config,
|
||||
query=query
|
||||
)
|
||||
|
||||
# Step 3: Cache the generated schema
|
||||
print(f"Schema generated and cached: {json.dumps(schema, indent=2)}")
|
||||
with open(schema_path, "w") as f:
|
||||
json.dump(schema, f, indent=2)
|
||||
|
||||
return schema
|
||||
|
||||
def _generate_llm_schema(self, query: str, llm_config: LLMConfig) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate a schema for a given query using a custom LLM model.
|
||||
|
||||
Args:
|
||||
query: Plain English description of what data to extract
|
||||
llm_config: LLM configuration (provider and API token) to use
|
||||
"""
|
||||
# ask the model to generate a schema for the given query in the form of a json.
|
||||
prompt = f"""
|
||||
IDENTIFY THE FIELDS FOR EXTRACTION MENTIONED IN THE QUERY and GENERATE A JSON SCHEMA FOR THE FIELDS.
|
||||
eg.
|
||||
{{
|
||||
"name": "str",
|
||||
"age": "str",
|
||||
"email": "str",
|
||||
"product_name": "str",
|
||||
"product_price": "str",
|
||||
"product_description": "str",
|
||||
"product_image": "str",
|
||||
"product_url": "str",
|
||||
"product_rating": "str",
|
||||
"product_reviews": "str",
|
||||
}}
|
||||
Here is the query:
|
||||
{query}
|
||||
IMPORTANT:
|
||||
THE RESULT SHOULD BE A JSON OBJECT.
|
||||
MAKE SURE THE NUMBER OF FIELDS IN THE RESULT MATCH THE NUMBER OF FIELDS IN THE QUERY.
|
||||
THE RESULT SHOULD BE A JSON OBJECT.
|
||||
"""
|
||||
response = completion(
|
||||
model=llm_config.provider,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
api_key=llm_config.api_token,
|
||||
response_format={"type": "json_object"}
|
||||
)
|
||||
|
||||
return response.json()["choices"][0]["message"]["content"]
|
||||
async def scrape_data_with_llm(self, url: str, query: str, model_name: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Scrape structured data from any website using a custom LLM model.
|
||||
|
||||
Args:
|
||||
url: The website URL to scrape
|
||||
query: Plain English description of what data to extract
|
||||
model_name: Name of saved model configuration to use
|
||||
"""
|
||||
|
||||
if model_name:
|
||||
model_config = self.load_model_config(model_name)
|
||||
if model_config:
|
||||
llm_config = LLMConfig(
|
||||
provider=model_config.provider,
|
||||
api_token=model_config.api_token
|
||||
)
|
||||
print(f"Using custom model: {model_name}")
|
||||
else:
|
||||
raise ValueError(f"Model configuration '{model_name}' not found. Please add it from the Models page.")
|
||||
else:
|
||||
# Require a model to be specified
|
||||
raise ValueError("No model specified. Please select a model from the dropdown or add one from the Models page.")
|
||||
|
||||
query += """\n
|
||||
IMPORTANT:
|
||||
THE RESULT SHOULD BE A JSON OBJECT WITH ONLY THE FIELDS MENTIONED IN THE QUERY.
|
||||
MAKE SURE THE NUMBER OF FIELDS IN THE RESULT MATCH THE NUMBER OF FIELDS IN THE QUERY.
|
||||
THE RESULT SHOULD BE A JSON OBJECT.
|
||||
"""
|
||||
|
||||
schema = self._generate_llm_schema(query, llm_config)
|
||||
|
||||
print(f"Schema: {schema}")
|
||||
|
||||
llm_extraction_strategy = LLMExtractionStrategy(
|
||||
llm_config=llm_config,
|
||||
instruction=query,
|
||||
result_type="json",
|
||||
schema=schema
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
simulate_user=True,
|
||||
extraction_strategy=llm_extraction_strategy,
|
||||
)
|
||||
)
|
||||
extracted_data = result.extracted_content
|
||||
if isinstance(extracted_data, str):
|
||||
try:
|
||||
extracted_data = json.loads(extracted_data)
|
||||
except json.JSONDecodeError:
|
||||
# If it's not valid JSON, keep it as string
|
||||
pass
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"query": query,
|
||||
"extracted_data": extracted_data,
|
||||
"timestamp": result.timestamp if hasattr(result, 'timestamp') else None
|
||||
}
|
||||
|
||||
async def scrape_data(self, url: str, query: str, model_name: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Main method to scrape structured data from any website.
|
||||
|
||||
Args:
|
||||
url: The website URL to scrape
|
||||
query: Plain English description of what data to extract
|
||||
model_name: Name of saved model configuration to use
|
||||
|
||||
Returns:
|
||||
Structured data extracted from the website
|
||||
"""
|
||||
# Step 1: Generate or load schema (reverse-engineer the site)
|
||||
schema = await self._load_or_generate_schema(url=url, query=query, model_name=model_name)
|
||||
|
||||
# Step 2: Deploy custom high-speed scraper
|
||||
print(f"Deploying custom scraper for {url}")
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema=schema),
|
||||
)
|
||||
result = await crawler.arun(url=url, config=run_config)
|
||||
|
||||
# Step 3: Return structured data
|
||||
# Parse extracted_content if it's a JSON string
|
||||
extracted_data = result.extracted_content
|
||||
if isinstance(extracted_data, str):
|
||||
try:
|
||||
extracted_data = json.loads(extracted_data)
|
||||
except json.JSONDecodeError:
|
||||
# If it's not valid JSON, keep it as string
|
||||
pass
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"query": query,
|
||||
"extracted_data": extracted_data,
|
||||
"schema_used": schema,
|
||||
"timestamp": result.timestamp if hasattr(result, 'timestamp') else None
|
||||
}
|
||||
|
||||
async def get_cached_schemas(self) -> Dict[str, str]:
|
||||
"""Get list of cached schemas."""
|
||||
schemas = {}
|
||||
for filename in os.listdir(self.schemas_dir):
|
||||
if filename.endswith('.json'):
|
||||
schema_key = filename[:-5] # Remove .json extension
|
||||
schemas[schema_key] = filename
|
||||
return schemas
|
||||
|
||||
def clear_cache(self):
|
||||
"""Clear all cached schemas."""
|
||||
import shutil
|
||||
if os.path.exists(self.schemas_dir):
|
||||
shutil.rmtree(self.schemas_dir)
|
||||
os.makedirs(self.schemas_dir, exist_ok=True)
|
||||
print("Schema cache cleared")
|
||||
|
||||
# Convenience function for simple usage
|
||||
async def scrape_website(url: str, query: str, model_name: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""
|
||||
Simple function to scrape any website with plain English instructions.
|
||||
|
||||
Args:
|
||||
url: Website URL
|
||||
query: Plain English description of what data to extract
|
||||
model_name: Name of saved model configuration to use
|
||||
|
||||
Returns:
|
||||
Extracted structured data
|
||||
"""
|
||||
agent = WebScraperAgent()
|
||||
return await agent.scrape_data(url, query, model_name)
|
||||
|
||||
async def scrape_website_with_llm(url: str, query: str, model_name: Optional[str] = None):
|
||||
"""
|
||||
Scrape structured data from any website using a custom LLM model.
|
||||
|
||||
Args:
|
||||
url: The website URL to scrape
|
||||
query: Plain English description of what data to extract
|
||||
model_name: Name of saved model configuration to use
|
||||
"""
|
||||
agent = WebScraperAgent()
|
||||
return await agent.scrape_data_with_llm(url, query, model_name)
|
||||
@@ -126,30 +126,6 @@ Factors:
|
||||
- URL depth (fewer slashes = higher authority)
|
||||
- Clean URL structure
|
||||
|
||||
### Custom Link Scoring
|
||||
|
||||
```python
|
||||
class CustomLinkScorer:
|
||||
def score(self, link: Link, query: str, state: CrawlState) -> float:
|
||||
# Prioritize specific URL patterns
|
||||
if "/api/reference/" in link.href:
|
||||
return 2.0 # Double the score
|
||||
|
||||
# Deprioritize certain sections
|
||||
if "/archive/" in link.href:
|
||||
return 0.1 # Reduce score by 90%
|
||||
|
||||
# Default scoring
|
||||
return 1.0
|
||||
|
||||
# Use with adaptive crawler
|
||||
adaptive = AdaptiveCrawler(
|
||||
crawler,
|
||||
config=config,
|
||||
link_scorer=CustomLinkScorer()
|
||||
)
|
||||
```
|
||||
|
||||
## Domain-Specific Configurations
|
||||
|
||||
### Technical Documentation
|
||||
@@ -230,8 +206,12 @@ config = AdaptiveConfig(
|
||||
|
||||
# Periodically clean state
|
||||
if len(state.knowledge_base) > 1000:
|
||||
# Keep only most relevant
|
||||
state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
|
||||
# Keep only the top 500 most relevant docs
|
||||
top_content = adaptive.get_relevant_content(top_k=500)
|
||||
keep_indices = {d["index"] for d in top_content}
|
||||
state.knowledge_base = [
|
||||
doc for i, doc in enumerate(state.knowledge_base) if i in keep_indices
|
||||
]
|
||||
```
|
||||
|
||||
### Parallel Processing
|
||||
@@ -252,18 +232,6 @@ tasks = [
|
||||
results = await asyncio.gather(*tasks)
|
||||
```
|
||||
|
||||
### Caching Strategy
|
||||
|
||||
```python
|
||||
# Enable caching for repeated crawls
|
||||
async with AsyncWebCrawler(
|
||||
config=BrowserConfig(
|
||||
cache_mode=CacheMode.ENABLED
|
||||
)
|
||||
) as crawler:
|
||||
adaptive = AdaptiveCrawler(crawler, config)
|
||||
```
|
||||
|
||||
## Debugging & Analysis
|
||||
|
||||
### Enable Verbose Logging
|
||||
@@ -322,9 +290,9 @@ with open("crawl_analysis.json", "w") as f:
|
||||
### Implementing a Custom Strategy
|
||||
|
||||
```python
|
||||
from crawl4ai.adaptive_crawler import BaseStrategy
|
||||
from crawl4ai.adaptive_crawler import CrawlStrategy
|
||||
|
||||
class DomainSpecificStrategy(BaseStrategy):
|
||||
class DomainSpecificStrategy(CrawlStrategy):
|
||||
def calculate_coverage(self, state: CrawlState) -> float:
|
||||
# Custom coverage calculation
|
||||
# e.g., weight certain terms more heavily
|
||||
@@ -351,7 +319,7 @@ adaptive = AdaptiveCrawler(
|
||||
### Combining Strategies
|
||||
|
||||
```python
|
||||
class HybridStrategy(BaseStrategy):
|
||||
class HybridStrategy(CrawlStrategy):
|
||||
def __init__(self):
|
||||
self.strategies = [
|
||||
TechnicalDocStrategy(),
|
||||
|
||||
@@ -155,6 +155,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
|
||||
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
|
||||
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
|
||||
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
|
||||
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
|
||||
|
||||
Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
|
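The link-filtering options in the table above can be pictured with a small standalone sketch. This is not Crawl4AI's implementation: `filter_links` and its parameters are hypothetical names used only for illustration, and the exact-host comparison is a simplifying assumption (it treats subdomains as external).

```python
from urllib.parse import urlparse


def filter_links(links, base_url, exclude_external=False, exclude_domains=()):
    """Illustrative only: drop links by domain, roughly as the flags describe."""
    base_host = urlparse(base_url).hostname or ""
    blocked = {d.lower() for d in exclude_domains}
    kept = []
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        if host in blocked:
            continue  # explicitly excluded domain
        if exclude_external and host != base_host:
            continue  # link points outside the crawled domain
        kept.append(link)
    return kept


links = [
    "https://example.com/docs",
    "https://ads.com/banner",
    "https://other.org/page",
]
print(filter_links(links, "https://example.com", exclude_domains=["ads.com"]))
print(filter_links(links, "https://example.com", exclude_external=True))
```

The first call removes only the blocked domain; the second keeps only links on the same host as the base URL.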
||||
|
||||
|
||||
@@ -472,6 +472,17 @@ Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pag
|
||||
|
||||
5. **Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
|
||||
|
||||
6. **Preserve HTTPS for security.** If crawling HTTPS sites that redirect to HTTP, use `preserve_https_for_internal_links=True` to maintain secure connections:
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
|
||||
preserve_https_for_internal_links=True # Keep HTTPS even if server redirects to HTTP
|
||||
)
|
||||
```
|
||||
|
||||
This is especially useful for security-conscious crawling or when dealing with sites that support both protocols.
|
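To make the idea behind HTTPS preservation concrete, here is a conceptual sketch using only the standard library. It is not the library's actual code: `preserve_https` is a hypothetical helper, and the rule it applies (upgrade an `http://` link only when the crawl started on `https://` and the host matches) is an assumption about what the flag is meant to achieve.

```python
from urllib.parse import urlparse, urlunparse


def preserve_https(link: str, original_url: str) -> str:
    """Hypothetical helper: keep https for internal links of an https crawl."""
    orig = urlparse(original_url)
    parsed = urlparse(link)
    same_host = parsed.hostname == orig.hostname
    if orig.scheme == "https" and parsed.scheme == "http" and same_host:
        # Internal link downgraded to http: restore the secure scheme.
        return urlunparse(parsed._replace(scheme="https"))
    return link  # external links and non-https crawls are left untouched


print(preserve_https("http://example.com/about", "https://example.com/"))
print(preserve_https("http://other.org/", "https://example.com/"))
```

External links are deliberately left alone here, since the flag's name refers to internal links only.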
||||
|
||||
---
|
||||
|
||||
## 10. Summary & Next Steps
|
||||
|
||||
@@ -89,6 +89,16 @@ ANTHROPIC_API_KEY=your-anthropic-key
|
||||
# TOGETHER_API_KEY=your-together-key
|
||||
# MISTRAL_API_KEY=your-mistral-key
|
||||
# GEMINI_API_TOKEN=your-gemini-token
|
||||
|
||||
# Optional: Global LLM settings
|
||||
# LLM_PROVIDER=openai/gpt-4o-mini
|
||||
# LLM_TEMPERATURE=0.7
|
||||
# LLM_BASE_URL=https://api.custom.com/v1
|
||||
|
||||
# Optional: Provider-specific overrides
|
||||
# OPENAI_TEMPERATURE=0.5
|
||||
# OPENAI_BASE_URL=https://custom-openai.com/v1
|
||||
# ANTHROPIC_TEMPERATURE=0.3
|
||||
EOL
|
||||
```
|
||||
> 🔑 **Note**: Keep your API keys secure! Never commit `.llm.env` to version control.
|
||||
@@ -156,27 +166,43 @@ cp deploy/docker/.llm.env.example .llm.env
|
||||
|
||||
**Flexible LLM Provider Configuration:**
|
||||
|
||||
The Docker setup now supports flexible LLM provider configuration through three methods:
|
||||
The Docker setup now supports flexible LLM provider configuration through a hierarchical system:
|
||||
|
||||
1. **Environment Variable** (Highest Priority): Set `LLM_PROVIDER` to override the default
|
||||
```bash
|
||||
export LLM_PROVIDER="anthropic/claude-3-opus"
|
||||
# Or in your .llm.env file:
|
||||
# LLM_PROVIDER=anthropic/claude-3-opus
|
||||
```
|
||||
|
||||
2. **API Request Parameter**: Specify provider per request
|
||||
1. **API Request Parameters** (Highest Priority): Specify per request
|
||||
```json
|
||||
{
|
||||
"url": "https://example.com",
|
||||
"f": "llm",
|
||||
"provider": "groq/mixtral-8x7b"
|
||||
"provider": "groq/mixtral-8x7b",
|
||||
"temperature": 0.7,
|
||||
"base_url": "https://api.custom.com/v1"
|
||||
}
|
||||
```
|
||||
|
||||
3. **Config File Default**: Falls back to `config.yml` (default: `openai/gpt-4o-mini`)
|
||||
2. **Provider-Specific Environment Variables**: Override for specific providers
|
||||
```bash
|
||||
# In your .llm.env file:
|
||||
OPENAI_TEMPERATURE=0.5
|
||||
OPENAI_BASE_URL=https://custom-openai.com/v1
|
||||
ANTHROPIC_TEMPERATURE=0.3
|
||||
```
|
||||
|
||||
The system automatically selects the appropriate API key based on the configured `api_key_env` in the config file.
|
||||
3. **Global Environment Variables**: Set defaults for all providers
|
||||
```bash
|
||||
# In your .llm.env file:
|
||||
LLM_PROVIDER=anthropic/claude-3-opus
|
||||
LLM_TEMPERATURE=0.7
|
||||
LLM_BASE_URL=https://api.proxy.com/v1
|
||||
```
|
||||
|
||||
4. **Config File Default**: Falls back to `config.yml` (default: `openai/gpt-4o-mini`)
|
||||
|
||||
The system automatically selects the appropriate API key based on the provider. LiteLLM looks up the matching environment variable for each provider (e.g., `OPENAI_API_KEY` for OpenAI, `GEMINI_API_TOKEN` for Google Gemini).
|
||||
|
||||
**Supported LLM Parameters:**
|
||||
- `provider`: LLM provider and model (e.g., "openai/gpt-4", "anthropic/claude-3-opus")
|
||||
- `temperature`: Controls randomness (0.0-2.0, lower = more focused, higher = more creative)
|
||||
- `base_url`: Custom API endpoint for proxy servers or alternative endpoints
|
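The precedence order described above (request parameter, then provider-specific variable, then global variable, then default) can be sketched as a small lookup. This is an illustration under assumptions, not the server's code: `resolve_temperature` is a hypothetical function, and deriving the variable prefix from the part of the provider string before the slash is a guess based on the `OPENAI_TEMPERATURE` and `ANTHROPIC_TEMPERATURE` examples.

```python
import os
from typing import Mapping, Optional


def resolve_temperature(
    provider: str,
    request_value: Optional[float] = None,
    env: Mapping[str, str] = os.environ,
    default: Optional[float] = None,
) -> Optional[float]:
    """Hypothetical resolver mirroring the documented precedence order."""
    if request_value is not None:
        return request_value  # 1. per-request parameter wins
    prefix = provider.split("/", 1)[0].upper()  # assumed: "openai/..." -> "OPENAI"
    for key in (f"{prefix}_TEMPERATURE", "LLM_TEMPERATURE"):
        if key in env:
            return float(env[key])  # 2. provider-specific, then 3. global
    return default  # 4. fallback when nothing is configured


env = {"OPENAI_TEMPERATURE": "0.5", "LLM_TEMPERATURE": "0.7"}
print(resolve_temperature("openai/gpt-4o-mini", env=env))
print(resolve_temperature("groq/mixtral-8x7b", env=env))
print(resolve_temperature("openai/gpt-4o-mini", request_value=0.2, env=env))
```

Passing `env` explicitly keeps the sketch easy to test without touching the real process environment.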
||||
|
||||
#### 3. Build and Run with Compose
|
||||
|
||||
@@ -555,6 +581,101 @@ Crucially, when sending configurations directly via JSON, they **must** follow t
|
||||
**LLM Extraction Strategy** *(Keep example, ensure schema uses type/value wrapper)*
|
||||
*(Keep Deep Crawler Example)*
|
||||
|
||||
### LLM Configuration Examples
|
||||
|
||||
The Docker API supports dynamic LLM configuration through multiple levels:
|
||||
|
||||
#### Temperature Control
|
||||
|
||||
Temperature affects the randomness of LLM responses (0.0 = deterministic, 2.0 = very creative):
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
# Low temperature for factual extraction
|
||||
response = requests.post(
|
||||
"http://localhost:11235/md",
|
||||
json={
|
||||
"url": "https://example.com",
|
||||
"f": "llm",
|
||||
"q": "Extract all dates and numbers from this page",
|
||||
"temperature": 0.2 # Very focused, deterministic
|
||||
}
|
||||
)
|
||||
|
||||
# High temperature for creative tasks
|
||||
response = requests.post(
|
||||
"http://localhost:11235/md",
|
||||
json={
|
||||
"url": "https://example.com",
|
||||
"f": "llm",
|
||||
"q": "Write a creative summary of this content",
|
||||
"temperature": 1.2 # More creative, varied responses
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
#### Custom API Endpoints
|
||||
|
||||
Use custom base URLs for proxy servers or alternative API endpoints:
|
||||
|
||||
```python
|
||||
|
||||
# Using a local LLM server
|
||||
response = requests.post(
|
||||
"http://localhost:11235/md",
|
||||
json={
|
||||
"url": "https://example.com",
|
||||
"f": "llm",
|
||||
"q": "Extract key information",
|
||||
"provider": "ollama/llama2",
|
||||
"base_url": "http://localhost:11434/v1"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
#### Dynamic Provider Selection
|
||||
|
||||
Switch between providers based on task requirements:
|
||||
|
||||
```python
|
||||
async def smart_extraction(url: str, content_type: str):
|
||||
"""Select provider and temperature based on content type"""
|
||||
|
||||
configs = {
|
||||
"technical": {
|
||||
"provider": "openai/gpt-4",
|
||||
"temperature": 0.3,
|
||||
"query": "Extract technical specifications and code examples"
|
||||
},
|
||||
"creative": {
|
||||
"provider": "anthropic/claude-3-opus",
|
||||
"temperature": 0.9,
|
||||
"query": "Create an engaging narrative summary"
|
||||
},
|
||||
"quick": {
|
||||
"provider": "groq/mixtral-8x7b",
|
||||
"temperature": 0.5,
|
||||
"query": "Quick summary in bullet points"
|
||||
}
|
||||
}
|
||||
|
||||
config = configs.get(content_type, configs["quick"])
|
||||
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/md",
|
||||
json={
|
||||
"url": url,
|
||||
"f": "llm",
|
||||
"q": config["query"],
|
||||
"provider": config["provider"],
|
||||
"temperature": config["temperature"]
|
||||
}
|
||||
)
|
||||
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### REST API Examples
|
||||
|
||||
Update URLs to use port `11235`.
|
||||
@@ -693,8 +814,8 @@ app:
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini" # Can be overridden by LLM_PROVIDER env var
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
# api_key: sk-... # If you pass the API key directly (not recommended)
|
||||
# temperature and base_url are controlled via environment variables or request parameters
|
||||
|
||||
# Redis Configuration (Used by internal Redis server managed by supervisord)
|
||||
redis:
|
||||
|
||||
@@ -79,7 +79,7 @@ if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
> IMPORTANT: By default cache mode is set to `CacheMode.ENABLED`. So to have fresh content, you need to set it to `CacheMode.BYPASS`
|
||||
> IMPORTANT: By default cache mode is set to `CacheMode.BYPASS` to have fresh content. Set `CacheMode.ENABLED` to enable caching.
|
||||
|
||||
We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
|
||||
|
||||
|
||||
@@ -102,16 +102,16 @@ async def smart_blog_crawler():
|
||||
|
||||
# Step 2: Configure discovery - let's find all blog posts
|
||||
config = SeedingConfig(
|
||||
source="sitemap", # Use the website's sitemap
|
||||
pattern="*/blog/*.html", # Only blog posts
|
||||
source="sitemap+cc", # Use the website's sitemap plus Common Crawl
|
||||
pattern="*/courses/*", # Only course-related pages
|
||||
extract_head=True, # Get page metadata
|
||||
max_urls=100 # Limit for this example
|
||||
)
|
||||
|
||||
# Step 3: Discover URLs from the Python blog
|
||||
print("🔍 Discovering blog posts...")
|
||||
print("🔍 Discovering course posts...")
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
print(f"✅ Found {len(urls)} blog posts")
|
||||
print(f"✅ Found {len(urls)} course posts")
|
||||
|
||||
# Step 4: Filter for Python tutorials (using metadata!)
|
||||
tutorials = [
|
||||
@@ -134,7 +134,8 @@ async def smart_blog_crawler():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
only_text=True,
|
||||
word_count_threshold=300 # Only substantial articles
|
||||
word_count_threshold=300, # Only substantial articles
|
||||
stream=True
|
||||
)
|
||||
|
||||
# Extract URLs and crawl them
|
||||
@@ -155,7 +156,7 @@ asyncio.run(smart_blog_crawler())
|
||||
|
||||
**What just happened?**
|
||||
|
||||
1. We discovered all blog URLs from the sitemap
|
||||
1. We discovered all blog URLs from the sitemap+cc
|
||||
2. We filtered using metadata (no crawling needed!)
|
||||
3. We crawled only the relevant tutorials
|
||||
4. We saved tons of time and bandwidth
|
||||
@@ -282,8 +283,8 @@ config = SeedingConfig(
|
||||
live_check=True, # Verify each URL is accessible
|
||||
concurrency=20 # Check 20 URLs in parallel
|
||||
)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# Now you can filter by status
|
||||
live_urls = [u for u in urls if u["status"] == "valid"]
|
||||
@@ -311,8 +312,8 @@ This is where URL seeding gets really powerful. Instead of crawling entire pages
|
||||
config = SeedingConfig(
|
||||
extract_head=True # Extract metadata from <head> section
|
||||
)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# Now each URL has rich metadata
|
||||
for url in urls[:3]:
|
||||
@@ -387,8 +388,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25",
|
||||
score_threshold=0.3
|
||||
)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# URLs are scored based on:
|
||||
# 1. Domain parts matching (e.g., 'python' in python.example.com)
|
||||
@@ -429,8 +430,8 @@ config = SeedingConfig(
|
||||
extract_head=True,
|
||||
live_check=True
|
||||
)
|
||||
|
||||
urls = await seeder.urls("blog.example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("blog.example.com", config)
|
||||
|
||||
# Analyze the results
|
||||
for url in urls[:5]:
|
||||
@@ -488,8 +489,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25", # Use BM25 algorithm
|
||||
score_threshold=0.3 # Minimum relevance score
|
||||
)
|
||||
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
|
||||
# Results are automatically sorted by relevance!
|
||||
for url in urls[:5]:
|
||||
@@ -511,8 +512,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.5,
|
||||
max_urls=20
|
||||
)
|
||||
|
||||
urls = await seeder.urls("docs.example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("docs.example.com", config)
|
||||
|
||||
# The highest scoring URLs will be API docs!
|
||||
```
|
||||
@@ -529,8 +530,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.4,
|
||||
pattern="*/product/*" # Combine with pattern matching
|
||||
)
|
||||
|
||||
urls = await seeder.urls("shop.example.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("shop.example.com", config)
|
||||
|
||||
# Filter further by price (from metadata)
|
||||
affordable = [
|
||||
@@ -550,8 +551,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25",
|
||||
score_threshold=0.35
|
||||
)
|
||||
|
||||
urls = await seeder.urls("technews.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("technews.com", config)
|
||||
|
||||
# Filter by date
|
||||
from datetime import datetime, timedelta
|
||||
@@ -591,8 +592,8 @@ for query in queries:
|
||||
score_threshold=0.4,
|
||||
max_urls=10 # Top 10 per topic
|
||||
)
|
||||
|
||||
urls = await seeder.urls("learning-platform.com", config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("learning-platform.com", config)
|
||||
all_tutorials.extend(urls)
|
||||
|
||||
# Remove duplicates while preserving order
|
||||
@@ -625,7 +626,8 @@ config = SeedingConfig(
|
||||
)
|
||||
|
||||
# Returns a dictionary: {domain: [urls]}
|
||||
results = await seeder.many_urls(domains, config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(domains, config)
|
||||
|
||||
# Process results
|
||||
for domain, urls in results.items():
|
||||
@@ -654,8 +656,8 @@ config = SeedingConfig(
|
||||
pattern="*/blog/*",
|
||||
max_urls=100
|
||||
)
|
||||
|
||||
results = await seeder.many_urls(competitors, config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(competitors, config)
|
||||
|
||||
# Analyze content types
|
||||
for domain, urls in results.items():
|
||||
@@ -690,8 +692,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.3,
|
||||
max_urls=20 # Per site
|
||||
)
|
||||
|
||||
results = await seeder.many_urls(educational_sites, config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(educational_sites, config)
|
||||
|
||||
# Find the best beginner tutorials
|
||||
all_tutorials = []
|
||||
@@ -731,8 +733,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.5, # High threshold for relevance
|
||||
max_urls=10
|
||||
)
|
||||
|
||||
results = await seeder.many_urls(news_sites, config)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(news_sites, config)
|
||||
|
||||
# Collect all mentions
|
||||
mentions = []
|
||||
|
||||
@@ -7,7 +7,7 @@ name = "Crawl4AI"
|
||||
dynamic = ["version"]
|
||||
description = "🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.9"
|
||||
requires-python = ">=3.10"
|
||||
license = "Apache-2.0"
|
||||
authors = [
|
||||
{name = "Unclecode", email = "unclecode@kidocode.com"}
|
||||
@@ -36,6 +36,7 @@ dependencies = [
|
||||
"PyYAML>=6.0",
|
||||
"nltk>=3.9.1",
|
||||
"rich>=13.9.4",
|
||||
"cssselect>=1.2.0",
|
||||
"httpx>=0.27.2",
|
||||
"httpx[http2]>=0.27.2",
|
||||
"fake-useragent>=2.0.3",
|
||||
@@ -51,7 +52,6 @@ classifiers = [
|
||||
"Development Status :: 4 - Beta",
|
||||
"Intended Audience :: Developers",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
"Programming Language :: Python :: 3.11",
|
||||
"Programming Language :: Python :: 3.12",
|
||||
|
||||
@@ -24,6 +24,7 @@ psutil>=6.1.1
|
||||
PyYAML>=6.0
|
||||
nltk>=3.9.1
|
||||
rich>=13.9.4
|
||||
cssselect>=1.2.0
|
||||
chardet>=5.2.0
|
||||
brotli>=1.1.0
|
||||
httpx[http2]>=0.27.2
|
||||
|
||||
3
setup.py
@@ -56,11 +56,10 @@ setup(
|
||||
"Development Status :: 3 - Alpha",
|
||||
"Intended Audience :: Developers",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.9",
|
||||
"Programming Language :: Python :: 3.10",
|
||||
"Programming Language :: Python :: 3.11",
|
||||
"Programming Language :: Python :: 3.12",
|
||||
"Programming Language :: Python :: 3.13",
|
||||
],
|
||||
python_requires=">=3.9",
|
||||
python_requires=">=3.10",
|
||||
)
|
||||
|
||||
@@ -1,401 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test script to validate webhook implementation for /llm/job endpoint.
|
||||
|
||||
This tests that the /llm/job endpoint now supports webhooks
|
||||
following the same pattern as /crawl/job.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
# Add deploy/docker to path
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'deploy', 'docker'))
|
||||
|
||||
def test_llm_job_payload_model():
|
||||
"""Test that LlmJobPayload includes webhook_config field"""
|
||||
print("=" * 60)
|
||||
print("TEST 1: LlmJobPayload Model")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from job import LlmJobPayload
|
||||
from schemas import WebhookConfig
|
||||
from pydantic import ValidationError
|
||||
|
||||
# Test with webhook_config
|
||||
payload_dict = {
|
||||
"url": "https://example.com",
|
||||
"q": "Extract main content",
|
||||
"schema": None,
|
||||
"cache": False,
|
||||
"provider": None,
|
||||
"webhook_config": {
|
||||
"webhook_url": "https://myapp.com/webhook",
|
||||
"webhook_data_in_payload": True,
|
||||
"webhook_headers": {"X-Secret": "token"}
|
||||
}
|
||||
}
|
||||
|
||||
payload = LlmJobPayload(**payload_dict)
|
||||
|
||||
print(f"✅ LlmJobPayload accepts webhook_config")
|
||||
print(f" - URL: {payload.url}")
|
||||
print(f" - Query: {payload.q}")
|
||||
print(f" - Webhook URL: {payload.webhook_config.webhook_url}")
|
||||
print(f" - Data in payload: {payload.webhook_config.webhook_data_in_payload}")
|
||||
|
||||
# Test without webhook_config (should be optional)
|
||||
minimal_payload = {
|
||||
"url": "https://example.com",
|
||||
"q": "Extract content"
|
||||
}
|
||||
|
||||
payload2 = LlmJobPayload(**minimal_payload)
|
||||
assert payload2.webhook_config is None, "webhook_config should be optional"
|
||||
print(f"✅ LlmJobPayload works without webhook_config (optional)")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_handle_llm_request_signature():
|
||||
"""Test that handle_llm_request accepts webhook_config parameter"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 2: handle_llm_request Function Signature")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from api import handle_llm_request
|
||||
import inspect
|
||||
|
||||
sig = inspect.signature(handle_llm_request)
|
||||
params = list(sig.parameters.keys())
|
||||
|
||||
print(f"Function parameters: {params}")
|
||||
|
||||
if 'webhook_config' in params:
|
||||
print(f"✅ handle_llm_request has webhook_config parameter")
|
||||
|
||||
# Check that it's optional with default None
|
||||
webhook_param = sig.parameters['webhook_config']
|
||||
if webhook_param.default is None or webhook_param.default == inspect.Parameter.empty:
|
||||
print(f"✅ webhook_config is optional (default: {webhook_param.default})")
|
||||
else:
|
||||
print(f"⚠️ webhook_config default is: {webhook_param.default}")
|
||||
|
||||
return True
|
||||
else:
|
||||
print(f"❌ handle_llm_request missing webhook_config parameter")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_process_llm_extraction_signature():
|
||||
"""Test that process_llm_extraction accepts webhook_config parameter"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 3: process_llm_extraction Function Signature")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from api import process_llm_extraction
|
||||
import inspect
|
||||
|
||||
sig = inspect.signature(process_llm_extraction)
|
||||
params = list(sig.parameters.keys())
|
||||
|
||||
print(f"Function parameters: {params}")
|
||||
|
||||
if 'webhook_config' in params:
|
||||
print(f"✅ process_llm_extraction has webhook_config parameter")
|
||||
|
||||
webhook_param = sig.parameters['webhook_config']
|
||||
if webhook_param.default is None or webhook_param.default == inspect.Parameter.empty:
|
||||
print(f"✅ webhook_config is optional (default: {webhook_param.default})")
|
||||
else:
|
||||
print(f"⚠️ webhook_config default is: {webhook_param.default}")
|
||||
|
||||
return True
|
||||
else:
|
||||
print(f"❌ process_llm_extraction missing webhook_config parameter")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_webhook_integration_in_api():
|
||||
"""Test that api.py properly integrates webhook notifications"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 4: Webhook Integration in process_llm_extraction")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
|
||||
|
||||
with open(api_file, 'r') as f:
|
||||
api_content = f.read()
|
||||
|
||||
# Check for WebhookDeliveryService initialization
|
||||
if 'webhook_service = WebhookDeliveryService(config)' in api_content:
|
||||
print("✅ process_llm_extraction initializes WebhookDeliveryService")
|
||||
else:
|
||||
print("❌ Missing WebhookDeliveryService initialization in process_llm_extraction")
|
||||
return False
|
||||
|
||||
# Check for notify_job_completion calls with llm_extraction
|
||||
if 'task_type="llm_extraction"' in api_content:
|
||||
print("✅ Uses correct task_type='llm_extraction' for notifications")
|
||||
else:
|
||||
print("❌ Missing task_type='llm_extraction' in webhook notifications")
|
||||
return False
|
||||
|
||||
# Count webhook notification calls (should have at least 3: success + 2 failure paths)
|
||||
notification_count = api_content.count('await webhook_service.notify_job_completion')
|
||||
# Find only in process_llm_extraction function
|
||||
llm_func_start = api_content.find('async def process_llm_extraction')
|
||||
llm_func_end = api_content.find('\nasync def ', llm_func_start + 1)
|
||||
if llm_func_end == -1:
|
||||
llm_func_end = len(api_content)
|
||||
|
||||
llm_func_content = api_content[llm_func_start:llm_func_end]
|
||||
llm_notification_count = llm_func_content.count('await webhook_service.notify_job_completion')
|
||||
|
||||
print(f"✅ Found {llm_notification_count} webhook notification calls in process_llm_extraction")
|
||||
|
||||
if llm_notification_count >= 3:
|
||||
print(f"✅ Sufficient notification points (success + failure paths)")
|
||||
else:
|
||||
print(f"⚠️ Expected at least 3 notification calls, found {llm_notification_count}")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_job_endpoint_integration():
|
||||
"""Test that /llm/job endpoint extracts and passes webhook_config"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 5: /llm/job Endpoint Integration")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
job_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'job.py')
|
||||
|
||||
with open(job_file, 'r') as f:
|
||||
job_content = f.read()
|
||||
|
||||
# Find the llm_job_enqueue function
|
||||
llm_job_start = job_content.find('async def llm_job_enqueue')
|
||||
llm_job_end = job_content.find('\n\n@router', llm_job_start + 1)
|
||||
if llm_job_end == -1:
|
||||
llm_job_end = job_content.find('\n\nasync def', llm_job_start + 1)
|
||||
|
||||
llm_job_func = job_content[llm_job_start:llm_job_end]
|
||||
|
||||
# Check for webhook_config extraction
|
||||
if 'webhook_config = None' in llm_job_func:
|
||||
print("✅ llm_job_enqueue initializes webhook_config variable")
|
||||
else:
|
||||
print("❌ Missing webhook_config initialization")
|
||||
return False
|
||||
|
||||
if 'if payload.webhook_config:' in llm_job_func:
|
||||
print("✅ llm_job_enqueue checks for payload.webhook_config")
|
||||
else:
|
||||
print("❌ Missing webhook_config check")
|
||||
return False
|
||||
|
||||
if 'webhook_config = payload.webhook_config.model_dump(mode=\'json\')' in llm_job_func:
|
||||
print("✅ llm_job_enqueue converts webhook_config to dict")
|
||||
else:
|
||||
print("❌ Missing webhook_config.model_dump conversion")
|
||||
return False
|
||||
|
||||
if 'webhook_config=webhook_config' in llm_job_func:
|
||||
print("✅ llm_job_enqueue passes webhook_config to handle_llm_request")
|
||||
else:
|
||||
print("❌ Missing webhook_config parameter in handle_llm_request call")
|
||||
return False
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_create_new_task_integration():
|
||||
"""Test that create_new_task stores webhook_config in Redis"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 6: create_new_task Webhook Storage")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
|
||||
|
||||
with open(api_file, 'r') as f:
|
||||
api_content = f.read()
|
||||
|
||||
# Find create_new_task function
|
||||
create_task_start = api_content.find('async def create_new_task')
|
||||
create_task_end = api_content.find('\nasync def ', create_task_start + 1)
|
||||
if create_task_end == -1:
|
||||
create_task_end = len(api_content)
|
||||
|
||||
create_task_func = api_content[create_task_start:create_task_end]
|
||||
|
||||
# Check for webhook_config storage
|
||||
if 'if webhook_config:' in create_task_func:
|
||||
print("✅ create_new_task checks for webhook_config")
|
||||
else:
|
||||
print("❌ Missing webhook_config check in create_new_task")
|
||||
return False
|
||||
|
||||
if 'task_data["webhook_config"] = json.dumps(webhook_config)' in create_task_func:
|
||||
print("✅ create_new_task stores webhook_config in Redis task data")
|
||||
else:
|
||||
print("❌ Missing webhook_config storage in task_data")
|
||||
return False
|
||||
|
||||
# Check that webhook_config is passed to process_llm_extraction
|
||||
if 'webhook_config' in create_task_func and 'background_tasks.add_task' in create_task_func:
|
||||
print("✅ create_new_task passes webhook_config to background task")
|
||||
else:
|
||||
print("⚠️ Could not verify webhook_config passed to background task")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_pattern_consistency():
|
||||
"""Test that /llm/job follows the same pattern as /crawl/job"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 7: Pattern Consistency with /crawl/job")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
api_file = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
|
||||
|
||||
with open(api_file, 'r') as f:
|
||||
api_content = f.read()
|
||||
|
||||
# Find handle_crawl_job to compare pattern
|
||||
crawl_job_start = api_content.find('async def handle_crawl_job')
|
||||
crawl_job_end = api_content.find('\nasync def ', crawl_job_start + 1)
|
||||
if crawl_job_end == -1:
|
||||
crawl_job_end = len(api_content)
|
||||
crawl_job_func = api_content[crawl_job_start:crawl_job_end]
|
||||
|
||||
# Find process_llm_extraction
|
||||
llm_extract_start = api_content.find('async def process_llm_extraction')
|
||||
llm_extract_end = api_content.find('\nasync def ', llm_extract_start + 1)
|
||||
if llm_extract_end == -1:
|
||||
llm_extract_end = len(api_content)
|
||||
llm_extract_func = api_content[llm_extract_start:llm_extract_end]
|
||||
|
||||
print("Checking pattern consistency...")
|
||||
|
||||
# Both should initialize WebhookDeliveryService
|
||||
crawl_has_service = 'webhook_service = WebhookDeliveryService(config)' in crawl_job_func
|
||||
llm_has_service = 'webhook_service = WebhookDeliveryService(config)' in llm_extract_func
|
||||
|
||||
if crawl_has_service and llm_has_service:
|
||||
print("✅ Both initialize WebhookDeliveryService")
|
||||
else:
|
||||
print(f"❌ Service initialization mismatch (crawl: {crawl_has_service}, llm: {llm_has_service})")
|
||||
return False
|
||||
|
||||
# Both should call notify_job_completion on success
|
||||
crawl_notifies_success = 'status="completed"' in crawl_job_func and 'notify_job_completion' in crawl_job_func
|
||||
llm_notifies_success = 'status="completed"' in llm_extract_func and 'notify_job_completion' in llm_extract_func
|
||||
|
||||
if crawl_notifies_success and llm_notifies_success:
|
||||
print("✅ Both notify on success")
|
||||
else:
|
||||
print(f"❌ Success notification mismatch (crawl: {crawl_notifies_success}, llm: {llm_notifies_success})")
|
||||
return False
|
||||
|
||||
# Both should call notify_job_completion on failure
|
||||
crawl_notifies_failure = 'status="failed"' in crawl_job_func and 'error=' in crawl_job_func
|
||||
llm_notifies_failure = 'status="failed"' in llm_extract_func and 'error=' in llm_extract_func
|
||||
|
||||
if crawl_notifies_failure and llm_notifies_failure:
|
||||
print("✅ Both notify on failure")
|
||||
else:
|
||||
print(f"❌ Failure notification mismatch (crawl: {crawl_notifies_failure}, llm: {llm_notifies_failure})")
|
||||
return False
|
||||
|
||||
print("✅ /llm/job follows the same pattern as /crawl/job")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Run all tests"""
|
||||
print("\n🧪 LLM Job Webhook Feature Validation")
|
||||
print("=" * 60)
|
||||
print("Testing that /llm/job now supports webhooks like /crawl/job")
|
||||
print("=" * 60 + "\n")
|
||||
|
||||
results = []
|
||||
|
||||
# Run all tests
|
||||
results.append(("LlmJobPayload Model", test_llm_job_payload_model()))
|
||||
results.append(("handle_llm_request Signature", test_handle_llm_request_signature()))
|
||||
results.append(("process_llm_extraction Signature", test_process_llm_extraction_signature()))
|
||||
results.append(("Webhook Integration", test_webhook_integration_in_api()))
|
||||
results.append(("/llm/job Endpoint", test_job_endpoint_integration()))
|
||||
results.append(("create_new_task Storage", test_create_new_task_integration()))
|
||||
results.append(("Pattern Consistency", test_pattern_consistency()))
|
||||
|
||||
# Print summary
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST SUMMARY")
|
||||
print("=" * 60)
|
||||
|
||||
passed = sum(1 for _, result in results if result)
|
||||
total = len(results)
|
||||
|
||||
for test_name, result in results:
|
||||
status = "✅ PASS" if result else "❌ FAIL"
|
||||
print(f"{status} - {test_name}")
|
||||
|
||||
print(f"\n{'=' * 60}")
|
||||
print(f"Results: {passed}/{total} tests passed")
|
||||
print(f"{'=' * 60}")
|
||||
|
||||
if passed == total:
|
||||
print("\n🎉 All tests passed! /llm/job webhook feature is correctly implemented.")
|
||||
print("\n📝 Summary of changes:")
|
||||
print(" 1. LlmJobPayload model includes webhook_config field")
|
||||
print(" 2. /llm/job endpoint extracts and passes webhook_config")
|
||||
print(" 3. handle_llm_request accepts webhook_config parameter")
|
||||
print(" 4. create_new_task stores webhook_config in Redis")
|
||||
print(" 5. process_llm_extraction sends webhook notifications")
|
||||
print(" 6. Follows the same pattern as /crawl/job")
|
||||
return 0
|
||||
else:
|
||||
print(f"\n⚠️ {total - passed} test(s) failed. Please review the output above.")
|
||||
return 1
|
||||
|
||||
if __name__ == "__main__":
|
||||
exit(main())
|
||||
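One detail in the signature checks above: `inspect.Parameter.empty` marks a parameter with no default, i.e. a required one, so accepting it alongside `None` reports a required `webhook_config` as optional. A stricter check could look like the sketch below. The helper `has_none_default` and the three stand-in handlers are hypothetical and are not part of `api.py`.

```python
import inspect


def has_none_default(func, name):
    """Return True only if `func` has a parameter `name` whose default is None."""
    param = inspect.signature(func).parameters.get(name)
    if param is None:
        return False  # the parameter is missing entirely
    # Parameter.empty means "no default", i.e. the argument is required.
    return param.default is None


# Hypothetical stand-ins for the real handlers.
async def optional_handler(url, webhook_config=None):
    return url


async def required_handler(url, webhook_config):
    return url


def no_webhook_handler(url):
    return url


print(has_none_default(optional_handler, "webhook_config"))    # True
print(has_none_default(required_handler, "webhook_config"))    # False
print(has_none_default(no_webhook_handler, "webhook_config"))  # False
```

Only the first handler passes, which matches the intent stated in the test comment ("optional with default None").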
@@ -1,307 +0,0 @@
|
||||
"""
|
||||
Simple test script to validate webhook implementation without running full server.
|
||||
|
||||
This script tests:
|
||||
1. Webhook module imports and syntax
|
||||
2. WebhookDeliveryService initialization
|
||||
3. Payload construction logic
|
||||
4. Configuration parsing
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import json
|
||||
from datetime import datetime, timezone
|
||||
|
||||
# Add deploy/docker to path to import modules
|
||||
# sys.path.insert(0, '/home/user/crawl4ai/deploy/docker')
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'deploy', 'docker'))
|
||||
|
||||
def test_imports():
|
||||
"""Test that all webhook-related modules can be imported"""
|
||||
print("=" * 60)
|
||||
print("TEST 1: Module Imports")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from webhook import WebhookDeliveryService
|
||||
print("✅ webhook.WebhookDeliveryService imported successfully")
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to import webhook module: {e}")
|
||||
return False
|
||||
|
||||
try:
|
||||
from schemas import WebhookConfig, WebhookPayload
|
||||
print("✅ schemas.WebhookConfig imported successfully")
|
||||
print("✅ schemas.WebhookPayload imported successfully")
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to import schemas: {e}")
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def test_webhook_service_init():
|
||||
"""Test WebhookDeliveryService initialization"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 2: WebhookDeliveryService Initialization")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from webhook import WebhookDeliveryService
|
||||
|
||||
# Test with default config
|
||||
config = {
|
||||
"webhooks": {
|
||||
"enabled": True,
|
||||
"default_url": None,
|
||||
"data_in_payload": False,
|
||||
"retry": {
|
||||
"max_attempts": 5,
|
||||
"initial_delay_ms": 1000,
|
||||
"max_delay_ms": 32000,
|
||||
"timeout_ms": 30000
|
||||
},
|
||||
"headers": {
|
||||
"User-Agent": "Crawl4AI-Webhook/1.0"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
service = WebhookDeliveryService(config)
|
||||
|
||||
print(f"✅ Service initialized successfully")
|
||||
print(f" - Max attempts: {service.max_attempts}")
|
||||
print(f" - Initial delay: {service.initial_delay}s")
|
||||
print(f" - Max delay: {service.max_delay}s")
|
||||
print(f" - Timeout: {service.timeout}s")
|
||||
|
||||
# Verify calculations
|
||||
assert service.max_attempts == 5, "Max attempts should be 5"
|
||||
assert service.initial_delay == 1.0, "Initial delay should be 1.0s"
|
||||
assert service.max_delay == 32.0, "Max delay should be 32.0s"
|
||||
assert service.timeout == 30.0, "Timeout should be 30.0s"
|
||||
|
||||
print("✅ All configuration values correct")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Service initialization failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_webhook_config_model():
|
||||
"""Test WebhookConfig Pydantic model"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 3: WebhookConfig Model Validation")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
from schemas import WebhookConfig
|
||||
from pydantic import ValidationError
|
||||
|
||||
# Test valid config
|
||||
valid_config = {
|
||||
"webhook_url": "https://example.com/webhook",
|
||||
"webhook_data_in_payload": True,
|
||||
"webhook_headers": {"X-Secret": "token123"}
|
||||
}
|
||||
|
||||
config = WebhookConfig(**valid_config)
|
||||
print(f"✅ Valid config accepted:")
|
||||
print(f" - URL: {config.webhook_url}")
|
||||
print(f" - Data in payload: {config.webhook_data_in_payload}")
|
||||
print(f" - Headers: {config.webhook_headers}")
|
||||
|
||||
# Test minimal config
|
||||
minimal_config = {
|
||||
"webhook_url": "https://example.com/webhook"
|
||||
}
|
||||
|
||||
config2 = WebhookConfig(**minimal_config)
|
||||
print(f"✅ Minimal config accepted (defaults applied):")
|
||||
print(f" - URL: {config2.webhook_url}")
|
||||
print(f" - Data in payload: {config2.webhook_data_in_payload}")
|
||||
print(f" - Headers: {config2.webhook_headers}")
|
||||
|
||||
# Test invalid URL
|
||||
try:
|
||||
invalid_config = {
|
||||
"webhook_url": "not-a-url"
|
||||
}
|
||||
config3 = WebhookConfig(**invalid_config)
|
||||
print(f"❌ Invalid URL should have been rejected")
|
||||
return False
|
||||
except ValidationError as e:
|
||||
print(f"✅ Invalid URL correctly rejected")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Model validation test failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_payload_construction():
|
||||
"""Test webhook payload construction logic"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 4: Payload Construction")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
# Simulate payload construction from notify_job_completion
|
||||
task_id = "crawl_abc123"
|
||||
task_type = "crawl"
|
||||
status = "completed"
|
||||
urls = ["https://example.com"]
|
||||
|
||||
payload = {
|
||||
"task_id": task_id,
|
||||
"task_type": task_type,
|
||||
"status": status,
|
||||
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||
"urls": urls
|
||||
}
|
||||
|
||||
print(f"✅ Basic payload constructed:")
|
||||
print(json.dumps(payload, indent=2))
|
||||
|
||||
# Test with error
|
||||
error_payload = {
|
||||
"task_id": "crawl_xyz789",
|
||||
"task_type": "crawl",
|
||||
"status": "failed",
|
||||
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||
"urls": ["https://example.com"],
|
||||
"error": "Connection timeout"
|
||||
}
|
||||
|
||||
print(f"\n✅ Error payload constructed:")
|
||||
print(json.dumps(error_payload, indent=2))
|
||||
|
||||
# Test with data
|
||||
data_payload = {
|
||||
"task_id": "crawl_def456",
|
||||
"task_type": "crawl",
|
||||
"status": "completed",
|
||||
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||
"urls": ["https://example.com"],
|
||||
"data": {
|
||||
"results": [
|
||||
{"url": "https://example.com", "markdown": "# Example"}
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
print(f"\n✅ Data payload constructed:")
|
||||
print(json.dumps(data_payload, indent=2))
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Payload construction failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
def test_exponential_backoff():
|
||||
"""Test exponential backoff calculation"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TEST 5: Exponential Backoff Calculation")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
initial_delay = 1.0 # 1 second
|
||||
max_delay = 32.0 # 32 seconds
|
||||
|
||||
print("Backoff delays for 5 attempts:")
|
||||
for attempt in range(5):
|
||||
delay = min(initial_delay * (2 ** attempt), max_delay)
|
||||
print(f" Attempt {attempt + 1}: {delay}s")
|
||||
|
||||
# Verify the sequence: 1s, 2s, 4s, 8s, 16s
|
||||
expected = [1.0, 2.0, 4.0, 8.0, 16.0]
|
||||
actual = [min(initial_delay * (2 ** i), max_delay) for i in range(5)]
|
||||
|
||||
assert actual == expected, f"Expected {expected}, got {actual}"
|
||||
print("✅ Exponential backoff sequence correct")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Backoff calculation failed: {e}")
|
||||
return False
|
||||
|
||||
def test_api_integration():
    """Test that api.py imports webhook module correctly"""
    print("\n" + "=" * 60)
    print("TEST 6: API Integration")
    print("=" * 60)

    try:
        # Check if api.py can import webhook module
        api_path = os.path.join(os.path.dirname(__file__), 'deploy', 'docker', 'api.py')
        with open(api_path, 'r') as f:
            api_content = f.read()

        if 'from webhook import WebhookDeliveryService' in api_content:
            print("✅ api.py imports WebhookDeliveryService")
        else:
            print("❌ api.py missing webhook import")
            return False

        if 'WebhookDeliveryService(config)' in api_content:
            print("✅ api.py initializes WebhookDeliveryService")
        else:
            print("❌ api.py doesn't initialize WebhookDeliveryService")
            return False

        if 'notify_job_completion' in api_content:
            print("✅ api.py calls notify_job_completion")
        else:
            print("❌ api.py doesn't call notify_job_completion")
            return False

        return True
    except Exception as e:
        print(f"❌ API integration check failed: {e}")
        return False

def main():
    """Run all tests"""
    print("\n🧪 Webhook Implementation Validation Tests")
    print("=" * 60)

    results = []

    # Run tests
    results.append(("Module Imports", test_imports()))
    results.append(("Service Initialization", test_webhook_service_init()))
    results.append(("Config Model", test_webhook_config_model()))
    results.append(("Payload Construction", test_payload_construction()))
    results.append(("Exponential Backoff", test_exponential_backoff()))
    results.append(("API Integration", test_api_integration()))

    # Print summary
    print("\n" + "=" * 60)
    print("TEST SUMMARY")
    print("=" * 60)

    passed = sum(1 for _, result in results if result)
    total = len(results)

    for test_name, result in results:
        status = "✅ PASS" if result else "❌ FAIL"
        print(f"{status} - {test_name}")

    print(f"\n{'=' * 60}")
    print(f"Results: {passed}/{total} tests passed")
    print(f"{'=' * 60}")

    if passed == total:
        print("\n🎉 All tests passed! Webhook implementation is valid.")
        return 0
    else:
        print(f"\n⚠️ {total - passed} test(s) failed. Please review the output above.")
        return 1

if __name__ == "__main__":
    exit(main())
@@ -1,251 +0,0 @@
|
||||
# Webhook Feature Test Script
|
||||
|
||||
This directory contains a comprehensive test script for the webhook feature implementation.
|
||||
|
||||
## Overview
|
||||
|
||||
The `test_webhook_feature.sh` script automates the entire process of testing the webhook feature:
|
||||
|
||||
1. ✅ Fetches and switches to the webhook feature branch
|
||||
2. ✅ Activates the virtual environment
|
||||
3. ✅ Installs all required dependencies
|
||||
4. ✅ Starts Redis server in background
|
||||
5. ✅ Starts Crawl4AI server in background
|
||||
6. ✅ Runs webhook integration test
|
||||
7. ✅ Verifies job completion via webhook
|
||||
8. ✅ Cleans up and returns to original branch
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.10+
|
||||
- Virtual environment already created (`venv/` in project root)
|
||||
- Git repository with the webhook feature branch
|
||||
- `redis-server` (script will attempt to install if missing)
|
||||
- `curl` and `lsof` commands available
|
||||
|
||||
## Usage
|
||||
|
||||
### Quick Start
|
||||
|
||||
From the project root:
|
||||
|
||||
```bash
./tests/test_webhook_feature.sh
```
|
||||
|
||||
Or from the tests directory:
|
||||
|
||||
```bash
cd tests
./test_webhook_feature.sh
```
|
||||
|
||||
### What the Script Does
|
||||
|
||||
#### Step 1: Branch Management
|
||||
- Saves your current branch
|
||||
- Fetches the webhook feature branch from remote
|
||||
- Switches to the webhook feature branch
|
||||
|
||||
#### Step 2: Environment Setup
|
||||
- Activates your existing virtual environment
|
||||
- Installs dependencies from `deploy/docker/requirements.txt`
|
||||
- Installs Flask for the webhook receiver
|
||||
|
||||
#### Step 3: Service Startup
|
||||
- Starts Redis server on port 6379
|
||||
- Starts Crawl4AI server on port 11235
|
||||
- Waits for server health check to pass
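
The wait-for-health step could be sketched in Python roughly as follows. This is an illustration under assumptions, not the script's actual code: `http_ok` and `wait_until_ready` are hypothetical helper names, and the `/health` path on port 11235 is assumed to be the server's health endpoint.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(check, timeout=30.0, interval=1.0):
    """Call `check()` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_until_ready(lambda: http_ok("http://localhost:11235/health"))` would then block until the server answers or the timeout expires.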
|
||||
|
||||
#### Step 4: Webhook Test
|
||||
- Creates a webhook receiver on port 8080
|
||||
- Submits a crawl job for `https://example.com` with webhook config
|
||||
- Waits for webhook notification (60s timeout)
|
||||
- Verifies webhook payload contains expected data
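
For readers who want to see what a receiver for this step might look like, here is a minimal sketch. The script itself installs Flask for its receiver; this version deliberately swaps in the standard library's `http.server` so it runs without extra packages. The names `WebhookHandler`, `received`, and `start_receiver` are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # JSON payloads collected by the receiver, in arrival order


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            received.append(json.loads(body))
            self.send_response(200)
        except json.JSONDecodeError:
            self.send_response(400)
        self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def start_receiver(port=8080):
    """Start the receiver in a daemon thread and return the server object."""
    server = HTTPServer(("127.0.0.1", port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `server.shutdown()` on the returned object stops the receiver; a test would typically poll `received` until a payload arrives or a timeout (60 seconds in this script) expires.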
|
||||
|
||||
#### Step 5: Cleanup
|
||||
- Stops webhook receiver
|
||||
- Stops Crawl4AI server
|
||||
- Stops Redis server
|
||||
- Returns to your original branch
|
||||
|
||||
## Expected Output
|
||||
|
||||
```
[INFO] Starting webhook feature test script
[INFO] Project root: /path/to/crawl4ai
[INFO] Step 1: Fetching PR branch...
[INFO] Current branch: develop
[SUCCESS] Branch fetched
[INFO] Step 2: Switching to branch: claude/implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp
[SUCCESS] Switched to webhook feature branch
[INFO] Step 3: Activating virtual environment...
[SUCCESS] Virtual environment activated
[INFO] Step 4: Installing server dependencies...
[SUCCESS] Dependencies installed
[INFO] Step 5a: Starting Redis...
[SUCCESS] Redis started (PID: 12345)
[INFO] Step 5b: Starting server on port 11235...
[INFO] Server started (PID: 12346)
[INFO] Waiting for server to be ready...
[SUCCESS] Server is ready!
[INFO] Step 6: Creating webhook test script...
[INFO] Running webhook test...

🚀 Submitting crawl job with webhook...
✅ Job submitted successfully, task_id: crawl_abc123
⏳ Waiting for webhook notification...

✅ Webhook received: {
  "task_id": "crawl_abc123",
  "task_type": "crawl",
  "status": "completed",
  "timestamp": "2025-10-22T00:00:00.000000+00:00",
  "urls": ["https://example.com"],
  "data": { ... }
}

✅ Webhook received!
Task ID: crawl_abc123
Status: completed
URLs: ['https://example.com']
✅ Data included in webhook payload
📄 Crawled 1 URL(s)
- https://example.com: 1234 chars

🎉 Webhook test PASSED!

[INFO] Step 7: Verifying test results...
[SUCCESS] ✅ Webhook test PASSED!
[SUCCESS] All tests completed successfully! 🎉
[INFO] Cleanup will happen automatically...
[INFO] Starting cleanup...
[INFO] Stopping webhook receiver...
[INFO] Stopping server...
[INFO] Stopping Redis...
[INFO] Switching back to branch: develop
[SUCCESS] Cleanup complete
```
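
The payload shown in the sample output suggests a simple structural check. The sketch below is one way to write it, assuming the field names from that sample (`task_id`, `task_type`, `status`, `timestamp`, `urls`, `data`); `validate_webhook_payload` and its `expect_data` flag are hypothetical and not part of the script.

```python
from datetime import datetime

# Fields that appear in the sample webhook payload.
REQUIRED_FIELDS = ("task_id", "task_type", "status", "timestamp", "urls")


def validate_webhook_payload(payload, expect_data=True):
    """Return a list of problems; an empty list means the payload looks as expected."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if problems:
        return problems

    if not isinstance(payload["urls"], list) or not payload["urls"]:
        problems.append("urls must be a non-empty list")

    try:
        datetime.fromisoformat(payload["timestamp"])
    except (TypeError, ValueError):
        problems.append("timestamp is not an ISO 8601 string")

    # Only a completed job is expected to carry crawl data.
    if expect_data and payload["status"] == "completed" and "data" not in payload:
        problems.append("completed payload has no data field")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to print every mismatch when a test fails.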
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Server Failed to Start
|
||||
|
||||
If the server fails to start, check the logs:
|
||||
|
||||
```bash
tail -100 /tmp/crawl4ai_server.log
```
|
||||
|
||||
Common issues:
|
||||
- Port 11235 already in use: `lsof -ti:11235 | xargs kill -9`
|
||||
- Missing dependencies: Check that all packages are installed
|
||||
|
||||
### Redis Connection Failed
|
||||
|
||||
Check if Redis is running:
|
||||
|
||||
```bash
redis-cli ping
# Should return: PONG
```
|
||||
|
||||
If not running:
|
||||
|
||||
```bash
redis-server --port 6379 --daemonize yes
```
|
||||
|
||||
### Webhook Not Received
|
||||
|
||||
The script has a 60-second timeout for webhook delivery. If the webhook isn't received:
|
||||
|
||||
1. Check server logs: `/tmp/crawl4ai_server.log`
|
||||
2. Verify webhook receiver is running on port 8080
|
||||
3. Check network connectivity between components
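
A quick way to check items 2 and 3 is to probe the three ports the script uses. The following is a small sketch using only the standard library; `port_open`, `check_services`, and the `SERVICES` mapping are hypothetical names, and the host is assumed to be the local machine.

```python
import socket

# Ports used by the test script, as listed in this README.
SERVICES = {"Crawl4AI server": 11235, "Webhook receiver": 8080, "Redis": 6379}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1"):
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

An open port only shows that something is listening; it does not confirm that the listener is the expected service, so the server log remains the first place to look.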
|
||||
|
||||
### Script Interruption
|
||||
|
||||
If the script is interrupted (Ctrl+C), cleanup happens automatically via trap. The script will:
|
||||
- Kill all background processes
|
||||
- Stop Redis
|
||||
- Return to your original branch
|
||||
|
||||
To clean up manually if needed:

```bash
# Kill processes by port
lsof -ti:11235 | xargs kill -9  # Server
lsof -ti:8080 | xargs kill -9   # Webhook receiver
lsof -ti:6379 | xargs kill -9   # Redis

# Return to your branch
git checkout develop  # or your branch name
```
|
||||
|
||||
## Testing Different URLs
|
||||
|
||||
To test with a different URL, modify the script or create a custom test:
|
||||
|
||||
```python
payload = {
    "urls": ["https://your-url-here.com"],
    "browser_config": {"headless": True},
    "crawler_config": {"cache_mode": "bypass"},
    "webhook_config": {
        "webhook_url": "http://localhost:8080/webhook",
        "webhook_data_in_payload": True
    }
}
```
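
If several URLs need testing, the payload can be built by a small helper instead of being edited by hand. The sketch below reuses the field names from the payload in this section; `build_webhook_payload` is a hypothetical helper, and the URL check it performs is an added assumption rather than something the server is known to require. Submitting the payload is left out because this section does not name the endpoint.

```python
from urllib.parse import urlparse


def build_webhook_payload(urls, webhook_url="http://localhost:8080/webhook", include_data=True):
    """Build a crawl request body with a webhook config for the given URLs."""
    urls = list(urls)
    for candidate in urls + [webhook_url]:
        parsed = urlparse(candidate)
        # Reject anything that is not an absolute http(s) URL before submitting.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {candidate!r}")
    return {
        "urls": urls,
        "browser_config": {"headless": True},
        "crawler_config": {"cache_mode": "bypass"},
        "webhook_config": {
            "webhook_url": webhook_url,
            "webhook_data_in_payload": include_data,
        },
    }
```

For example, `build_webhook_payload(["https://example.com"])` yields the same structure as the literal payload, with the default receiver address filled in.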
|
||||
|
||||
## Files Generated
|
||||
|
||||
The script creates temporary files:
|
||||
|
||||
- `/tmp/crawl4ai_server.log` - Server output logs
|
||||
- `/tmp/test_webhook.py` - Webhook test Python script
|
||||
|
||||
These are not cleaned up automatically so you can review them after the test.
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - All tests passed successfully
|
||||
- `1` - Test failed (check output for details)
|
||||
|
||||
## Safety Features
|
||||
|
||||
- ✅ Automatic cleanup on exit, interrupt, or error
|
||||
- ✅ Returns to original branch on completion
|
||||
- ✅ Kills all background processes
|
||||
- ✅ Comprehensive error handling
|
||||
- ✅ Colored output for easy reading
|
||||
- ✅ Detailed logging at each step
|
||||
|
||||
## Notes
|
||||
|
||||
- The script uses `set -e` to exit on any command failure
|
||||
- All background processes are tracked and cleaned up
|
||||
- The virtual environment must exist before running
|
||||
- Redis must be available (installed or installable via apt-get/brew)
|
||||
|
||||
## Integration with CI/CD
|
||||
|
||||
This script can be integrated into CI/CD pipelines:
|
||||
|
||||
```yaml
# Example GitHub Actions
- name: Test Webhook Feature
  run: |
    chmod +x tests/test_webhook_feature.sh
    ./tests/test_webhook_feature.sh
```
|
||||
|
||||
## Support
|
||||
|
||||
If you encounter issues:
|
||||
|
||||
1. Check the troubleshooting section above
|
||||
2. Review server logs at `/tmp/crawl4ai_server.log`
|
||||
3. Ensure all prerequisites are met
|
||||
4. Open an issue with the full output of the script
|
||||
201
tests/docker/test_filter_deep_crawl.py
Normal file
@@ -0,0 +1,201 @@
|
||||
"""
|
||||
Test the complete fix for both the filter serialization and JSON serialization issues.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import httpx
|
||||
|
||||
from crawl4ai import BrowserConfig, CacheMode, CrawlerRunConfig
|
||||
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter
|
||||
|
||||
BASE_URL = "http://localhost:11234/" # Adjust port as needed
|
||||
|
||||
async def test_with_docker_client():
|
||||
"""Test using the Docker client (same as 1419.py)."""
|
||||
from crawl4ai.docker_client import Crawl4aiDockerClient
|
||||
|
||||
print("=" * 60)
|
||||
print("Testing with Docker Client")
|
||||
print("=" * 60)
|
||||
|
||||
try:
|
||||
async with Crawl4aiDockerClient(
|
||||
base_url=BASE_URL,
|
||||
verbose=True,
|
||||
) as client:
|
||||
|
||||
# Create filter chain - testing the serialization fix
|
||||
filter_chain = [
|
||||
URLPatternFilter(
|
||||
# patterns=["*about*", "*privacy*", "*terms*"],
|
||||
patterns=["*advanced*"],
|
||||
reverse=True
|
||||
),
|
||||
]
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=BFSDeepCrawlStrategy(
|
||||
max_depth=2, # Keep it shallow for testing
|
||||
# max_pages=5, # Limit pages for testing
|
||||
filter_chain=FilterChain(filter_chain)
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
|
||||
print("\n1. Testing crawl with filters...")
|
||||
results = await client.crawl(
|
||||
["https://docs.crawl4ai.com"], # Simple test page
|
||||
browser_config=BrowserConfig(headless=True),
|
||||
crawler_config=crawler_config,
|
||||
)
|
||||
|
||||
if results:
|
||||
print(f"✅ Crawl succeeded! Type: {type(results)}")
|
||||
if hasattr(results, 'success'):
|
||||
print(f"✅ Results success: {results.success}")
|
||||
# Test that we can iterate results without JSON errors
|
||||
if hasattr(results, '__iter__'):
|
||||
for i, result in enumerate(results):
|
||||
if hasattr(result, 'url'):
|
||||
print(f" Result {i}: {result.url[:50]}...")
|
||||
else:
|
||||
print(f" Result {i}: {str(result)[:50]}...")
|
||||
else:
|
||||
# Handle list of results
|
||||
print(f"✅ Got {len(results)} results")
|
||||
for i, result in enumerate(results[:3]): # Show first 3
|
||||
print(f" Result {i}: {result.url[:50]}...")
|
||||
else:
|
||||
print("❌ Crawl failed - no results returned")
|
||||
return False
|
||||
|
||||
print("\n✅ Docker client test completed successfully!")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Docker client test failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
|
||||
async def test_with_rest_api():
|
||||
"""Test using REST API directly."""
|
||||
print("\n" + "=" * 60)
|
||||
print("Testing with REST API")
|
||||
print("=" * 60)
|
||||
|
||||
# Create filter configuration
|
||||
deep_crawl_strategy_payload = {
|
||||
"type": "BFSDeepCrawlStrategy",
|
||||
"params": {
|
||||
"max_depth": 2,
|
||||
# "max_pages": 5,
|
||||
"filter_chain": {
|
||||
"type": "FilterChain",
|
||||
"params": {
|
||||
"filters": [
|
||||
{
|
||||
"type": "URLPatternFilter",
|
||||
"params": {
|
||||
"patterns": ["*advanced*"],
|
||||
"reverse": True
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
crawl_payload = {
|
||||
"urls": ["https://docs.crawl4ai.com"],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"deep_crawl_strategy": deep_crawl_strategy_payload,
|
||||
"cache_mode": "bypass"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
async with httpx.AsyncClient() as client:
|
||||
print("\n1. Sending crawl request to REST API...")
|
||||
response = await client.post(
|
||||
f"{BASE_URL}crawl",
|
||||
json=crawl_payload,
|
||||
timeout=30
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
print("✅ REST API returned 200 OK")
|
||||
data = response.json()
|
||||
if data.get("success"):
|
||||
results = data.get("results", [])
|
||||
print(f"✅ Got {len(results)} results")
|
||||
for i, result in enumerate(results[:3]):
|
||||
print(f" Result {i}: {result.get('url', 'unknown')[:50]}...")
|
||||
else:
|
||||
print(f"❌ Crawl not successful: {data}")
|
||||
return False
|
||||
else:
|
||||
print(f"❌ REST API returned {response.status_code}")
|
||||
print(f" Response: {response.text[:500]}")
|
||||
return False
|
||||
|
||||
print("\n✅ REST API test completed successfully!")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ REST API test failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all tests."""
|
||||
print("\n🧪 TESTING COMPLETE FIX FOR DOCKER FILTER AND JSON ISSUES")
|
||||
print("=" * 60)
|
||||
print("Make sure the server is running with the updated code!")
|
||||
print("=" * 60)
|
||||
|
||||
results = []
|
||||
|
||||
# Test 1: Docker client
|
||||
docker_passed = await test_with_docker_client()
|
||||
results.append(("Docker Client", docker_passed))
|
||||
|
||||
# Test 2: REST API
|
||||
rest_passed = await test_with_rest_api()
|
||||
results.append(("REST API", rest_passed))
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("FINAL TEST SUMMARY")
|
||||
print("=" * 60)
|
||||
|
||||
all_passed = True
|
||||
for test_name, passed in results:
|
||||
status = "✅ PASSED" if passed else "❌ FAILED"
|
||||
print(f"{test_name:20} {status}")
|
||||
if not passed:
|
||||
all_passed = False
|
||||
|
||||
print("=" * 60)
|
||||
if all_passed:
|
||||
print("🎉 ALL TESTS PASSED! Both issues are fully resolved!")
|
||||
print("\nThe fixes:")
|
||||
print("1. Filter serialization: Fixed by not serializing private __slots__")
|
||||
print("2. JSON serialization: Fixed by removing property descriptors from model_dump()")
|
||||
else:
|
||||
print("⚠️ Some tests failed. Please check the server logs for details.")
|
||||
|
||||
return 0 if all_passed else 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
sys.exit(asyncio.run(main()))
|
||||
349
tests/docker/test_llm_params.py
Executable file
@@ -0,0 +1,349 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test script for LLM temperature and base_url parameters in Crawl4AI Docker API.
|
||||
This demonstrates the new hierarchical configuration system:
|
||||
1. Request-level parameters (highest priority)
|
||||
2. Provider-specific environment variables
|
||||
3. Global environment variables
|
||||
4. System defaults (lowest priority)
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import httpx
|
||||
import json
|
||||
import os
|
||||
from rich.console import Console
|
||||
from rich.panel import Panel
|
||||
from rich.syntax import Syntax
|
||||
from rich.table import Table
|
||||
|
||||
|
||||
console = Console()
|
||||
|
||||
# Configuration
|
||||
BASE_URL = "http://localhost:11235" # Docker API endpoint
|
||||
TEST_URL = "https://httpbin.org/html" # Simple test page
|
||||
|
||||
# --- Helper Functions ---
|
||||
|
||||
async def check_server_health(client: httpx.AsyncClient) -> bool:
|
||||
"""Check if the server is healthy."""
|
||||
console.print("[bold cyan]Checking server health...[/]", end="")
|
||||
try:
|
||||
response = await client.get("/health", timeout=10.0)
|
||||
response.raise_for_status()
|
||||
console.print(" [bold green]✓ Server is healthy![/]")
|
||||
return True
|
||||
except Exception as e:
|
||||
console.print(f"\n[bold red]✗ Server health check failed: {e}[/]")
|
||||
console.print(f"Is the server running at {BASE_URL}?")
|
||||
return False
|
||||
|
||||
def print_request(endpoint: str, payload: dict, title: str = "Request"):
|
||||
"""Pretty print the request."""
|
||||
syntax = Syntax(json.dumps(payload, indent=2), "json", theme="monokai")
|
||||
console.print(Panel.fit(
|
||||
f"[cyan]POST {endpoint}[/cyan]\n{syntax}",
|
||||
title=f"[bold blue]{title}[/]",
|
||||
border_style="blue"
|
||||
))
|
||||
|
||||
def print_response(response: dict, title: str = "Response"):
|
||||
"""Pretty print relevant parts of the response."""
|
||||
# Extract only the relevant parts
|
||||
relevant = {}
|
||||
if "markdown" in response:
|
||||
relevant["markdown"] = response["markdown"][:200] + "..." if len(response.get("markdown", "")) > 200 else response.get("markdown", "")
|
||||
if "success" in response:
|
||||
relevant["success"] = response["success"]
|
||||
if "url" in response:
|
||||
relevant["url"] = response["url"]
|
||||
if "filter" in response:
|
||||
relevant["filter"] = response["filter"]
|
||||
|
||||
console.print(Panel.fit(
|
||||
Syntax(json.dumps(relevant, indent=2), "json", theme="monokai"),
|
||||
title=f"[bold green]{title}[/]",
|
||||
border_style="green"
|
||||
))
|
||||
|
||||
# --- Test Functions ---
|
||||
|
||||
async def test_default_no_params(client: httpx.AsyncClient):
|
||||
"""Test 1: No temperature or base_url specified - uses defaults"""
|
||||
console.rule("[bold yellow]Test 1: Default Configuration (No Parameters)[/]")
|
||||
|
||||
payload = {
|
||||
"url": TEST_URL,
|
||||
"f": "llm",
|
||||
"q": "What is the main heading of this page? Answer in exactly 5 words."
|
||||
}
|
||||
|
||||
print_request("/md", payload, "Request without temperature/base_url")
|
||||
|
||||
try:
|
||||
response = await client.post("/md", json=payload, timeout=30.0)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
print_response(data, "Response (using system defaults)")
|
||||
console.print("[dim]→ This used system defaults or environment variables if set[/]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error: {e}[/]")
|
||||
|
||||
async def test_request_temperature(client: httpx.AsyncClient):
|
||||
"""Test 2: Request-level temperature (highest priority)"""
|
||||
console.rule("[bold yellow]Test 2: Request-Level Temperature[/]")
|
||||
|
||||
# Test with low temperature (more focused)
|
||||
payload_low = {
|
||||
"url": TEST_URL,
|
||||
"f": "llm",
|
||||
"q": "What is the main heading? Be creative and poetic.",
|
||||
"temperature": 0.1 # Very low - should be less creative
|
||||
}
|
||||
|
||||
print_request("/md", payload_low, "Low Temperature (0.1)")
|
||||
|
||||
try:
|
||||
response = await client.post("/md", json=payload_low, timeout=30.0)
|
||||
response.raise_for_status()
|
||||
data_low = response.json()
|
||||
print_response(data_low, "Response with Low Temperature")
|
||||
console.print("[dim]→ Low temperature (0.1) should produce focused, less creative output[/]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error: {e}[/]")
|
||||
|
||||
console.print()
|
||||
|
||||
# Test with high temperature (more creative)
|
||||
payload_high = {
|
||||
"url": TEST_URL,
|
||||
"f": "llm",
|
||||
"q": "What is the main heading? Be creative and poetic.",
|
||||
"temperature": 1.5 # High - should be more creative
|
||||
}
|
||||
|
||||
print_request("/md", payload_high, "High Temperature (1.5)")
|
||||
|
||||
try:
|
||||
response = await client.post("/md", json=payload_high, timeout=30.0)
|
||||
response.raise_for_status()
|
||||
data_high = response.json()
|
||||
print_response(data_high, "Response with High Temperature")
|
||||
console.print("[dim]→ High temperature (1.5) should produce more creative, varied output[/]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error: {e}[/]")
|
||||
|
||||
async def test_provider_override(client: httpx.AsyncClient):
|
||||
"""Test 3: Provider override with temperature"""
|
||||
console.rule("[bold yellow]Test 3: Provider Override with Temperature[/]")
|
||||
|
||||
provider = "gemini/gemini-2.5-flash-lite"
|
||||
payload = {
|
||||
"url": TEST_URL,
|
||||
"f": "llm",
|
||||
"q": "Summarize this page in one sentence.",
|
||||
"provider": provider, # Explicitly set provider
|
||||
"temperature": 0.7
|
||||
}
|
||||
|
||||
print_request("/md", payload, "Provider + Temperature Override")
|
||||
|
||||
try:
|
||||
response = await client.post("/md", json=payload, timeout=30.0)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
print_response(data, "Response with Provider Override")
|
||||
console.print(f"[dim]→ This explicitly uses {provider} with temperature 0.7[/]")
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error: {e}[/]")
|
||||
|
||||
async def test_base_url_custom(client: httpx.AsyncClient):
|
||||
"""Test 4: Custom base_url (will fail unless you have a custom endpoint)"""
|
||||
console.rule("[bold yellow]Test 4: Custom Base URL (Demo Only)[/]")
|
||||
|
||||
payload = {
|
||||
"url": TEST_URL,
|
||||
"f": "llm",
|
||||
"q": "What is this page about?",
|
||||
"base_url": "https://api.custom-endpoint.com/v1", # Custom endpoint
|
||||
"temperature": 0.5
|
||||
}
|
||||
|
||||
print_request("/md", payload, "Custom Base URL Request")
|
||||
console.print("[yellow]Note: This will fail unless you have a custom endpoint set up[/]")
|
||||
|
||||
try:
|
||||
response = await client.post("/md", json=payload, timeout=10.0)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
print_response(data, "Response from Custom Endpoint")
|
||||
except httpx.HTTPStatusError as e:
|
||||
console.print(f"[yellow]Expected failure (no custom endpoint): Status {e.response.status_code}[/]")
|
||||
except Exception as e:
|
||||
console.print(f"[yellow]Expected error: {e}[/]")
|
||||
|
||||
async def test_llm_job_endpoint(client: httpx.AsyncClient):
|
||||
"""Test 5: Test the /llm/job endpoint with temperature and base_url"""
|
||||
console.rule("[bold yellow]Test 5: LLM Job Endpoint with Parameters[/]")
|
||||
|
||||
payload = {
|
||||
"url": TEST_URL,
|
||||
"q": "Extract the main title and any key information",
|
||||
"temperature": 0.3,
|
||||
# "base_url": "https://api.openai.com/v1" # Optional
|
||||
}
|
||||
|
||||
print_request("/llm/job", payload, "LLM Job with Temperature")
|
||||
|
||||
try:
|
||||
# Submit the job
|
||||
response = await client.post("/llm/job", json=payload, timeout=30.0)
|
||||
response.raise_for_status()
|
||||
job_data = response.json()
|
||||
|
||||
if "task_id" in job_data:
|
||||
task_id = job_data["task_id"]
|
||||
console.print(f"[green]Job created with task_id: {task_id}[/]")
|
||||
|
||||
# Poll for result (simplified - in production use proper polling)
|
||||
await asyncio.sleep(3)
|
||||
|
||||
status_response = await client.get(f"/llm/job/{task_id}")
|
||||
status_data = status_response.json()
|
||||
|
||||
if status_data.get("status") == "completed":
|
||||
console.print("[green]Job completed successfully![/]")
|
||||
if "result" in status_data:
|
||||
console.print(Panel.fit(
|
||||
Syntax(json.dumps(status_data["result"], indent=2), "json", theme="monokai"),
|
||||
title="Extraction Result",
|
||||
border_style="green"
|
||||
))
|
||||
else:
|
||||
console.print(f"[yellow]Job status: {status_data.get('status', 'unknown')}[/]")
|
||||
else:
|
||||
console.print(f"[red]Unexpected response: {job_data}[/]")
|
||||
|
||||
except Exception as e:
|
||||
console.print(f"[red]Error: {e}[/]")
|
||||
|
||||
|
||||
async def test_llm_endpoint(client: httpx.AsyncClient):
|
||||
"""
|
||||
Quick QA round-trip with /llm.
|
||||
Asks a trivial question against a fixed sample page just to show wiring.
|
||||
"""
|
||||
import time
|
||||
import urllib.parse
|
||||
|
||||
page_url = "https://kidocode.com"
|
||||
question = "What is the title of this page?"
|
||||
|
||||
enc = urllib.parse.quote_plus(page_url, safe="")
|
||||
console.print(f"GET /llm/{enc}?q={question}")
|
||||
|
||||
try:
|
||||
t0 = time.time()
|
||||
resp = await client.get(f"/llm/{enc}", params={"q": question})
|
||||
dt = time.time() - t0
|
||||
console.print(
|
||||
f"Response Status: [bold {'green' if resp.is_success else 'red'}]{resp.status_code}[/] (took {dt:.2f}s)")
|
||||
resp.raise_for_status()
|
||||
answer = resp.json().get("answer", "")
|
||||
console.print(Panel(answer or "No answer returned",
|
||||
title="LLM answer", border_style="magenta", expand=False))
|
||||
except Exception as e:
|
||||
console.print(f"[bold red]Error hitting /llm:[/] {e}")
|
||||
|
||||
|
||||
async def show_environment_info():
|
||||
"""Display current environment configuration"""
|
||||
console.rule("[bold cyan]Current Environment Configuration[/]")
|
||||
|
||||
table = Table(title="LLM Environment Variables", show_header=True, header_style="bold magenta")
|
||||
table.add_column("Variable", style="cyan", width=30)
|
||||
table.add_column("Value", style="yellow")
|
||||
table.add_column("Description", style="dim")
|
||||
|
||||
env_vars = [
|
||||
("LLM_PROVIDER", "Global default provider"),
|
||||
("LLM_TEMPERATURE", "Global default temperature"),
|
||||
("LLM_BASE_URL", "Global custom API endpoint"),
|
||||
("OPENAI_API_KEY", "OpenAI API key"),
|
||||
("OPENAI_TEMPERATURE", "OpenAI-specific temperature"),
|
||||
("OPENAI_BASE_URL", "OpenAI-specific endpoint"),
|
||||
("ANTHROPIC_API_KEY", "Anthropic API key"),
|
||||
("ANTHROPIC_TEMPERATURE", "Anthropic-specific temperature"),
|
||||
("GROQ_API_KEY", "Groq API key"),
|
||||
("GROQ_TEMPERATURE", "Groq-specific temperature"),
|
||||
]
|
||||
|
||||
for var, desc in env_vars:
|
||||
value = os.environ.get(var, "[not set]")
|
||||
if "API_KEY" in var and value != "[not set]":
|
||||
# Mask API keys for security
|
||||
value = value[:10] + "..." if len(value) > 10 else "***"
|
||||
table.add_row(var, value, desc)
|
||||
|
||||
console.print(table)
|
||||
console.print()
|
||||
|
||||
# --- Main Test Runner ---
|
||||
|
||||
async def main():
|
||||
"""Run all tests"""
|
||||
console.print(Panel.fit(
|
||||
"[bold cyan]Crawl4AI LLM Parameters Test Suite[/]\n" +
|
||||
"Testing temperature and base_url configuration hierarchy",
|
||||
border_style="cyan"
|
||||
))
|
||||
|
||||
# Show current environment
|
||||
# await show_environment_info()
|
||||
|
||||
# Create HTTP client
|
||||
async with httpx.AsyncClient(base_url=BASE_URL, timeout=60.0) as client:
|
||||
# Check server health
|
||||
if not await check_server_health(client):
|
||||
console.print("[red]Server is not available. Please ensure the Docker container is running.[/]")
|
||||
return
|
||||
|
||||
# Run tests
|
||||
tests = [
|
||||
("Default Configuration", test_default_no_params),
|
||||
("Request Temperature", test_request_temperature),
|
||||
("Provider Override", test_provider_override),
|
||||
("Custom Base URL", test_base_url_custom),
|
||||
("LLM Job Endpoint", test_llm_job_endpoint),
|
||||
("LLM Endpoint", test_llm_endpoint),
|
||||
]
|
||||
|
||||
for i, (name, test_func) in enumerate(tests, 1):
|
||||
if i > 1:
|
||||
console.print() # Add spacing between tests
|
||||
|
||||
try:
|
||||
await test_func(client)
|
||||
except Exception as e:
|
||||
console.print(f"[red]Test '{name}' failed with error: {e}[/]")
|
||||
console.print_exception(show_locals=False)
|
||||
|
||||
console.rule("[bold green]All Tests Complete![/]", style="green")
|
||||
|
||||
# Summary
|
||||
console.print("\n[bold cyan]Configuration Hierarchy Summary:[/]")
|
||||
console.print("1. [yellow]Request parameters[/] - Highest priority (temperature, base_url in API call)")
|
||||
console.print("2. [yellow]Provider-specific env[/] - e.g., OPENAI_TEMPERATURE, GROQ_BASE_URL")
|
||||
console.print("3. [yellow]Global env variables[/] - LLM_TEMPERATURE, LLM_BASE_URL")
|
||||
console.print("4. [yellow]System defaults[/] - Lowest priority (provider/litellm defaults)")
|
||||
console.print()
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
asyncio.run(main())
|
||||
except KeyboardInterrupt:
|
||||
console.print("\n[yellow]Tests interrupted by user.[/]")
|
||||
except Exception as e:
|
||||
console.print(f"\n[bold red]An error occurred:[/]")
|
||||
console.print_exception(show_locals=False)
|
||||
@@ -635,7 +635,209 @@ class TestCrawlEndpoints:
|
||||
pytest.fail(f"LLM extracted content parsing or validation failed: {e}\nContent: {result['extracted_content']}")
|
||||
except Exception as e: # Catch any other unexpected error
|
||||
pytest.fail(f"An unexpected error occurred during LLM result processing: {e}\nContent: {result['extracted_content']}")
|
||||
|
||||
|
||||
|
||||
# 7. Error Handling Tests
|
||||
async def test_invalid_url_handling(self, async_client: httpx.AsyncClient):
|
||||
"""Test error handling for invalid URLs."""
|
||||
payload = {
|
||||
"urls": ["invalid-url", "https://nonexistent-domain-12345.com"],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": CacheMode.BYPASS.value}}
|
||||
}
|
||||
|
||||
response = await async_client.post("/crawl", json=payload)
|
||||
# The endpoint currently answers 500 with a "Crawl request failed:" detail when every URL fails
|
||||
print(f"Status code: {response.status_code}")
|
||||
print(f"Response: {response.text}")
|
||||
assert response.status_code == 500
|
||||
data = response.json()
|
||||
assert data["detail"].startswith("Crawl request failed:")
|
||||
|
||||
    async def test_mixed_success_failure_urls(self, async_client: httpx.AsyncClient):
        """Test handling of mixed success/failure URLs."""
        payload = {
            "urls": [
                SIMPLE_HTML_URL,  # Should succeed
                "https://nonexistent-domain-12345.com",  # Should fail
                "https://invalid-url-with-special-chars-!@#$%^&*()",  # Should fail
            ],
            "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
            "crawler_config": {
                "type": "CrawlerRunConfig",
                "params": {
                    "cache_mode": CacheMode.BYPASS.value,
                    "markdown_generator": {
                        "type": "DefaultMarkdownGenerator",
                        "params": {
                            "content_filter": {
                                "type": "PruningContentFilter",
                                "params": {"threshold": 0.5}
                            }
                        }
                    }
                }
            }
        }

        response = await async_client.post("/crawl", json=payload)
        assert response.status_code == 200
        data = response.json()
        assert data["success"] is True
        assert len(data["results"]) == 3

        success_count = 0
        failure_count = 0

        for result in data["results"]:
            if result["success"]:
                success_count += 1
            else:
                failure_count += 1
                assert "error_message" in result
                assert len(result["error_message"]) > 0

        assert success_count >= 1  # At least one should succeed
        assert failure_count >= 1  # At least one should fail

    async def test_streaming_mixed_urls(self, async_client: httpx.AsyncClient):
        """Test streaming with mixed success/failure URLs."""
        payload = {
            "urls": [
                SIMPLE_HTML_URL,  # Should succeed
                "https://nonexistent-domain-12345.com",  # Should fail
            ],
            "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
            "crawler_config": {
                "type": "CrawlerRunConfig",
                "params": {
                    "stream": True,
                    "cache_mode": CacheMode.BYPASS.value
                }
            }
        }

        async with async_client.stream("POST", "/crawl/stream", json=payload) as response:
            response.raise_for_status()
            results = await process_streaming_response(response)

        assert len(results) == 2

        success_count = 0
        failure_count = 0

        for result in results:
            if result["success"]:
                success_count += 1
                assert result["url"] == SIMPLE_HTML_URL
            else:
                failure_count += 1
                assert "error_message" in result
                assert result["error_message"] is not None

        assert success_count == 1
        assert failure_count == 1

    async def test_markdown_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for markdown endpoint."""
        # Test invalid URL
        invalid_payload = {"url": "invalid-url", "f": "fit"}
        response = await async_client.post("/md", json=invalid_payload)
        # Should return 400 for invalid URL format
        assert response.status_code == 400

        # Test non-existent URL
        nonexistent_payload = {"url": "https://nonexistent-domain-12345.com", "f": "fit"}
        response = await async_client.post("/md", json=nonexistent_payload)
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_html_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for HTML endpoint."""
        # Test invalid URL
        invalid_payload = {"url": "invalid-url"}
        response = await async_client.post("/html", json=invalid_payload)
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_screenshot_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for screenshot endpoint."""
        # Test invalid URL
        invalid_payload = {"url": "invalid-url"}
        response = await async_client.post("/screenshot", json=invalid_payload)
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_pdf_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for PDF endpoint."""
        # Test invalid URL
        invalid_payload = {"url": "invalid-url"}
        response = await async_client.post("/pdf", json=invalid_payload)
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_execute_js_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for execute_js endpoint."""
        # Test invalid URL
        invalid_payload = {"url": "invalid-url", "scripts": ["return document.title;"]}
        response = await async_client.post("/execute_js", json=invalid_payload)
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_llm_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for LLM endpoint."""
        # Test missing query parameter
        response = await async_client.get("/llm/https://example.com")
        assert response.status_code == 422  # FastAPI validation error, not 400

        # Test invalid URL
        response = await async_client.get("/llm/invalid-url?q=test")
        # Should return 500 for crawl failure
        assert response.status_code == 500

    async def test_ask_endpoint_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for ask endpoint."""
        # Test invalid context_type
        response = await async_client.get("/ask?context_type=invalid")
        assert response.status_code == 422  # Validation error

        # Test invalid score_ratio
        response = await async_client.get("/ask?score_ratio=2.0")  # > 1.0
        assert response.status_code == 422  # Validation error

        # Test invalid max_results
        response = await async_client.get("/ask?max_results=0")  # < 1
        assert response.status_code == 422  # Validation error

    async def test_config_dump_error_handling(self, async_client: httpx.AsyncClient):
        """Test error handling for config dump endpoint."""
        # Test invalid code
        invalid_payload = {"code": "invalid_code"}
        response = await async_client.post("/config/dump", json=invalid_payload)
        assert response.status_code == 400

        # Test nested function calls (not allowed)
        nested_payload = {"code": "CrawlerRunConfig(BrowserConfig())"}
        response = await async_client.post("/config/dump", json=nested_payload)
        assert response.status_code == 400

    async def test_malformed_request_handling(self, async_client: httpx.AsyncClient):
        """Test handling of malformed requests."""
        # Test missing required fields
        malformed_payload = {"urls": []}  # Missing browser_config and crawler_config
        response = await async_client.post("/crawl", json=malformed_payload)
        print(f"Response: {response.text}")
        assert response.status_code == 422  # Validation error

        # Test empty URLs list
        empty_urls_payload = {
            "urls": [],
            "browser_config": {"type": "BrowserConfig", "params": {}},
            "crawler_config": {"type": "CrawlerRunConfig", "params": {}}
        }
        response = await async_client.post("/crawl", json=empty_urls_payload)
        assert response.status_code == 422  # "At least one URL required"

if __name__ == "__main__":
    # Define arguments for pytest programmatically
    # -v: verbose output

175
tests/test_preserve_https_for_internal_links.py
Normal file
@@ -0,0 +1,175 @@
#!/usr/bin/env python3
"""
Final test and demo for HTTPS preservation feature (Issue #1410)

This demonstrates how the preserve_https_for_internal_links flag
prevents HTTPS downgrade when servers redirect to HTTP.
"""

import sys
import os
from urllib.parse import urljoin, urlparse

def demonstrate_issue():
    """Show the problem: HTTPS -> HTTP redirect causes HTTP links"""

    print("=" * 60)
    print("DEMONSTRATING THE ISSUE")
    print("=" * 60)

    # Simulate what happens during crawling
    original_url = "https://quotes.toscrape.com/tag/deep-thoughts"
    redirected_url = "http://quotes.toscrape.com/tag/deep-thoughts/"  # Server redirects to HTTP

    # Extract a relative link
    relative_link = "/author/Albert-Einstein"

    # Standard URL joining uses the redirected (HTTP) base
    resolved_url = urljoin(redirected_url, relative_link)

    print(f"Original URL: {original_url}")
    print(f"Redirected to: {redirected_url}")
    print(f"Relative link: {relative_link}")
    print(f"Resolved link: {resolved_url}")
    print(f"\n❌ Problem: Link is now HTTP instead of HTTPS!")

    return resolved_url

def demonstrate_solution():
    """Show the solution: preserve HTTPS for internal links"""

    print("\n" + "=" * 60)
    print("DEMONSTRATING THE SOLUTION")
    print("=" * 60)

    # Our normalize_url with HTTPS preservation
    def normalize_url_with_preservation(href, base_url, preserve_https=False, original_scheme=None):
        """Normalize URL with optional HTTPS preservation"""

        # Standard resolution
        full_url = urljoin(base_url, href.strip())

        # Preserve HTTPS if requested
        if preserve_https and original_scheme == 'https':
            parsed_full = urlparse(full_url)
            parsed_base = urlparse(base_url)

            # Only for same-domain links
            if parsed_full.scheme == 'http' and parsed_full.netloc == parsed_base.netloc:
                full_url = full_url.replace('http://', 'https://', 1)
                print(f" → Preserved HTTPS for {parsed_full.netloc}")

        return full_url

    # Same scenario as before
    original_url = "https://quotes.toscrape.com/tag/deep-thoughts"
    redirected_url = "http://quotes.toscrape.com/tag/deep-thoughts/"
    relative_link = "/author/Albert-Einstein"

    # Without preservation (current behavior)
    resolved_without = normalize_url_with_preservation(
        relative_link, redirected_url,
        preserve_https=False, original_scheme='https'
    )

    print(f"\nWithout preservation:")
    print(f" Result: {resolved_without}")

    # With preservation (new feature)
    resolved_with = normalize_url_with_preservation(
        relative_link, redirected_url,
        preserve_https=True, original_scheme='https'
    )

    print(f"\nWith preservation (preserve_https_for_internal_links=True):")
    print(f" Result: {resolved_with}")
    print(f"\n✅ Solution: Internal link stays HTTPS!")

    return resolved_with

def test_edge_cases():
    """Test important edge cases"""

    print("\n" + "=" * 60)
    print("EDGE CASES")
    print("=" * 60)

    from urllib.parse import urljoin, urlparse

    def preserve_https(href, base_url, original_scheme):
        """Helper to test preservation logic"""
        full_url = urljoin(base_url, href)

        if original_scheme == 'https':
            parsed_full = urlparse(full_url)
            parsed_base = urlparse(base_url)
            # Fixed: check for protocol-relative URLs
            if (parsed_full.scheme == 'http' and
                    parsed_full.netloc == parsed_base.netloc and
                    not href.strip().startswith('//')):
                full_url = full_url.replace('http://', 'https://', 1)

        return full_url

    test_cases = [
        # (description, href, base_url, original_scheme, should_be_https)
        ("External link", "http://other.com/page", "http://example.com", "https", False),
        ("Already HTTPS", "/page", "https://example.com", "https", True),
        ("No original HTTPS", "/page", "http://example.com", "http", False),
        ("Subdomain", "/page", "http://sub.example.com", "https", True),
        ("Protocol-relative", "//example.com/page", "http://example.com", "https", False),
    ]

    for desc, href, base_url, orig_scheme, should_be_https in test_cases:
        result = preserve_https(href, base_url, orig_scheme)
        is_https = result.startswith('https://')
        status = "✅" if is_https == should_be_https else "❌"

        print(f"\n{status} {desc}:")
        print(f" Input: {href} + {base_url}")
        print(f" Result: {result}")
        print(f" Expected HTTPS: {should_be_https}, Got: {is_https}")

def usage_example():
    """Show how to use the feature in crawl4ai"""

    print("\n" + "=" * 60)
    print("USAGE IN CRAWL4AI")
    print("=" * 60)

    print("""
To enable HTTPS preservation in your crawl4ai code:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    config = CrawlerRunConfig(
        preserve_https_for_internal_links=True  # Enable HTTPS preservation
    )

    result = await crawler.arun(
        url="https://example.com",
        config=config
    )

    # All internal links will maintain HTTPS even if
    # the server redirects to HTTP
```

This is especially useful for:
- Sites that redirect HTTPS to HTTP but still support HTTPS
- Security-conscious crawling where you want to stay on HTTPS
- Avoiding mixed content issues in downstream processing
""")

if __name__ == "__main__":
    # Run all demonstrations
    demonstrate_issue()
    demonstrate_solution()
    test_edge_cases()
    usage_example()

    print("\n" + "=" * 60)
    print("✅ All tests complete!")
    print("=" * 60)
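The edge-case table in `test_edge_cases` above only prints a ✅ or ❌ per case, so a regression would not fail a test run. A minimal sketch of the same checks written as assertions follows. It assumes the helper logic shown in the demo file; the name `resolve_link` is hypothetical and is not part of the crawl4ai API.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(href: str, base_url: str, original_scheme: str) -> str:
    """Resolve href against base_url, upgrading same-host http links to https.

    Hypothetical helper mirroring the demo logic above; not the crawl4ai API.
    """
    full_url = urljoin(base_url, href.strip())
    if original_scheme != "https":
        return full_url

    parsed_full = urlparse(full_url)
    parsed_base = urlparse(base_url)
    same_host = parsed_full.netloc == parsed_base.netloc
    protocol_relative = href.strip().startswith("//")

    # Upgrade only plain-http links on the same host that were not protocol-relative
    if parsed_full.scheme == "http" and same_host and not protocol_relative:
        full_url = "https://" + full_url[len("http://"):]
    return full_url


CASES = [
    # (href, base_url, original_scheme, expected)
    ("http://other.com/page", "http://example.com", "https", "http://other.com/page"),
    ("/page", "https://example.com", "https", "https://example.com/page"),
    ("/page", "http://example.com", "http", "http://example.com/page"),
    ("/page", "http://sub.example.com", "https", "https://sub.example.com/page"),
    ("//example.com/page", "http://example.com", "https", "http://example.com/page"),
]

for href, base_url, scheme, expected in CASES:
    assert resolve_link(href, base_url, scheme) == expected, (href, base_url)
```

Comparing against the full expected URL, rather than only the `https://` prefix, also catches a wrong host or path.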
@@ -1,305 +0,0 @@
#!/bin/bash

#############################################################################
# Webhook Feature Test Script
#
# This script tests the webhook feature implementation by:
# 1. Switching to the webhook feature branch
# 2. Installing dependencies
# 3. Starting the server
# 4. Running webhook tests
# 5. Cleaning up and returning to original branch
#
# Usage: ./test_webhook_feature.sh
#############################################################################

set -e  # Exit on error

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Configuration
BRANCH_NAME="claude/implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp"
VENV_PATH="venv"
SERVER_PORT=11235
WEBHOOK_PORT=8080
PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"

# PID files for cleanup
REDIS_PID=""
SERVER_PID=""
WEBHOOK_PID=""

#############################################################################
# Utility Functions
#############################################################################

log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

cleanup() {
    log_info "Starting cleanup..."

    # Kill webhook receiver if running
    if [ ! -z "$WEBHOOK_PID" ] && kill -0 $WEBHOOK_PID 2>/dev/null; then
        log_info "Stopping webhook receiver (PID: $WEBHOOK_PID)..."
        kill $WEBHOOK_PID 2>/dev/null || true
    fi

    # Kill server if running
    if [ ! -z "$SERVER_PID" ] && kill -0 $SERVER_PID 2>/dev/null; then
        log_info "Stopping server (PID: $SERVER_PID)..."
        kill $SERVER_PID 2>/dev/null || true
    fi

    # Kill Redis if running
    if [ ! -z "$REDIS_PID" ] && kill -0 $REDIS_PID 2>/dev/null; then
        log_info "Stopping Redis (PID: $REDIS_PID)..."
        kill $REDIS_PID 2>/dev/null || true
    fi

    # Also kill by port if PIDs didn't work
    lsof -ti:$SERVER_PORT | xargs kill -9 2>/dev/null || true
    lsof -ti:$WEBHOOK_PORT | xargs kill -9 2>/dev/null || true
    lsof -ti:6379 | xargs kill -9 2>/dev/null || true

    # Return to original branch
    if [ ! -z "$ORIGINAL_BRANCH" ]; then
        log_info "Switching back to branch: $ORIGINAL_BRANCH"
        git checkout $ORIGINAL_BRANCH 2>/dev/null || true
    fi

    log_success "Cleanup complete"
}

# Set trap to cleanup on exit
trap cleanup EXIT INT TERM

#############################################################################
# Main Script
#############################################################################

log_info "Starting webhook feature test script"
log_info "Project root: $PROJECT_ROOT"

cd "$PROJECT_ROOT"

# Step 1: Save current branch and fetch PR
log_info "Step 1: Fetching PR branch..."
ORIGINAL_BRANCH=$(git rev-parse --abbrev-ref HEAD)
log_info "Current branch: $ORIGINAL_BRANCH"

git fetch origin $BRANCH_NAME
log_success "Branch fetched"

# Step 2: Switch to new branch
log_info "Step 2: Switching to branch: $BRANCH_NAME"
git checkout $BRANCH_NAME
log_success "Switched to webhook feature branch"

# Step 3: Activate virtual environment
log_info "Step 3: Activating virtual environment..."
if [ ! -d "$VENV_PATH" ]; then
    log_error "Virtual environment not found at $VENV_PATH"
    log_info "Creating virtual environment..."
    python3 -m venv $VENV_PATH
fi

source $VENV_PATH/bin/activate
log_success "Virtual environment activated: $(which python)"

# Step 4: Install server dependencies
log_info "Step 4: Installing server dependencies..."
pip install -q -r deploy/docker/requirements.txt
log_success "Dependencies installed"

# Check if Redis is available
log_info "Checking Redis availability..."
if ! command -v redis-server &> /dev/null; then
    log_warning "Redis not found, attempting to install..."
    if command -v apt-get &> /dev/null; then
        sudo apt-get update && sudo apt-get install -y redis-server
    elif command -v brew &> /dev/null; then
        brew install redis
    else
        log_error "Cannot install Redis automatically. Please install Redis manually."
        exit 1
    fi
fi

# Step 5: Start Redis in background
log_info "Step 5a: Starting Redis..."
redis-server --port 6379 --daemonize yes
sleep 2
REDIS_PID=$(pgrep redis-server)
log_success "Redis started (PID: $REDIS_PID)"

# Step 5b: Start server in background
log_info "Step 5b: Starting server on port $SERVER_PORT..."
cd deploy/docker

# Start server in background
python3 -m uvicorn server:app --host 0.0.0.0 --port $SERVER_PORT > /tmp/crawl4ai_server.log 2>&1 &
SERVER_PID=$!
cd "$PROJECT_ROOT"

log_info "Server started (PID: $SERVER_PID)"

# Wait for server to be ready
log_info "Waiting for server to be ready..."
for i in {1..30}; do
    if curl -s http://localhost:$SERVER_PORT/health > /dev/null 2>&1; then
        log_success "Server is ready!"
        break
    fi
    if [ $i -eq 30 ]; then
        log_error "Server failed to start within 30 seconds"
        log_info "Server logs:"
        tail -50 /tmp/crawl4ai_server.log
        exit 1
    fi
    echo -n "."
    sleep 1
done
echo ""

# Step 6: Create and run webhook test
log_info "Step 6: Creating webhook test script..."

cat > /tmp/test_webhook.py << 'PYTHON_SCRIPT'
import requests
import json
import time
from flask import Flask, request, jsonify
from threading import Thread, Event

# Configuration
CRAWL4AI_BASE_URL = "http://localhost:11235"
WEBHOOK_BASE_URL = "http://localhost:8080"

# Flask app for webhook receiver
app = Flask(__name__)
webhook_received = Event()
webhook_data = {}

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    global webhook_data
    webhook_data = request.json
    webhook_received.set()
    print(f"\n✅ Webhook received: {json.dumps(webhook_data, indent=2)}")
    return jsonify({"status": "received"}), 200

def start_webhook_server():
    app.run(host='0.0.0.0', port=8080, debug=False, use_reloader=False)

# Start webhook server in background
webhook_thread = Thread(target=start_webhook_server, daemon=True)
webhook_thread.start()
time.sleep(2)

print("🚀 Submitting crawl job with webhook...")

# Submit job with webhook
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"headless": True},
    "crawler_config": {"cache_mode": "bypass"},
    "webhook_config": {
        "webhook_url": f"{WEBHOOK_BASE_URL}/webhook",
        "webhook_data_in_payload": True
    }
}

response = requests.post(
    f"{CRAWL4AI_BASE_URL}/crawl/job",
    json=payload,
    headers={"Content-Type": "application/json"}
)

if not response.ok:
    print(f"❌ Failed to submit job: {response.text}")
    exit(1)

task_id = response.json()['task_id']
print(f"✅ Job submitted successfully, task_id: {task_id}")

# Wait for webhook (with timeout)
print("⏳ Waiting for webhook notification...")
if webhook_received.wait(timeout=60):
    print(f"✅ Webhook received!")
    print(f" Task ID: {webhook_data.get('task_id')}")
    print(f" Status: {webhook_data.get('status')}")
    print(f" URLs: {webhook_data.get('urls')}")

    if webhook_data.get('status') == 'completed':
        if 'data' in webhook_data:
            print(f" ✅ Data included in webhook payload")
            results = webhook_data['data'].get('results', [])
            if results:
                print(f" 📄 Crawled {len(results)} URL(s)")
                for result in results:
                    print(f" - {result.get('url')}: {len(result.get('markdown', ''))} chars")
        print("\n🎉 Webhook test PASSED!")
        exit(0)
    else:
        print(f" ❌ Job failed: {webhook_data.get('error')}")
        exit(1)
else:
    print("❌ Webhook not received within 60 seconds")
    # Try polling as fallback
    print("⏳ Trying to poll job status...")
    for i in range(10):
        status_response = requests.get(f"{CRAWL4AI_BASE_URL}/crawl/job/{task_id}")
        if status_response.ok:
            status = status_response.json()
            print(f" Status: {status.get('status')}")
            if status.get('status') in ['completed', 'failed']:
                break
        time.sleep(2)
    exit(1)
PYTHON_SCRIPT

# Install Flask for webhook receiver
pip install -q flask

# Run the webhook test
log_info "Running webhook test..."
python3 /tmp/test_webhook.py &
WEBHOOK_PID=$!

# Wait for test to complete; capture the exit code without tripping `set -e`,
# which would otherwise abort the script here before the failure branch runs
TEST_EXIT_CODE=0
wait $WEBHOOK_PID || TEST_EXIT_CODE=$?

# Step 7: Verify results
log_info "Step 7: Verifying test results..."
if [ $TEST_EXIT_CODE -eq 0 ]; then
    log_success "✅ Webhook test PASSED!"
else
    log_error "❌ Webhook test FAILED (exit code: $TEST_EXIT_CODE)"
    log_info "Server logs:"
    tail -100 /tmp/crawl4ai_server.log
    exit 1
fi

# Step 8: Cleanup happens automatically via trap
log_success "All tests completed successfully! 🎉"
log_info "Cleanup will happen automatically..."