Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update

This commit is contained in:
UncleCode
2025-07-12 18:54:20 +08:00
320 changed files with 115071 additions and 514 deletions


@@ -0,0 +1,432 @@
# Advanced Adaptive Strategies
## Overview
While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.
## The Three-Layer Scoring System
### 1. Coverage Score
Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.
#### Mathematical Foundation
```python
Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|
where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```
#### Components
- **Document Coverage**: Percentage of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently
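To make the formula concrete, here is a minimal sketch of the coverage computation, assuming the knowledge base is represented as one set of terms per crawled document (the crawler's actual internal representation and frequency weighting may differ):
```python
import math

def coverage(knowledge_base: list[set[str]], query_terms: list[str]) -> float:
    # Each document is modeled as a set of terms; document counts stand in
    # for raw term frequency in the logarithmic boost below.
    if not query_terms or not knowledge_base:
        return 0.0
    total = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for doc in knowledge_base if term in doc)
        doc_coverage = docs_with_term / len(knowledge_base)
        freq_boost = math.log1p(docs_with_term)  # logarithmic frequency bonus
        total += doc_coverage * (1 + freq_boost)
    return total / len(query_terms)
```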
#### Tuning Coverage
```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,  # Lower threshold
    top_k_links=2              # More focused
)
```
### 2. Consistency Score
Consistency evaluates whether the information across pages is coherent and non-contradictory.
#### How It Works
1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns normalized score (0-1)
#### Practical Impact
- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information, need more sources
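As a rough illustration of the idea (not the library's exact algorithm, which compares extracted statements), pairwise term overlap between documents can stand in for statement-level agreement:
```python
from itertools import combinations

def consistency(doc_term_sets: list[set[str]]) -> float:
    # Toy proxy: average Jaccard overlap between all document pairs,
    # normalized to 0-1 as described above.
    if len(doc_term_sets) < 2:
        return 1.0
    scores = []
    for a, b in combinations(doc_term_sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```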
### 3. Saturation Score
Saturation detects when new pages stop providing novel information.
#### Detection Algorithm
```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30 # 60% of first
new_terms_page_3 = 15 # 50% of second
new_terms_page_4 = 5 # 33% of third
# Saturation detected: rapidly diminishing returns
```
#### Configuration
```python
config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
)
```
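A sketch of how a gain threshold like this might be applied, assuming the crawler tracks new unique terms per page as in the detection example above (the real stopping rule may combine this with the other scores):
```python
def should_stop(new_terms_history: list[int], min_gain_threshold: float = 0.1) -> bool:
    # Hypothetical stop check: compare the latest page's new terms
    # against all unique terms seen so far.
    if not new_terms_history:
        return False
    total_terms = sum(new_terms_history)
    if total_terms == 0:
        return True  # pages are yielding nothing new at all
    gain = new_terms_history[-1] / total_terms
    return gain < min_gain_threshold

# Using the numbers from the detection example: 5 / 100 = 0.05 < 0.1 -> stop
print(should_stop([50, 30, 15, 5]))  # True
```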
## Link Ranking Algorithm
### Expected Information Gain
Each uncrawled link is scored based on:
```python
ExpectedGain(link) = Relevance × Novelty × Authority
```
#### 1. Relevance Scoring
Uses BM25 algorithm on link preview text:
```python
relevance = BM25(link.preview_text, query)
```
Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization
#### 2. Novelty Estimation
Measures how different the link appears from already-crawled content:
```python
novelty = 1 - max_similarity(preview, knowledge_base)
```
Prevents crawling duplicate or highly similar pages.
#### 3. Authority Calculation
URL structure and domain analysis:
```python
authority = f(domain_rank, url_depth, url_structure)
```
Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
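Putting the three factors together, a self-contained toy version of the scoring pass might look like the following; the library's actual BM25, similarity, and domain-rank computations are more sophisticated, and all helper names here are illustrative assumptions:
```python
import math
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def relevance(preview: str, query: str) -> float:
    # Simplified BM25-flavored term matching with length normalization
    terms = tokenize(preview)
    if not terms:
        return 0.0
    hits = sum(terms.count(q) for q in tokenize(query))
    return hits / (1 + math.log1p(len(terms)))

def novelty(preview: str, crawled_texts: list[str]) -> float:
    # 1 - max Jaccard similarity against already-crawled pages
    p = set(tokenize(preview))
    if not p or not crawled_texts:
        return 1.0
    sims = [len(p & set(tokenize(t))) / len(p | set(tokenize(t))) for t in crawled_texts]
    return 1 - max(sims)

def authority(url: str) -> float:
    # Fewer path segments -> higher authority (one of the listed factors)
    depth = url.split("//")[-1].rstrip("/").count("/")
    return 1 / (1 + depth)

def expected_gain(url: str, preview: str, query: str, crawled_texts: list[str]) -> float:
    return relevance(preview, query) * novelty(preview, crawled_texts) * authority(url)
```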
### Custom Link Scoring
```python
class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score
        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%
        # Default scoring
        return 1.0

# Use with adaptive crawler
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
)
```
## Domain-Specific Configurations
### Technical Documentation
```python
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
```
Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth
### News & Articles
```python
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
```
Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)
### E-commerce
```python
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
```
Rationale:
- Balanced threshold for product variations
- Focused link following (avoid infinite products)
- Standard gain threshold
### Research & Academic
```python
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
```
Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references
## Performance Optimization
### Memory Management
```python
# For large crawls, use streaming
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically clean state
if len(state.knowledge_base) > 1000:
    # Keep only most relevant
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
```
### Parallel Processing
```python
# Use multiple start points
start_urls = [
"https://docs.example.com/intro",
"https://docs.example.com/api",
"https://docs.example.com/guides"
]
# Crawl in parallel
tasks = [
adaptive.digest(url, query)
for url in start_urls
]
results = await asyncio.gather(*tasks)
```
### Caching Strategy
```python
# Enable caching for repeated crawls
async with AsyncWebCrawler(
    config=BrowserConfig(
        cache_mode=CacheMode.ENABLED
    )
) as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
```
## Debugging & Analysis
### Enable Verbose Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```
### Analyze Crawl Patterns
```python
# After crawling
state = await adaptive.digest(start_url, query)
# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")
# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```
### Export for Analysis
```python
# Export detailed metrics
import json
metrics = {
"query": query,
"total_pages": len(state.crawled_urls),
"confidence": adaptive.confidence,
"coverage_stats": adaptive.coverage_stats,
"crawl_order": state.crawl_order,
"term_frequencies": dict(state.term_frequencies),
"new_terms_history": state.new_terms_history
}
with open("crawl_analysis.json", "w") as f:
json.dump(metrics, f, indent=2)
```
## Custom Strategies
### Implementing a Custom Strategy
```python
from crawl4ai.adaptive_crawler import BaseStrategy
class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
```
### Combining Strategies
```python
class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
```
## Best Practices
### 1. Start Conservative
Begin with default settings and adjust based on results:
```python
# Start with defaults
result = await adaptive.digest(url, query)
# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
```
### 2. Monitor Resource Usage
```python
import psutil
# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
```
### 3. Use Domain Knowledge
```python
# For API documentation
if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure

# For blogs
if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
```
### 4. Validate Results
```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)
# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()
for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)
coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")
```
## Next Steps
- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)


@@ -66,29 +66,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            config=run_config
        )
        if result.success:
            print(f"Screenshot data present: {result.screenshot is not None}")
            print(f"PDF data present: {result.pdf is not None}")
            if result.screenshot:
                print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            else:
                print("[WARN] Screenshot data is None.")
            if result.pdf:
                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
            else:
                print("[WARN] PDF data is None.")
        else:
            print("[ERROR]", result.error_message)
```


@@ -6,7 +6,7 @@ Many websites now load images **lazily** as you scroll. If you need to ensure th
2. **`scan_full_page`** Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`** Add small delays between scroll steps.
**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md). For sites with virtual scrolling (Twitter/Instagram style), see the [Virtual Scroll docs](virtual-scroll.md).
### Example: Ensuring Lazy Images Appear


@@ -172,7 +172,7 @@ dispatcher = MemoryAdaptiveDispatcher(
3. **`max_session_permit`** (`int`, default: `10`)
The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.
4. **`memory_wait_timeout`** (`float`, default: `600.0`)
Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.
5. **`rate_limiter`** (`RateLimiter`, default: `None`)


@@ -0,0 +1,201 @@
# PDF Processing Strategies
Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.
## `PDFCrawlerStrategy`
### Overview
`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.
### When to Use
Use `PDFCrawlerStrategy` when you need to:
- Process PDF files using the `AsyncWebCrawler`.
- Handle PDFs from both web URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
- Integrate PDF content extraction into a unified `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.
### Key Methods and Their Behavior
- **`__init__(self, logger: AsyncLogger = None)`**:
- Initializes the strategy.
- `logger`: An optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
- **`async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`**:
- This method is called by the `AsyncWebCrawler` during the `arun` process.
- It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`.
- The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to the `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy).
- It sets `response_headers` to indicate "application/pdf" and `status_code` to 200.
- **`async close(self)`**:
- A method for cleaning up any resources used by the strategy. For `PDFCrawlerStrategy`, this is usually minimal.
- **`async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`**:
- Enables asynchronous context management for the strategy, allowing it to be used with `async with`.
### Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
async def main():
    # Initialize the PDF crawler strategy
    pdf_crawler_strategy = PDFCrawlerStrategy()
    # PDFCrawlerStrategy is typically used in conjunction with PDFContentScrapingStrategy
    # The scraping strategy handles the actual PDF content extraction
    pdf_scraping_strategy = PDFContentScrapingStrategy()
    run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)

    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
        # Example with a remote PDF URL
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # A public PDF from arXiv
        print(f"Attempting to process PDF: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_config)
        if result.success:
            print(f"Successfully processed PDF: {result.url}")
            print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
            # Further processing of result.markdown, result.media, etc.
            # would be done here, based on what PDFContentScrapingStrategy extracts.
            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
            else:
                print("No markdown (text) content extracted.")
        else:
            print(f"Failed to process PDF: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
### Pros and Cons
**Pros:**
- Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.
- Provides a consistent interface for specifying PDF sources (URLs or local paths).
- Abstracts the source handling, allowing a separate scraping strategy to focus on PDF content parsing.
**Cons:**
- Does not perform any PDF data extraction itself; it strictly relies on a compatible scraping strategy (like `PDFContentScrapingStrategy`) to process the PDF.
- Has limited utility on its own; most of its value comes from being paired with a PDF-specific content scraping strategy.
---
## `PDFContentScrapingStrategy`
### Overview
`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. This strategy uses the `NaivePDFProcessorStrategy` internally to perform the low-level PDF parsing.
### When to Use
Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
- Extract textual content page by page from a PDF document.
- Retrieve standard metadata embedded within the PDF (e.g., title, author, subject, creation date, page count).
- Optionally, extract images contained within the PDF pages. These images can be saved to a local directory or made available for further processing.
- Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).
### Key Configuration Attributes
When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:
- **`extract_images: bool = False`**: If `True`, the strategy will attempt to extract images from the PDF.
- **`save_images_locally: bool = False`**: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in the `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
- **`image_save_dir: str = None`**: Specifies the directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
- **`batch_size: int = 4`**: Defines how many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
- **`logger: AsyncLogger = None`**: An optional `AsyncLogger` instance for logging.
### Key Methods and Their Behavior
- **`__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`**:
- Initializes the strategy with configurations for image handling, batch processing, and logging. It sets up an internal `NaivePDFProcessorStrategy` instance which performs the actual PDF parsing.
- **`scrap(self, url: str, html: str, **params) -> ScrapingResult`**:
- This is the primary synchronous method called by the crawler (via `ascrap`) to process the PDF.
- `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
- `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
- It first ensures the PDF is accessible locally (downloads it to a temporary file if `url` is remote).
- It then uses its internal PDF processor to extract text, metadata, and images (if configured).
- The extracted information is compiled into a `ScrapingResult` object:
- `cleaned_html`: Contains an HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
- `media`: A dictionary where `media["images"]` will contain information about extracted images if `extract_images` was `True`.
- `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
- `metadata`: A dictionary holding PDF metadata (e.g., title, author, num_pages).
- **`async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`**:
- The asynchronous version of `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
- **`_get_pdf_path(self, url: str) -> str`**:
- A private helper method to manage PDF file access. If the `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
### Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
import os # For creating image directory
async def main():
    # Define the directory for saving extracted images
    image_output_dir = "./my_pdf_images"
    os.makedirs(image_output_dir, exist_ok=True)

    # Configure the PDF content scraping strategy
    # Enable image extraction and specify where to save them
    pdf_scraping_cfg = PDFContentScrapingStrategy(
        extract_images=True,
        save_images_locally=True,
        image_save_dir=image_output_dir,
        batch_size=2  # Process 2 pages at a time for demonstration
    )

    # The PDFCrawlerStrategy is needed to tell AsyncWebCrawler how to "crawl" a PDF
    pdf_crawler_cfg = PDFCrawlerStrategy()

    # Configure the overall crawl run
    run_cfg = CrawlerRunConfig(
        scraping_strategy=pdf_scraping_cfg  # Use our PDF scraping strategy
    )

    # Initialize the crawler with the PDF-specific crawler strategy
    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # Example PDF
        print(f"Starting PDF processing for: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_cfg)
        if result.success:
            print("\n--- PDF Processing Successful ---")
            print(f"Processed URL: {result.url}")
            print("\n--- Metadata ---")
            for key, value in result.metadata.items():
                print(f"  {key.replace('_', ' ').title()}: {value}")
            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print("\n--- Extracted Text (Markdown Snippet) ---")
                print(result.markdown.raw_markdown[:500].strip() + "...")
            else:
                print("\nNo text (markdown) content extracted.")
            if result.media and result.media.get("images"):
                print("\n--- Image Extraction ---")
                print(f"Extracted {len(result.media['images'])} image(s).")
                for i, img_info in enumerate(result.media["images"][:2]):  # Show first 2 images
                    print(f"  Image {i+1}:")
                    print(f"    Page: {img_info.get('page')}")
                    print(f"    Format: {img_info.get('format', 'N/A')}")
                    if img_info.get('path'):
                        print(f"    Saved at: {img_info.get('path')}")
            else:
                print("\nNo images were extracted (or extract_images was False).")
        else:
            print("\n--- PDF Processing Failed ---")
            print(f"Error: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
### Pros and Cons
**Pros:**
- Provides a comprehensive way to extract text, metadata, and (optionally) images from PDF documents.
- Handles both remote PDFs (via URL) and local PDF files.
- Configurable image extraction allows saving images to disk or accessing their data.
- Integrates smoothly with the `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.
**Cons:**
- Extraction quality and performance can vary significantly depending on the PDF's complexity, encoding, and whether it's image-based (scanned) or text-based.
- Image extraction can be resource-intensive (both CPU and disk space if `save_images_locally` is true).
- Relies on `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
- Link extraction from PDFs can be basic and depends on how hyperlinks are embedded in the document.


@@ -25,44 +25,70 @@ Use an authenticated proxy with `BrowserConfig`:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(proxy="http://[username]:[password]@[host]:[port]")
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Rotating Proxies
Example of rotating proxies dynamically with `RoundRobinProxyStrategy`:
```python
import re
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    RoundRobinProxyStrategy,
    ProxyConfig,
)

async def main():
    # Load proxies and create rotation strategy
    # e.g.: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set PROXIES env variable!")
        return
    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy
    )

    urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice
    print("\n🚀 Starting batch crawl with proxy rotation...")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config
        )
        for result in results:
            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None
                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                print("---")

if __name__ == "__main__":
    asyncio.run(main())
```


@@ -45,7 +45,7 @@ Here's an example of crawling GitHub commits across multiple pages while preserv
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():


@@ -0,0 +1,310 @@
# Virtual Scroll
Modern websites increasingly use **virtual scrolling** (also called windowed rendering or viewport rendering) to handle large datasets efficiently. This technique only renders visible items in the DOM, replacing content as users scroll. Popular examples include Twitter's timeline, Instagram's feed, and many data tables.
Crawl4AI's Virtual Scroll feature automatically detects and handles these scenarios, ensuring you capture **all content**, not just what's initially visible.
## Understanding Virtual Scroll
### The Problem
Traditional infinite scroll **appends** new content to existing content. Virtual scroll **replaces** content to maintain performance:
```
Traditional Scroll: Virtual Scroll:
┌─────────────┐ ┌─────────────┐
│ Item 1 │ │ Item 11 │ <- Items 1-10 removed
│ Item 2 │ │ Item 12 │ <- Only visible items
│ ... │ │ Item 13 │ in DOM
│ Item 10 │ │ Item 14 │
│ Item 11 NEW │ │ Item 15 │
│ Item 12 NEW │ └─────────────┘
└─────────────┘
DOM keeps growing DOM size stays constant
```
Without proper handling, crawlers only capture the currently visible items, missing the rest of the content.
### Three Scrolling Scenarios
Crawl4AI's Virtual Scroll detects and handles three scenarios:
1. **No Change** - Content doesn't update on scroll (static page or end reached)
2. **Content Appended** - New items added to existing ones (traditional infinite scroll)
3. **Content Replaced** - Items replaced with new ones (true virtual scroll)
Only scenario 3 requires special handling, which Virtual Scroll automates.
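A simplified sketch of how the three scenarios could be distinguished by comparing DOM snapshots before and after a scroll; the actual detection logic is internal to Crawl4AI and more robust than this string heuristic:
```python
def classify_scroll_behavior(before_html: str, after_html: str) -> str:
    # Illustrative classification of the three scenarios above.
    if after_html == before_html:
        return "no_change"  # scenario 1: static page or end reached
    if before_html in after_html:
        return "appended"   # scenario 2: old content still present
    return "replaced"       # scenario 3: virtual scroll, needs capture/merge
```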
## Basic Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig
# Configure virtual scroll
virtual_config = VirtualScrollConfig(
container_selector="#feed", # CSS selector for scrollable container
scroll_count=20, # Number of scrolls to perform
scroll_by="container_height", # How much to scroll each time
wait_after_scroll=0.5 # Wait time (seconds) after each scroll
)
# Use in crawler configuration
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
# result.html contains ALL items from the virtual scroll
```
## Configuration Parameters
### VirtualScrollConfig
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `container_selector` | `str` | Required | CSS selector for the scrollable container |
| `scroll_count` | `int` | `10` | Maximum number of scrolls to perform |
| `scroll_by` | `str` or `int` | `"container_height"` | Scroll amount per step |
| `wait_after_scroll` | `float` | `0.5` | Seconds to wait after each scroll |
### Scroll By Options
- `"container_height"` - Scroll by the container's visible height
- `"page_height"` - Scroll by the viewport height
- `500` (integer) - Scroll by exact pixel amount
## Real-World Examples
### Twitter-like Timeline
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, BrowserConfig
async def crawl_twitter_timeline():
    # Twitter replaces tweets as you scroll
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",
        scroll_count=30,
        scroll_by="container_height",
        wait_after_scroll=1.0  # Twitter needs time to load
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config,
        # Optional: Set headless=False to watch it work
        # browser_config=BrowserConfig(headless=False)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # Extract tweet count
        import re
        tweets = re.findall(r'data-testid="tweet"', result.html)
        print(f"Captured {len(tweets)} tweets")
```
### Instagram Grid
```python
async def crawl_instagram_grid():
    # Instagram uses a virtualized grid for performance
    virtual_config = VirtualScrollConfig(
        container_selector="article",  # Main feed container
        scroll_count=50,               # More scrolls for grid layout
        scroll_by=800,                 # Fixed pixel scrolling
        wait_after_scroll=0.8
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config,
        screenshot=True  # Capture final state
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.instagram.com/explore/tags/photography/",
            config=config
        )
        # Count posts
        posts = result.html.count('class="post"')
        print(f"Captured {posts} posts from virtualized grid")
```
### Mixed Content (News Feed)
Some sites mix static and virtualized content:
```python
async def crawl_mixed_feed():
    # Featured articles stay, regular articles virtualize
    virtual_config = VirtualScrollConfig(
        container_selector=".main-feed",
        scroll_count=25,
        scroll_by="container_height",
        wait_after_scroll=0.5
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com",
            config=config
        )
        # Featured articles remain throughout
        featured = result.html.count('class="featured-article"')
        regular = result.html.count('class="regular-article"')
        print(f"Featured (static): {featured}")
        print(f"Regular (virtualized): {regular}")
```
## Virtual Scroll vs scan_full_page
Both features handle dynamic content, but serve different purposes:
| Feature | Virtual Scroll | scan_full_page |
|---------|---------------|----------------|
| **Purpose** | Capture content that's replaced during scroll | Load content that's appended during scroll |
| **Use Case** | Twitter, Instagram, virtual tables | Traditional infinite scroll, lazy-loaded images |
| **DOM Behavior** | Replaces elements | Adds elements |
| **Memory Usage** | Efficient (merges content) | Can grow large |
| **Configuration** | Requires container selector | Works on full page |
### When to Use Which?
Use **Virtual Scroll** when:
- Content disappears as you scroll (Twitter timeline)
- DOM element count stays relatively constant
- You need ALL items from a virtualized list
- Container-based scrolling (not full page)
Use **scan_full_page** when:
- Content accumulates as you scroll
- Images load lazily
- Simple "load more" behavior
- Full page scrolling
## Combining with Extraction
Virtual Scroll works seamlessly with extraction strategies:
```python
from crawl4ai import LLMExtractionStrategy
# Define extraction schema
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "author": {"type": "string"},
            "content": {"type": "string"},
            "timestamp": {"type": "string"}
        }
    }
}

# Configure both virtual scroll and extraction
config = CrawlerRunConfig(
    virtual_scroll_config=VirtualScrollConfig(
        container_selector="#timeline",
        scroll_count=20
    ),
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        schema=schema
    )
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="...", config=config)
    # Extracted data from ALL scrolled content
    import json
    posts = json.loads(result.extracted_content)
    print(f"Extracted {len(posts)} posts from virtual scroll")
```
## Performance Tips
1. **Container Selection**: Be specific with selectors. Using the correct container improves performance.
2. **Scroll Count**: Start conservative and increase as needed:
```python
# Start with fewer scrolls
virtual_config = VirtualScrollConfig(
container_selector="#feed",
scroll_count=10 # Test with 10, increase if needed
)
```
3. **Wait Times**: Adjust based on site speed:
```python
# Fast sites
wait_after_scroll=0.2
# Slower sites or heavy content
wait_after_scroll=1.5
```
4. **Debug Mode**: Set `headless=False` to watch scrolling:
```python
browser_config = BrowserConfig(headless=False)
async with AsyncWebCrawler(config=browser_config) as crawler:
    # Watch the scrolling happen
    ...
```
## How It Works Internally
1. **Detection Phase**: Scrolls and compares HTML to detect behavior
2. **Capture Phase**: For replaced content, stores HTML chunks at each position
3. **Merge Phase**: Combines all chunks, removing duplicates based on text content
4. **Result**: Complete HTML with all unique items
The deduplication uses normalized text (lowercase, no spaces/symbols) to ensure accurate merging without false positives.
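A sketch of the normalization-based deduplication described above, assuming captured chunks are lists of item-level HTML strings (the real merge also preserves DOM structure):
```python
import re

def merge_chunks(chunks: list[list[str]]) -> list[str]:
    # Deduplicate items across captured chunks using normalized text:
    # lowercase with spaces/symbols stripped, as described above.
    seen, merged = set(), []
    for chunk in chunks:
        for item_html in chunk:
            text = re.sub(r"<[^>]+>", "", item_html)      # strip tags
            key = re.sub(r"[^a-z0-9]", "", text.lower())  # normalize
            if key and key not in seen:
                seen.add(key)
                merged.append(item_html)
    return merged
```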
## Error Handling
Virtual Scroll handles errors gracefully:
```python
# If container not found or scrolling fails
result = await crawler.arun(url="...", config=config)
if result.success:
    # Virtual scroll worked or wasn't needed
    print(f"Captured {len(result.html)} characters")
else:
    # Crawl failed entirely
    print(f"Error: {result.error_message}")
```
If the container isn't found, crawling continues normally without virtual scroll.
## Complete Example
See our [comprehensive example](/docs/examples/virtual_scroll_example.py) that demonstrates:
- Twitter-like feeds
- Instagram grids
- Traditional infinite scroll
- Mixed content scenarios
- Performance comparisons
```bash
# Run the examples
cd docs/examples
python virtual_scroll_example.py
```
The example includes a local test server with different scrolling behaviors for experimentation.


@@ -0,0 +1,244 @@
# AdaptiveCrawler
The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
## Constructor
```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```
### Parameters
- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
## Primary Method
### digest()
The main method that performs adaptive crawling starting from a URL with a specific query.
```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
#### Parameters
- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from
#### Returns
- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics
#### Example
```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```
## Properties
### confidence
Current confidence score (0-1) indicating information sufficiency.
```python
@property
def confidence(self) -> float
```
### coverage_stats
Dictionary containing detailed coverage statistics.
```python
@property
def coverage_stats(self) -> Dict[str, float]
```
Returns:
- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score
### is_sufficient
Boolean indicating whether sufficient information has been gathered.
```python
@property
def is_sufficient(self) -> bool
```
### state
Access to the current crawl state.
```python
@property
def state(self) -> CrawlState
```
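A short usage sketch tying the properties together (assumes a crawl like the `digest()` example above has already run):
```python
print(f"Confidence: {adaptive.confidence:.2f}")
print(f"Sufficient: {adaptive.is_sufficient}")
for name, value in adaptive.coverage_stats.items():
    print(f"{name}: {value:.2f}")
print(f"Pages crawled: {len(adaptive.state.crawled_urls)}")
```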
## Methods
### get_relevant_content()
Retrieve the most relevant content from the knowledge base.
```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```
#### Parameters
- **top_k** (`int`): Number of top relevant documents to return (default: 5)
#### Returns
List of dictionaries containing:
- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata
### print_stats()
Display crawl statistics in formatted output.
```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```
#### Parameters
- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.
### export_knowledge_base()
Export the collected knowledge base to a JSONL file.
```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Output file path for JSONL export
#### Example
```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```
### import_knowledge_base()
Import a previously exported knowledge base.
```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```
#### Parameters
- **path** (`Union[str, Path]`): Path to JSONL file to import
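#### Example
A minimal usage sketch mirroring the export example above:
```python
adaptive.import_knowledge_base("my_knowledge.jsonl")
```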
## Configuration
The `AdaptiveConfig` class controls the behavior of adaptive crawling:
```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8  # Stop when confidence reaches this
    max_pages: int = 50                # Maximum pages to crawl
    top_k_links: int = 5               # Links to follow per page
    min_gain_threshold: float = 0.1    # Minimum expected gain to continue
    save_state: bool = False           # Auto-save crawl state
    state_path: Optional[str] = None   # Path for state persistence
```
### Example with Custom Config
```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)
adaptive = AdaptiveCrawler(crawler, config=config)
```
## Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```
## See Also
- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)


@@ -215,7 +215,7 @@ Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
async def main():
# Example schema


@@ -217,7 +217,7 @@ Below is an example hooking it all together:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json
async def main():


@@ -0,0 +1,992 @@
# C4A-Script API Reference
Complete reference for all C4A-Script commands, syntax, and advanced features.
## Command Categories
### 🧭 Navigation Commands
Navigate between pages and manage browser history.
#### `GO <url>`
Navigate to a specific URL.
**Syntax:**
```c4a
GO <url>
```
**Parameters:**
- `url` - Target URL (string)
**Examples:**
```c4a
GO https://example.com
GO https://api.example.com/login
GO /relative/path
```
**Notes:**
- Supports both absolute and relative URLs
- Automatically handles protocol detection
- Waits for page load to complete
---
#### `RELOAD`
Refresh the current page.
**Syntax:**
```c4a
RELOAD
```
**Examples:**
```c4a
RELOAD
```
**Notes:**
- Equivalent to pressing F5 or clicking browser refresh
- Waits for page reload to complete
- Preserves current URL
---
#### `BACK`
Navigate back in browser history.
**Syntax:**
```c4a
BACK
```
**Examples:**
```c4a
BACK
```
**Notes:**
- Equivalent to clicking browser back button
- Does nothing if no previous page exists
- Waits for navigation to complete
---
#### `FORWARD`
Navigate forward in browser history.
**Syntax:**
```c4a
FORWARD
```
**Examples:**
```c4a
FORWARD
```
**Notes:**
- Equivalent to clicking browser forward button
- Does nothing if no next page exists
- Waits for navigation to complete
### ⏱️ Wait Commands
Control timing and synchronization with page elements.
#### `WAIT <time>`
Wait for a specified number of seconds.
**Syntax:**
```c4a
WAIT <seconds>
```
**Parameters:**
- `seconds` - Number of seconds to wait (number)
**Examples:**
```c4a
WAIT 3
WAIT 1.5
WAIT 10
```
**Notes:**
- Accepts decimal values
- Useful for giving dynamic content time to load
- Non-blocking for other browser operations
---
#### `WAIT <selector> <timeout>`
Wait for an element to appear on the page.
**Syntax:**
```c4a
WAIT `<selector>` <timeout>
```
**Parameters:**
- `selector` - CSS selector for the element (string in backticks)
- `timeout` - Maximum seconds to wait (number)
**Examples:**
```c4a
WAIT `#content` 10
WAIT `.loading-spinner` 5
WAIT `button[type="submit"]` 15
WAIT `.results .item:first-child` 8
```
**Notes:**
- Fails if element doesn't appear within timeout
- More reliable than fixed time waits
- Supports complex CSS selectors
---
#### `WAIT "<text>" <timeout>`
Wait for specific text to appear anywhere on the page.
**Syntax:**
```c4a
WAIT "<text>" <timeout>
```
**Parameters:**
- `text` - Text content to wait for (string in quotes)
- `timeout` - Maximum seconds to wait (number)
**Examples:**
```c4a
WAIT "Loading complete" 10
WAIT "Welcome back" 5
WAIT "Search results" 15
```
**Notes:**
- Case-sensitive text matching
- Searches entire page content
- Useful for dynamic status messages
### 🖱️ Mouse Commands
Simulate mouse interactions and movements.
#### `CLICK <selector>`
Click on an element specified by CSS selector.
**Syntax:**
```c4a
CLICK `<selector>`
```
**Parameters:**
- `selector` - CSS selector for the element (string in backticks)
**Examples:**
```c4a
CLICK `#submit-button`
CLICK `.menu-item:first-child`
CLICK `button[data-action="save"]`
CLICK `a[href="/dashboard"]`
```
**Notes:**
- Waits for element to be clickable
- Scrolls element into view if necessary
- Handles overlapping elements intelligently
---
#### `CLICK <x> <y>`
Click at specific coordinates on the page.
**Syntax:**
```c4a
CLICK <x> <y>
```
**Parameters:**
- `x` - X coordinate in pixels (number)
- `y` - Y coordinate in pixels (number)
**Examples:**
```c4a
CLICK 100 200
CLICK 500 300
CLICK 0 0
```
**Notes:**
- Coordinates are relative to viewport
- Useful when element selectors are unreliable
- Consider responsive design implications
---
#### `DOUBLE_CLICK <selector>`
Double-click on an element.
**Syntax:**
```c4a
DOUBLE_CLICK `<selector>`
```
**Parameters:**
- `selector` - CSS selector for the element (string in backticks)
**Examples:**
```c4a
DOUBLE_CLICK `.file-icon`
DOUBLE_CLICK `#editable-cell`
DOUBLE_CLICK `.expandable-item`
```
**Notes:**
- Triggers dblclick event
- Common for opening files or editing inline content
- Timing between clicks is automatically handled
---
#### `RIGHT_CLICK <selector>`
Right-click on an element to open context menu.
**Syntax:**
```c4a
RIGHT_CLICK `<selector>`
```
**Parameters:**
- `selector` - CSS selector for the element (string in backticks)
**Examples:**
```c4a
RIGHT_CLICK `#context-target`
RIGHT_CLICK `.menu-trigger`
RIGHT_CLICK `img.thumbnail`
```
**Notes:**
- Opens browser/application context menu
- Useful for testing context menu interactions
- May be blocked by some applications
---
#### `SCROLL <direction> <amount>`
Scroll the page in a specified direction.
**Syntax:**
```c4a
SCROLL <direction> <amount>
```
**Parameters:**
- `direction` - Direction to scroll: `UP`, `DOWN`, `LEFT`, `RIGHT`
- `amount` - Number of pixels to scroll (number)
**Examples:**
```c4a
SCROLL DOWN 500
SCROLL UP 200
SCROLL LEFT 100
SCROLL RIGHT 300
```
**Notes:**
- Smooth scrolling animation
- Useful for infinite scroll pages
- Amount can be larger than viewport
---
#### `MOVE <x> <y>`
Move mouse cursor to specific coordinates.
**Syntax:**
```c4a
MOVE <x> <y>
```
**Parameters:**
- `x` - X coordinate in pixels (number)
- `y` - Y coordinate in pixels (number)
**Examples:**
```c4a
MOVE 200 100
MOVE 500 400
```
**Notes:**
- Triggers hover effects
- Useful for testing mouseover interactions
- Does not click, only moves cursor
---
#### `DRAG <x1> <y1> <x2> <y2>`
Drag from one point to another.
**Syntax:**
```c4a
DRAG <x1> <y1> <x2> <y2>
```
**Parameters:**
- `x1`, `y1` - Starting coordinates (numbers)
- `x2`, `y2` - Ending coordinates (numbers)
**Examples:**
```c4a
DRAG 100 100 500 300
DRAG 0 200 400 200
```
**Notes:**
- Simulates click, drag, and release
- Useful for sliders, resizing, reordering
- Smooth drag animation
### ⌨️ Keyboard Commands
Simulate keyboard input and key presses.
#### `TYPE "<text>"`
Type text into the currently focused element.
**Syntax:**
```c4a
TYPE "<text>"
```
**Parameters:**
- `text` - Text to type (string in quotes)
**Examples:**
```c4a
TYPE "Hello, World!"
TYPE "user@example.com"
TYPE "Password123!"
```
**Notes:**
- Requires an input element to be focused
- Types character by character with realistic timing
- Supports special characters and Unicode
---
#### `TYPE $<variable>`
Type the value of a variable.
**Syntax:**
```c4a
TYPE $<variable>
```
**Parameters:**
- `variable` - Variable name (without quotes)
**Examples:**
```c4a
SETVAR email = "user@example.com"
TYPE $email
```
**Notes:**
- Variable must be defined with SETVAR first
- Variable values are strings
- Useful for reusable credentials or data
---
#### `PRESS <key>`
Press and release a special key.
**Syntax:**
```c4a
PRESS <key>
```
**Parameters:**
- `key` - Key name (see supported keys below)
**Supported Keys:**
- `Tab`, `Enter`, `Escape`, `Space`
- `ArrowUp`, `ArrowDown`, `ArrowLeft`, `ArrowRight`
- `Delete`, `Backspace`
- `Home`, `End`, `PageUp`, `PageDown`
**Examples:**
```c4a
PRESS Tab
PRESS Enter
PRESS Escape
PRESS ArrowDown
```
**Notes:**
- Simulates actual key press and release
- Useful for form navigation and shortcuts
- Case-sensitive key names
---
#### `KEY_DOWN <key>`
Hold down a modifier key.
**Syntax:**
```c4a
KEY_DOWN <key>
```
**Parameters:**
- `key` - Modifier key: `Shift`, `Control`, `Alt`, `Meta`
**Examples:**
```c4a
KEY_DOWN Shift
KEY_DOWN Control
```
**Notes:**
- Must be paired with KEY_UP
- Useful for key combinations
- Meta key is Cmd on Mac, Windows key on PC
---
#### `KEY_UP <key>`
Release a modifier key.
**Syntax:**
```c4a
KEY_UP <key>
```
**Parameters:**
- `key` - Modifier key: `Shift`, `Control`, `Alt`, `Meta`
**Examples:**
```c4a
KEY_UP Shift
KEY_UP Control
```
**Notes:**
- Must be paired with KEY_DOWN
- Releases the specified modifier key
- Good practice to always release held keys
---
#### `CLEAR <selector>`
Clear the content of an input field.
**Syntax:**
```c4a
CLEAR `<selector>`
```
**Parameters:**
- `selector` - CSS selector for input element (string in backticks)
**Examples:**
```c4a
CLEAR `#search-box`
CLEAR `input[name="email"]`
CLEAR `.form-input:first-child`
```
**Notes:**
- Works with input, textarea elements
- Faster than selecting all and deleting
- Triggers appropriate change events
---
#### `SET <selector> "<value>"`
Set the value of an input field directly.
**Syntax:**
```c4a
SET `<selector>` "<value>"
```
**Parameters:**
- `selector` - CSS selector for input element (string in backticks)
- `value` - Value to set (string in quotes)
**Examples:**
```c4a
SET `#email` "user@example.com"
SET `#age` "25"
SET `textarea#message` "Hello, this is a test message."
```
**Notes:**
- Directly sets value without typing animation
- Faster than TYPE for long text
- Triggers change and input events
### 🔀 Control Flow Commands
Add conditional logic and loops to your scripts.
#### `IF (EXISTS <selector>) THEN <command>`
Execute command if element exists.
**Syntax:**
```c4a
IF (EXISTS `<selector>`) THEN <command>
```
**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- `command` - Command to execute if condition is true
**Examples:**
```c4a
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
IF (EXISTS `#popup-modal`) THEN CLICK `.close-button`
IF (EXISTS `.error-message`) THEN RELOAD
```
**Notes:**
- Checks for element existence at time of execution
- Does not wait for element to appear
- Can be combined with ELSE
---
#### `IF (EXISTS <selector>) THEN <command> ELSE <command>`
Execute command based on element existence.
**Syntax:**
```c4a
IF (EXISTS `<selector>`) THEN <command> ELSE <command>
```
**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- First `command` - Execute if condition is true
- Second `command` - Execute if condition is false
**Examples:**
```c4a
IF (EXISTS `.user-menu`) THEN CLICK `.logout` ELSE CLICK `.login`
IF (EXISTS `.loading`) THEN WAIT 5 ELSE CLICK `#continue`
```
**Notes:**
- Exactly one command will be executed
- Useful for handling different page states
- Commands must be on same line
---
#### `IF (NOT EXISTS <selector>) THEN <command>`
Execute command if element does not exist.
**Syntax:**
```c4a
IF (NOT EXISTS `<selector>`) THEN <command>
```
**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- `command` - Command to execute if element doesn't exist
**Examples:**
```c4a
IF (NOT EXISTS `.logged-in`) THEN GO /login
IF (NOT EXISTS `.results`) THEN CLICK `#search-button`
```
**Notes:**
- Inverse of EXISTS condition
- Useful for error handling
- Can check for missing required elements
---
#### `IF (<javascript>) THEN <command>`
Execute command based on JavaScript condition.
**Syntax:**
```c4a
IF (`<javascript>`) THEN <command>
```
**Parameters:**
- `javascript` - JavaScript expression that returns boolean (string in backticks)
- `command` - Command to execute if condition is true
**Examples:**
```c4a
IF (`window.innerWidth < 768`) THEN CLICK `.mobile-menu`
IF (`document.readyState === "complete"`) THEN CLICK `#start`
IF (`localStorage.getItem("user")`) THEN GO /dashboard
```
**Notes:**
- JavaScript executes in browser context
- Must return boolean value
- Access to all browser APIs and globals
---
#### `REPEAT (<command>, <count>)`
Repeat a command a specific number of times.
**Syntax:**
```c4a
REPEAT (<command>, <count>)
```
**Parameters:**
- `command` - Command to repeat
- `count` - Number of times to repeat (number)
**Examples:**
```c4a
REPEAT (SCROLL DOWN 300, 5)
REPEAT (PRESS Tab, 3)
REPEAT (CLICK `.load-more`, 10)
```
**Notes:**
- Executes command exactly count times
- Useful for pagination, scrolling, navigation
- No delay between repetitions (add WAIT if needed)
---
#### `REPEAT (<command>, <condition>)`
Repeat a command while condition is true.
**Syntax:**
```c4a
REPEAT (<command>, `<condition>`)
```
**Parameters:**
- `command` - Command to repeat
- `condition` - JavaScript condition to check (string in backticks)
**Examples:**
```c4a
REPEAT (SCROLL DOWN 500, `document.querySelector(".load-more")`)
REPEAT (PRESS ArrowDown, `window.scrollY < document.body.scrollHeight`)
```
**Notes:**
- Condition checked before each iteration
- JavaScript condition must return boolean
- Be careful to avoid infinite loops
### 💾 Variables and Data
Store and manipulate data within scripts.
#### `SETVAR <name> = "<value>"`
Create or update a variable.
**Syntax:**
```c4a
SETVAR <name> = "<value>"
```
**Parameters:**
- `name` - Variable name (alphanumeric, underscore)
- `value` - Variable value (string in quotes)
**Examples:**
```c4a
SETVAR username = "john@example.com"
SETVAR password = "secret123"
SETVAR base_url = "https://api.example.com"
SETVAR counter = "0"
```
**Notes:**
- Variables are global within script scope
- Values are always strings
- Can be used with TYPE command using $variable syntax
---
#### `EVAL <javascript>`
Execute arbitrary JavaScript code.
**Syntax:**
```c4a
EVAL `<javascript>`
```
**Parameters:**
- `javascript` - JavaScript code to execute (string in backticks)
**Examples:**
```c4a
EVAL `console.log("Script started")`
EVAL `window.scrollTo(0, 0)`
EVAL `localStorage.setItem("test", "value")`
EVAL `document.title = "Automated Test"`
```
**Notes:**
- Full access to browser JavaScript APIs
- Useful for custom logic and debugging
- Return values are not captured
- Be careful with security implications
### 📝 Comments and Documentation
#### `# <comment>`
Add comments to scripts for documentation.
**Syntax:**
```c4a
# <comment text>
```
**Examples:**
```c4a
# This script logs into the application
# Step 1: Navigate to login page
GO /login
# Step 2: Fill credentials
TYPE "user@example.com"
```
**Notes:**
- Comments are ignored during execution
- Useful for documentation and debugging
- Can appear anywhere in script
- Supports multi-line documentation blocks
### 🔧 Procedures (Advanced)
Define reusable command sequences.
#### `PROC <name> ... ENDPROC`
Define a reusable procedure.
**Syntax:**
```c4a
PROC <name>
<commands>
ENDPROC
```
**Parameters:**
- `name` - Procedure name (alphanumeric, underscore)
- `commands` - Commands to include in procedure
**Examples:**
```c4a
PROC login
CLICK `#email`
TYPE $email
CLICK `#password`
TYPE $password
CLICK `#submit`
ENDPROC
PROC handle_popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-modal`) THEN CLICK `.close`
ENDPROC
```
**Notes:**
- Procedures must be defined before use
- Support nested command structures
- Variables are shared with main script scope
---
#### `<procedure_name>`
Call a defined procedure.
**Syntax:**
```c4a
<procedure_name>
```
**Examples:**
```c4a
# Define procedure first
PROC setup
GO /login
WAIT `#form` 5
ENDPROC
# Call procedure
setup
login
```
**Notes:**
- Procedure must be defined before calling
- Can be called multiple times
- No parameters supported (use variables instead)
## Error Handling Best Practices
### 1. Always Use Waits
```c4a
# Bad - element might not be ready
CLICK `#button`
# Good - wait for element first
WAIT `#button` 5
CLICK `#button`
```
### 2. Handle Optional Elements
```c4a
# Check before interacting
IF (EXISTS `.popup`) THEN CLICK `.close`
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
# Then proceed with main flow
CLICK `#main-action`
```
### 3. Use Descriptive Variables
```c4a
# Set up reusable data
SETVAR admin_email = "admin@company.com"
SETVAR test_password = "TestPass123!"
SETVAR staging_url = "https://staging.example.com"
# Use throughout script
GO $staging_url
TYPE $admin_email
```
### 4. Add Debugging Information
```c4a
# Log progress
EVAL `console.log("Starting login process")`
GO /login
# Verify page state
IF (`document.title.includes("Login")`) THEN EVAL `console.log("On login page")`
# Continue with login
TYPE $username
```
## Common Patterns
### Login Flow
```c4a
# Complete login automation
SETVAR email = "user@example.com"
SETVAR password = "mypassword"
GO /login
WAIT `#login-form` 5
# Handle optional cookie banner
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
# Fill and submit form
CLICK `#email`
TYPE $email
PRESS Tab
TYPE $password
CLICK `button[type="submit"]`
# Wait for redirect
WAIT `.dashboard` 10
```
### Infinite Scroll
```c4a
# Load all content with infinite scroll
GO /products
# Scroll and load more content
REPEAT (SCROLL DOWN 500, `document.querySelector(".load-more")`)
# Alternative: Fixed number of scrolls
REPEAT (SCROLL DOWN 800, 10)
WAIT 2
```
### Form Validation
```c4a
# Handle form with validation
SET `#email` "invalid-email"
CLICK `#submit`
# Check for validation error
IF (EXISTS `.error-email`) THEN SET `#email` "valid@example.com"
# Retry submission
CLICK `#submit`
WAIT `.success-message` 5
```
### Multi-step Process
```c4a
# Complex multi-step workflow
PROC navigate_to_step
CLICK `.next-button`
WAIT `.step-content` 5
ENDPROC
# Step 1
WAIT `.step-1` 5
SET `#name` "John Doe"
navigate_to_step
# Step 2
SET `#email` "john@example.com"
navigate_to_step
# Step 3
CLICK `#submit-final`
WAIT `.confirmation` 10
```
## Integration with Crawl4AI
Use C4A-Script with Crawl4AI for dynamic content interaction:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Define interaction script
script = """
# Handle dynamic content loading
WAIT `.content` 5
IF (EXISTS `.load-more-button`) THEN CLICK `.load-more-button`
WAIT `.additional-content` 5
# Accept cookies if needed
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-all`
"""
config = CrawlerRunConfig(
c4a_script=script,
wait_for=".content",
screenshot=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
print(result.markdown)
```
This reference covers all available C4A-Script commands and patterns. For interactive learning, try the [tutorial](../examples/c4a_script/tutorial/) or [live demo](https://docs.crawl4ai.com/c4a-script/demo).

docs/md_v2/api/digest.md Normal file
View File

@@ -0,0 +1,181 @@
# digest()
The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.
## Method Signature
```python
async def digest(
start_url: str,
query: str,
resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```
## Parameters
### start_url
- **Type**: `str`
- **Required**: Yes
- **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.
### query
- **Type**: `str`
- **Required**: Yes
- **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.
### resume_from
- **Type**: `Optional[Union[str, Path]]`
- **Default**: `None`
- **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.
## Return Value
Returns a `CrawlState` object containing:
- **crawled_urls** (`Set[str]`): All URLs that have been crawled
- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
- **pending_links** (`List[Link]`): Links discovered but not yet crawled
- **metrics** (`Dict[str, float]`): Performance and quality metrics
- **query** (`str`): The original query
- Additional statistical information for scoring
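A quick sketch of reading these fields after a crawl (only the attributes listed above are assumed):
```python
state = await adaptive.digest(
    start_url="https://example.com",
    query="search terms"
)

print(f"Visited {len(state.crawled_urls)} pages")
print(f"Pending links: {len(state.pending_links)}")
print(f"Metrics: {state.metrics}")
for page in state.knowledge_base[:3]:  # peek at the first few crawled pages
    print(page.url)
```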
## How It Works
The `digest()` method implements an intelligent crawling algorithm:
1. **Initial Crawl**: Starts from the provided URL
2. **Link Analysis**: Evaluates all discovered links for relevance
3. **Scoring**: Uses three metrics to assess information sufficiency:
- **Coverage**: How well the query terms are covered
- **Consistency**: Information coherence across pages
- **Saturation**: Diminishing returns detection
4. **Adaptive Selection**: Chooses the most promising links to follow
5. **Stopping Decision**: Automatically stops when confidence threshold is reached
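Conceptually, the loop looks something like the following runnable sketch — illustrative only; `fetch_page` and `score_link` are hypothetical stand-ins for Crawl4AI's internals:
```python
import asyncio
import random

async def fetch_page(url: str, query: str):
    """Hypothetical stand-in: crawl a page, return (links, information_gain)."""
    links = [f"{url.rstrip('/')}/sub{i}" for i in range(3)]
    return links, random.uniform(0.05, 0.3)

def score_link(link: str, query: str) -> float:
    """Hypothetical stand-in for relevance x novelty x authority."""
    return random.random()

async def digest_sketch(start_url, query, threshold=0.8, max_pages=20, top_k=3):
    crawled, frontier, confidence = set(), [start_url], 0.0
    while frontier and len(crawled) < max_pages and confidence < threshold:
        url = frontier.pop(0)                       # 1. crawl the next page
        links, gain = await fetch_page(url, query)  # 2. analyze discovered links
        crawled.add(url)
        confidence = min(1.0, confidence + gain)    # 3. update sufficiency scores
        best = sorted(links, key=lambda l: score_link(l, query), reverse=True)
        frontier.extend(best[:top_k])               # 4. follow most promising links
    return crawled, confidence                      # 5. stopped by a condition

print(asyncio.run(digest_sketch("https://docs.example.com", "api auth")))
```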
## Examples
### Basic Usage
```python
async with AsyncWebCrawler() as crawler:
adaptive = AdaptiveCrawler(crawler)
state = await adaptive.digest(
start_url="https://docs.python.org/3/",
query="async await context managers"
)
print(f"Crawled {len(state.crawled_urls)} pages")
print(f"Confidence: {adaptive.confidence:.0%}")
```
### With Configuration
```python
config = AdaptiveConfig(
confidence_threshold=0.9, # Require high confidence
max_pages=30, # Allow more pages
top_k_links=3 # Follow top 3 links per page
)
adaptive = AdaptiveCrawler(crawler, config=config)
state = await adaptive.digest(
start_url="https://api.example.com/docs",
query="authentication endpoints rate limits"
)
```
### Resuming a Previous Crawl
```python
# First crawl - may be interrupted
state1 = await adaptive.digest(
start_url="https://example.com",
query="machine learning algorithms"
)
# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")
# Later, resume from saved state
state2 = await adaptive.digest(
start_url="https://example.com",
query="machine learning algorithms",
resume_from="ml_crawl_state.json"
)
```
### With Progress Monitoring
```python
state = await adaptive.digest(
start_url="https://docs.example.com",
query="api reference"
)
# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")
# View detailed statistics
adaptive.print_stats(detailed=True)
```
## Query Best Practices
1. **Be Specific**: Use descriptive terms that appear in target content
```python
# Good
query = "python async context managers implementation"
# Too broad
query = "python programming"
```
2. **Include Key Terms**: Add technical terms you expect to find
```python
query = "oauth2 jwt refresh tokens authorization"
```
3. **Multiple Concepts**: Combine related concepts for comprehensive coverage
```python
query = "rest api pagination sorting filtering"
```
## Performance Considerations
- **Initial URL**: Choose a page with good navigation (e.g., documentation index)
- **Query Length**: 3-8 terms typically work best
- **Link Density**: Sites with clear navigation crawl more efficiently
- **Caching**: Enable caching for repeated crawls of the same domain
## Error Handling
```python
try:
state = await adaptive.digest(
start_url="https://example.com",
query="search terms"
)
except Exception as e:
print(f"Crawl failed: {e}")
# State is auto-saved if save_state=True in config
```
## Stopping Conditions
The crawl stops when any of these conditions are met:
1. **Confidence Threshold**: Reached the configured confidence level
2. **Page Limit**: Crawled the maximum number of pages
3. **Diminishing Returns**: Expected information gain below threshold
4. **No Relevant Links**: No promising links remain to follow
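After a crawl you can usually tell which condition fired from the returned state — a sketch using only the documented attributes:
```python
state = await adaptive.digest(start_url="https://example.com", query="search terms")

if adaptive.confidence >= 0.8:            # your configured confidence_threshold
    print("Stopped: confidence threshold reached")
elif len(state.crawled_urls) >= 20:       # your configured max_pages
    print("Stopped: page limit reached")
elif not state.pending_links:
    print("Stopped: no relevant links left to follow")
else:
    print("Stopped: diminishing information gain")
```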
## See Also
- [AdaptiveCrawler Class](adaptive-crawler.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Configuration Options](../core/adaptive-crawling.md#configuration-options)

View File

@@ -169,7 +169,46 @@ Use these for link-level content filtering (often to keep crawls “internal”
---
## 2.2 Helper Methods
### H) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |
When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:
```python
from crawl4ai import VirtualScrollConfig
virtual_config = VirtualScrollConfig(
container_selector="#timeline", # CSS selector for scrollable container
scroll_count=30, # Number of times to scroll
scroll_by="container_height", # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
wait_after_scroll=0.5 # Seconds to wait after each scroll for content to load
)
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config
)
```
**VirtualScrollConfig Parameters:**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) |
| **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform |
| **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) |
| **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load |
**When to use Virtual Scroll vs scan_full_page:**
- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)
See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detailed examples.
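For contrast, a traditional infinite-scroll page (content appended rather than replaced) would use `scan_full_page` — a minimal sketch:
```python
from crawl4ai import CrawlerRunConfig

# Content is appended while scrolling, so a full-page scan is sufficient
config = CrawlerRunConfig(
    scan_full_page=True,  # scroll through the entire page before capture
    scroll_delay=0.5      # seconds to pause between scroll steps
)
```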
---
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
@@ -259,7 +298,7 @@ LLMConfig is useful to pass LLM provider config to strategies and functions that
## 3.1 Parameters
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provoder to use.
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use.
| **`api_token`** | 1. Optional. When not provided explicitly, `api_token` is read from environment variables based on the provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment. <br/> 2. API token of the LLM provider, e.g., `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` <br/> 3. Environment variable - use with the prefix "env:", e.g., `api_token = "env: GROQ_API_KEY"` | API token to use for the given provider
| **`base_url`** |Optional. Custom API endpoint | If your provider has a custom endpoint

View File

@@ -169,7 +169,7 @@ OverlappingWindowChunking(
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy
from crawl4ai import LLMConfig
# Define schema
@@ -247,7 +247,7 @@ async with AsyncWebCrawler() as crawler:
### CSS Extraction
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
# Define schema
schema = {

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,396 @@
# C4A-Script Interactive Tutorial
A comprehensive web-based tutorial for learning and experimenting with C4A-Script - Crawl4AI's visual web automation language.
## 🚀 Quick Start
### Prerequisites
- Python 3.7+
- Modern web browser (Chrome, Firefox, Safari, Edge)
### Running the Tutorial
1. **Clone and Navigate**
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai/docs/examples/c4a_script/tutorial/
```
2. **Install Dependencies**
```bash
pip install flask
```
3. **Launch the Server**
```bash
python server.py
```
4. **Open in Browser**
```
http://localhost:8080
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
### Try Your First Script
```c4a
# Basic interaction
GO playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
CLICK `#start-tutorial`
```
## 🎯 What You'll Learn
### Core Features
- **📝 Text Editor**: Write C4A-Script with syntax highlighting
- **🧩 Visual Editor**: Build scripts using drag-and-drop Blockly interface
- **🎬 Recording Mode**: Capture browser actions and auto-generate scripts
- **⚡ Live Execution**: Run scripts in real-time with instant feedback
- **📊 Timeline View**: Visualize and edit automation steps
## 📚 Tutorial Content
### Basic Commands
- **Navigation**: `GO url`
- **Waiting**: `WAIT selector timeout` or `WAIT seconds`
- **Clicking**: `CLICK selector`
- **Typing**: `TYPE "text"`
- **Scrolling**: `SCROLL DOWN/UP amount`
### Control Flow
- **Conditionals**: `IF (condition) THEN action`
- **Loops**: `REPEAT (action, condition)`
- **Procedures**: Define reusable command sequences
### Advanced Features
- **JavaScript evaluation**: `EVAL code`
- **Variables**: `SETVAR name = "value"`
- **Complex selectors**: CSS selectors in backticks
## 🎮 Interactive Playground Features
The tutorial includes a fully interactive web app with:
### 1. **Authentication System**
- Login form with validation
- Session management
- Protected content
### 2. **Dynamic Content**
- Infinite scroll products
- Pagination controls
- Load more buttons
### 3. **Complex Forms**
- Multi-step wizards
- Dynamic field visibility
- Form validation
### 4. **Interactive Elements**
- Tabs and accordions
- Modals and popups
- Expandable content
### 5. **Data Tables**
- Sortable columns
- Search functionality
- Export options
## 🛠️ Tutorial Features
### Live Code Editor
- Syntax highlighting
- Real-time compilation
- Error messages with suggestions
### JavaScript Output Viewer
- See generated JavaScript code
- Edit and test JS directly
- Understand the compilation
### Visual Execution
- Step-by-step progress
- Element highlighting
- Console output
### Example Scripts
Load pre-written examples demonstrating:
- Cookie banner handling
- Login workflows
- Infinite scroll automation
- Multi-step form completion
- Complex interaction sequences
## 📖 Tutorial Sections
### 1. Getting Started
Learn basic commands and syntax:
```c4a
GO https://example.com
WAIT `.content` 5
CLICK `.button`
```
### 2. Handling Dynamic Content
Master waiting strategies and conditionals:
```c4a
IF (EXISTS `.popup`) THEN CLICK `.close`
WAIT `.results` 10
```
### 3. Form Automation
Fill and submit forms:
```c4a
CLICK `#email`
TYPE "user@example.com"
CLICK `button[type="submit"]`
```
### 4. Advanced Workflows
Build complex automation flows:
```c4a
PROC login
CLICK `#username`
TYPE $username
CLICK `#password`
TYPE $password
CLICK `#login-btn`
ENDPROC
SETVAR username = "demo"
SETVAR password = "pass123"
login
```
## 🎯 Practice Challenges
### Challenge 1: Cookie & Popups
Handle the cookie banner and newsletter popup that appear on page load.
### Challenge 2: Complete Login
Successfully log into the application using the demo credentials.
### Challenge 3: Load All Products
Use infinite scroll to load all 100 products in the catalog.
### Challenge 4: Multi-step Survey
Complete the entire multi-step survey form.
### Challenge 5: Full Workflow
Create a script that logs in, browses products, and exports data.
## 💡 Tips & Tricks
### 1. Use Specific Selectors
```c4a
# Good - specific
CLICK `button.submit-order`
# Bad - too generic
CLICK `button`
```
### 2. Always Handle Popups
```c4a
# Check for common popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-modal`) THEN CLICK `.close`
```
### 3. Add Appropriate Waits
```c4a
# Wait for elements before interacting
WAIT `.form` 5
CLICK `#submit`
```
### 4. Use Procedures for Reusability
```c4a
PROC handle_popups
IF (EXISTS `.popup`) THEN CLICK `.close`
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
ENDPROC
# Use anywhere
handle_popups
```
## 🔧 Troubleshooting
### Common Issues
1. **"Element not found"**
- Add a WAIT before clicking
- Check selector specificity
- Verify element exists with IF
2. **"Timeout waiting for selector"**
- Increase timeout value
- Check if element is dynamically loaded
- Verify selector is correct
3. **"Missing THEN keyword"**
- All IF statements need THEN
- Format: `IF (condition) THEN action`
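A defensive snippet that sidesteps all three issues at once:
```c4a
# Wait generously, then guard the interaction
WAIT `#submit` 10
IF (EXISTS `#submit`) THEN CLICK `#submit`
```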
## 🚀 Using with Crawl4AI
Once you've mastered C4A-Script in the tutorial, use it with Crawl4AI:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
config = CrawlerRunConfig(
    c4a_script="""
WAIT `.content` 5
IF (EXISTS `.load-more`) THEN CLICK `.load-more`
WAIT `.new-content` 3
"""
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
```
## 📝 Example Scripts
Check the `scripts/` folder for complete examples:
- `01-basic-interaction.c4a` - Getting started
- `02-login-flow.c4a` - Authentication
- `03-infinite-scroll.c4a` - Dynamic content
- `04-multi-step-form.c4a` - Complex forms
- `05-complex-workflow.c4a` - Full automation
## 🏗️ Developer Guide
### Project Architecture
```
tutorial/
├── server.py # Flask application server
├── assets/ # Tutorial-specific assets
│ ├── app.js # Main application logic
│ ├── c4a-blocks.js # Custom Blockly blocks
│ ├── c4a-generator.js # Code generation
│ ├── blockly-manager.js # Blockly integration
│ └── styles.css # Main styling
├── playground/ # Interactive demo environment
│ ├── index.html # Demo web application
│ ├── app.js # Demo app logic
│ └── styles.css # Demo styling
├── scripts/ # Example C4A scripts
└── index.html # Main tutorial interface
```
### Key Components
#### 1. TutorialApp (`assets/app.js`)
Main application controller managing:
- Code editor integration (CodeMirror)
- Script execution and browser preview
- Tutorial navigation and lessons
- State management and persistence
#### 2. BlocklyManager (`assets/blockly-manager.js`)
Visual programming interface:
- Custom C4A-Script block definitions
- Bidirectional sync between visual blocks and text
- Real-time code generation
- Dark theme integration
#### 3. Recording System
Powers the recording functionality:
- Browser event capture
- Smart event grouping and filtering
- Automatic C4A-Script generation
- Timeline visualization
### Customization
#### Adding New Commands
1. **Define Block** (`assets/c4a-blocks.js`)
2. **Add Generator** (`assets/c4a-generator.js`)
3. **Update Parser** (`assets/blockly-manager.js`)
#### Themes and Styling
- Main styles: `assets/styles.css`
- Theme variables: CSS custom properties
- Dark mode: Auto-applied based on system preference
### Configuration
```python
# server.py configuration
PORT = 8080
DEBUG = True
THREADED = True
```
### API Endpoints
- `GET /` - Main tutorial interface
- `GET /playground/` - Interactive demo environment
- `POST /execute` - Script execution endpoint
- `GET /examples/<script>` - Load example scripts
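For example, you can exercise the execution endpoint directly; note the request body below is a guess — check `server.py` for the actual schema:
```python
import requests

# NOTE: payload shape is hypothetical; inspect server.py for the real fields
resp = requests.post(
    "http://localhost:8080/execute",
    json={"script": "GO playground/\nWAIT `body` 2"},
)
print(resp.status_code, resp.text)
```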
## 🔧 Troubleshooting
### Common Issues
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
# Or use different port
python server.py --port 8081
```
**Blockly Not Loading**
- Check browser console for JavaScript errors
- Verify all static files are served correctly
- Ensure proper script loading order
**Recording Issues**
- Verify iframe permissions
- Check cross-origin communication
- Ensure event listeners are attached
### Debug Mode
Enable detailed logging by setting `DEBUG = true` in `assets/app.js`.
## 📚 Additional Resources
- **[C4A-Script Documentation](../../md_v2/core/c4a-script.md)** - Complete language guide
- **[API Reference](../../md_v2/api/c4a-script-reference.md)** - Detailed command documentation
- **[Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try without installation
- **[Example Scripts](../)** - More automation examples
## 🤝 Contributing
### Bug Reports
1. Check existing issues on GitHub
2. Provide minimal reproduction steps
3. Include browser and system information
4. Add relevant console logs
### Feature Requests
1. Fork the repository
2. Create feature branch: `git checkout -b feature/my-feature`
3. Test thoroughly with different browsers
4. Update documentation
5. Submit pull request
### Code Style
- Use consistent indentation (2 spaces for JS, 4 for Python)
- Add comments for complex logic
- Follow existing naming conventions
- Test with multiple browsers
---
**Happy Automating!** 🎉
Need help? Check our [documentation](https://docs.crawl4ai.com) or open an issue on [GitHub](https://github.com/unclecode/crawl4ai).

Binary file not shown.

View File

@@ -0,0 +1,906 @@
/* ================================================================
C4A-Script Tutorial - App Layout CSS
Terminal theme with Dank Mono font
================================================================ */
/* CSS Variables */
:root {
--bg-primary: #070708;
--bg-secondary: #0e0e10;
--bg-tertiary: #1a1a1b;
--border-color: #2a2a2c;
--border-hover: #3a3a3c;
--text-primary: #e0e0e0;
--text-secondary: #8b8b8d;
--text-muted: #606065;
--primary-color: #0fbbaa;
--primary-hover: #0da89a;
--primary-dim: #0a8577;
--error-color: #ff5555;
--warning-color: #ffb86c;
--success-color: #50fa7b;
--info-color: #8be9fd;
--code-bg: #1e1e20;
--modal-overlay: rgba(0, 0, 0, 0.8);
}
/* Base Reset */
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
/* Fonts */
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
}
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
}
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
}
/* Body & App Container */
body {
font-family: 'Dank Mono', 'Monaco', 'Consolas', monospace;
background: var(--bg-primary);
color: var(--text-primary);
font-size: 14px;
line-height: 1.6;
overflow: hidden;
}
.app-container {
display: flex;
height: 100vh;
width: 100vw;
overflow: hidden;
}
/* Panels */
.editor-panel,
.playground-panel {
display: flex;
flex-direction: column;
height: 100%;
overflow: hidden;
}
.editor-panel {
flex: 1;
background: var(--bg-secondary);
border-right: 1px solid var(--border-color);
min-width: 400px;
}
.playground-panel {
flex: 1;
background: var(--bg-primary);
min-width: 400px;
}
/* Panel Headers */
.panel-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 12px 16px;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
flex-shrink: 0;
}
.panel-header h2 {
font-size: 16px;
font-weight: 600;
color: var(--primary-color);
margin: 0;
}
.header-actions {
display: flex;
gap: 8px;
}
/* Action Buttons */
.action-btn {
display: flex;
align-items: center;
gap: 6px;
padding: 6px 12px;
background: var(--bg-secondary);
color: var(--text-secondary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.action-btn:hover {
background: var(--bg-tertiary);
color: var(--text-primary);
border-color: var(--border-hover);
}
.action-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.action-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
.action-btn .icon {
font-size: 16px;
}
/* Editor Wrapper */
.editor-wrapper {
flex: 1;
display: flex;
overflow: hidden;
position: relative;
z-index: 1; /* Ensure it's above any potential overlays */
}
.editor-wrapper .CodeMirror {
flex: 1;
height: 100%;
font-family: 'Dank Mono', monospace;
font-size: 14px;
line-height: 1.5;
}
/* Ensure CodeMirror is interactive */
.CodeMirror {
background: var(--bg-primary) !important;
}
.CodeMirror-scroll {
overflow: auto !important;
}
/* Make cursor more visible */
.CodeMirror-cursor {
border-left: 2px solid var(--primary-color) !important;
border-left-width: 2px !important;
opacity: 1 !important;
visibility: visible !important;
}
/* Ensure cursor is visible when focused */
.CodeMirror-focused .CodeMirror-cursor {
visibility: visible !important;
}
/* Fix for CodeMirror in flex container */
.CodeMirror-sizer {
min-height: auto !important;
}
/* Remove aggressive pointer-events override */
.CodeMirror-code {
cursor: text;
}
.editor-wrapper textarea {
display: none;
}
/* Output Section (Bottom of Editor) */
.output-section {
height: 250px;
border-top: 1px solid var(--border-color);
display: flex;
flex-direction: column;
flex-shrink: 0;
}
/* Tabs */
.tabs {
display: flex;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
flex-shrink: 0;
}
.tab {
padding: 8px 20px;
background: transparent;
color: var(--text-secondary);
border: none;
border-bottom: 2px solid transparent;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.tab:hover {
color: var(--text-primary);
background: var(--bg-secondary);
}
.tab.active {
color: var(--primary-color);
border-bottom-color: var(--primary-color);
}
/* Tab Content */
.tab-content {
flex: 1;
overflow: hidden;
}
.tab-pane {
display: none;
height: 100%;
overflow-y: auto;
}
.tab-pane.active {
display: block;
}
/* Console */
.console {
padding: 12px;
background: var(--bg-primary);
font-size: 13px;
min-height: 100%;
}
.console-line {
margin-bottom: 8px;
display: flex;
align-items: flex-start;
gap: 8px;
}
.console-prompt {
color: var(--primary-color);
flex-shrink: 0;
}
.console-text {
color: var(--text-primary);
}
.console-error {
color: var(--error-color);
}
.console-warning {
color: var(--warning-color);
}
.console-success {
color: var(--success-color);
}
/* JavaScript Output */
.js-output-header {
display: flex;
justify-content: flex-end;
padding: 8px 12px;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
}
.js-actions {
display: flex;
gap: 8px;
}
.mini-btn {
padding: 4px 8px;
background: var(--bg-secondary);
color: var(--text-secondary);
border: 1px solid var(--border-color);
border-radius: 3px;
font-size: 12px;
cursor: pointer;
transition: all 0.2s;
}
.mini-btn:hover {
background: var(--bg-primary);
color: var(--text-primary);
}
.js-output {
padding: 12px;
background: var(--code-bg);
color: var(--text-primary);
font-family: 'Dank Mono', monospace;
font-size: 13px;
line-height: 1.5;
white-space: pre-wrap;
margin: 0;
min-height: calc(100% - 44px);
}
/* Execution Progress */
.execution-progress {
padding: 12px;
background: var(--bg-primary);
}
.progress-item {
display: flex;
align-items: center;
gap: 8px;
margin-bottom: 8px;
font-size: 13px;
}
.progress-icon {
color: var(--text-muted);
}
.progress-item.active .progress-icon {
color: var(--info-color);
animation: pulse 1s infinite;
}
.progress-item.completed .progress-icon {
color: var(--success-color);
}
.progress-item.error .progress-icon {
color: var(--error-color);
}
/* Playground */
.playground-wrapper {
flex: 1;
overflow: hidden;
}
#playground-frame {
width: 100%;
height: 100%;
border: none;
background: var(--bg-secondary);
}
/* Tutorial Intro Modal */
.tutorial-intro-modal {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: var(--modal-overlay);
display: flex;
align-items: center;
justify-content: center;
z-index: 2000;
transition: opacity 0.3s;
}
.tutorial-intro-modal.hidden {
display: none;
}
.intro-content {
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 32px;
max-width: 500px;
box-shadow: 0 16px 48px rgba(0, 0, 0, 0.6);
}
.intro-content h2 {
color: var(--primary-color);
margin-bottom: 16px;
font-size: 24px;
}
.intro-content p {
color: var(--text-primary);
margin-bottom: 16px;
line-height: 1.6;
}
.intro-content ul {
list-style: none;
margin-bottom: 24px;
}
.intro-content li {
color: var(--text-secondary);
margin-bottom: 8px;
padding-left: 20px;
position: relative;
}
.intro-content li:before {
content: "▸";
position: absolute;
left: 0;
color: var(--primary-color);
}
.intro-actions {
display: flex;
gap: 12px;
justify-content: flex-end;
}
.intro-btn {
padding: 10px 24px;
background: var(--bg-secondary);
color: var(--text-primary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 14px;
cursor: pointer;
transition: all 0.2s;
}
.intro-btn:hover {
background: var(--bg-primary);
border-color: var(--border-hover);
}
.intro-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.intro-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
/* Tutorial Navigation Bar */
.tutorial-nav {
position: fixed;
top: 0;
left: 0;
right: 0;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--primary-color);
z-index: 1000;
transition: transform 0.3s;
}
.tutorial-nav.hidden {
transform: translateY(-100%);
}
.tutorial-nav-content {
display: flex;
align-items: center;
justify-content: space-between;
padding: 16px 24px;
}
.tutorial-left {
flex: 1;
}
.tutorial-step-title {
display: flex;
align-items: center;
gap: 16px;
margin-bottom: 8px;
}
.tutorial-step-title span:first-child {
color: var(--text-secondary);
font-size: 12px;
text-transform: uppercase;
}
.tutorial-step-title span:last-child {
color: var(--primary-color);
font-weight: 600;
font-size: 16px;
}
.tutorial-description {
color: var(--text-primary);
margin: 0;
font-size: 14px;
max-width: 600px;
}
.tutorial-right {
display: flex;
align-items: center;
}
.tutorial-progress-bar {
height: 3px;
background: var(--bg-secondary);
position: absolute;
bottom: 0;
left: 0;
right: 0;
}
.tutorial-progress-bar .progress-fill {
height: 100%;
background: var(--primary-color);
transition: width 0.3s;
}
/* Adjust app container when tutorial is active */
.app-container.tutorial-active {
padding-top: 80px;
}
.tutorial-controls {
display: flex;
gap: 12px;
}
.nav-btn {
padding: 8px 16px;
background: var(--bg-secondary);
color: var(--text-primary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.nav-btn:hover:not(:disabled) {
background: var(--bg-primary);
border-color: var(--border-hover);
}
.nav-btn:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.nav-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.nav-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
.exit-btn {
width: 32px;
height: 32px;
background: transparent;
color: var(--text-secondary);
border: none;
font-size: 20px;
cursor: pointer;
border-radius: 4px;
transition: all 0.2s;
margin-left: 16px;
}
.exit-btn:hover {
background: var(--bg-secondary);
color: var(--text-primary);
}
/* Fullscreen Mode */
.playground-panel.fullscreen {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
z-index: 1500;
}
/* Animations */
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
/* Scrollbar Styling */
::-webkit-scrollbar {
width: 10px;
height: 10px;
}
::-webkit-scrollbar-track {
background: var(--bg-secondary);
}
::-webkit-scrollbar-thumb {
background: var(--border-color);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--border-hover);
}
/* Responsive */
@media (max-width: 768px) {
.app-container {
flex-direction: column;
}
.editor-panel,
.playground-panel {
min-width: auto;
width: 100%;
}
.editor-panel {
border-right: none;
border-bottom: 1px solid var(--border-color);
}
.output-section {
height: 200px;
}
}
/* ================================================================
Recording Timeline Styles
================================================================ */
.action-btn.record {
background: var(--bg-tertiary);
border-color: var(--error-color);
}
.action-btn.record:hover {
background: var(--error-color);
border-color: var(--error-color);
}
.action-btn.record.recording {
background: var(--error-color);
animation: pulse-record 1.5s infinite;
}
.action-btn.record.recording .icon {
animation: blink 1s infinite;
}
/* Renamed: a second `pulse` here would override the keyframes used by progress items */
@keyframes pulse-record {
0%, 100% { opacity: 1; }
50% { opacity: 0.8; }
}
@keyframes blink {
0%, 100% { opacity: 1; }
50% { opacity: 0.3; }
}
.editor-container {
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
#editor-view,
#timeline-view {
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
.recording-timeline {
background: var(--bg-secondary);
display: flex;
flex-direction: column;
height: 100%;
}
.timeline-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 10px 15px;
border-bottom: 1px solid var(--border-color);
background: var(--bg-tertiary);
}
.timeline-header h3 {
font-size: 14px;
font-weight: 600;
color: var(--text-primary);
margin: 0;
}
.timeline-actions {
display: flex;
gap: 8px;
}
.timeline-events {
flex: 1;
overflow-y: auto;
padding: 10px;
}
.timeline-event {
display: flex;
align-items: center;
padding: 8px 10px;
margin-bottom: 6px;
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
border-radius: 4px;
transition: all 0.2s;
cursor: pointer;
}
.timeline-event:hover {
border-color: var(--border-hover);
background: var(--code-bg);
}
.timeline-event.selected {
border-color: var(--primary-color);
background: rgba(15, 187, 170, 0.1);
}
.event-checkbox {
margin-right: 10px;
width: 16px;
height: 16px;
cursor: pointer;
}
.event-time {
font-size: 11px;
color: var(--text-muted);
margin-right: 10px;
font-family: 'Dank Mono', monospace;
min-width: 45px;
}
.event-command {
flex: 1;
font-family: 'Dank Mono', monospace;
font-size: 13px;
color: var(--text-primary);
}
.event-command .cmd-name {
color: var(--primary-color);
font-weight: 600;
}
.event-command .cmd-selector {
color: var(--info-color);
}
.event-command .cmd-value {
color: var(--warning-color);
}
.event-command .cmd-detail {
color: var(--text-secondary);
font-size: 11px;
margin-left: 5px;
}
.event-edit {
margin-left: 10px;
padding: 2px 8px;
font-size: 11px;
background: var(--bg-secondary);
border: 1px solid var(--border-color);
color: var(--text-secondary);
cursor: pointer;
border-radius: 3px;
transition: all 0.2s;
}
.event-edit:hover {
border-color: var(--primary-color);
color: var(--primary-color);
}
/* Event Editor Modal */
.modal-overlay {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: var(--modal-overlay);
z-index: 999;
}
.event-editor-modal {
position: fixed;
top: 50%;
left: 50%;
transform: translate(-50%, -50%);
background: var(--bg-secondary);
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 20px;
z-index: 1000;
min-width: 400px;
}
.event-editor-modal h4 {
margin: 0 0 15px 0;
color: var(--text-primary);
font-family: 'Dank Mono', monospace;
}
.editor-field {
margin-bottom: 15px;
}
.editor-field label {
display: block;
margin-bottom: 5px;
font-size: 12px;
color: var(--text-secondary);
font-family: 'Dank Mono', monospace;
}
.editor-field input,
.editor-field select {
width: 100%;
padding: 8px;
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
color: var(--text-primary);
border-radius: 4px;
font-family: 'Dank Mono', monospace;
font-size: 13px;
}
.editor-field input:focus,
.editor-field select:focus {
outline: none;
border-color: var(--primary-color);
}
.editor-actions {
display: flex;
justify-content: flex-end;
gap: 10px;
margin-top: 20px;
}
/* Blockly Button */
#blockly-btn .icon {
font-size: 16px;
}
/* Hidden State */
.hidden {
display: none !important;
}

File diff suppressed because it is too large

View File

@@ -0,0 +1,591 @@
// Blockly Manager for C4A-Script
// Handles Blockly workspace, code generation, and synchronization with text editor
class BlocklyManager {
constructor(tutorialApp) {
this.app = tutorialApp;
this.workspace = null;
this.isUpdating = false; // Prevent circular updates
this.blocklyVisible = false;
this.toolboxXml = this.generateToolbox();
this.init();
}
init() {
this.setupBlocklyContainer();
this.initializeWorkspace();
this.setupEventHandlers();
this.setupSynchronization();
}
setupBlocklyContainer() {
// Create blockly container div
const editorContainer = document.querySelector('.editor-container');
const blocklyDiv = document.createElement('div');
blocklyDiv.id = 'blockly-view';
blocklyDiv.className = 'blockly-workspace hidden';
blocklyDiv.style.height = '100%';
blocklyDiv.style.width = '100%';
editorContainer.appendChild(blocklyDiv);
}
generateToolbox() {
return `
<xml id="toolbox" style="display: none">
<category name="Navigation" colour="${BlockColors.NAVIGATION}">
<block type="c4a_go"></block>
<block type="c4a_reload"></block>
<block type="c4a_back"></block>
<block type="c4a_forward"></block>
</category>
<category name="Wait" colour="${BlockColors.WAIT}">
<block type="c4a_wait_time">
<field name="SECONDS">3</field>
</block>
<block type="c4a_wait_selector">
<field name="SELECTOR">#content</field>
<field name="TIMEOUT">10</field>
</block>
<block type="c4a_wait_text">
<field name="TEXT">Loading complete</field>
<field name="TIMEOUT">5</field>
</block>
</category>
<category name="Mouse Actions" colour="${BlockColors.ACTIONS}">
<block type="c4a_click">
<field name="SELECTOR">button.submit</field>
</block>
<block type="c4a_click_xy"></block>
<block type="c4a_double_click"></block>
<block type="c4a_right_click"></block>
<block type="c4a_move"></block>
<block type="c4a_drag"></block>
<block type="c4a_scroll">
<field name="DIRECTION">DOWN</field>
<field name="AMOUNT">500</field>
</block>
</category>
<category name="Keyboard" colour="${BlockColors.KEYBOARD}">
<block type="c4a_type">
<field name="TEXT">hello@example.com</field>
</block>
<block type="c4a_type_var">
<field name="VAR">email</field>
</block>
<block type="c4a_clear"></block>
<block type="c4a_set">
<field name="SELECTOR">#email</field>
<field name="VALUE">user@example.com</field>
</block>
<block type="c4a_press">
<field name="KEY">Tab</field>
</block>
<block type="c4a_key_down">
<field name="KEY">Shift</field>
</block>
<block type="c4a_key_up">
<field name="KEY">Shift</field>
</block>
</category>
<category name="Control Flow" colour="${BlockColors.CONTROL}">
<block type="c4a_if_exists">
<field name="SELECTOR">.cookie-banner</field>
</block>
<block type="c4a_if_exists_else">
<field name="SELECTOR">#user</field>
</block>
<block type="c4a_if_not_exists">
<field name="SELECTOR">.modal</field>
</block>
<block type="c4a_if_js">
<field name="CONDITION">window.innerWidth < 768</field>
</block>
<block type="c4a_repeat_times">
<field name="TIMES">5</field>
</block>
<block type="c4a_repeat_while">
<field name="CONDITION">document.querySelector('.load-more')</field>
</block>
</category>
<category name="Variables" colour="${BlockColors.VARIABLES}">
<block type="c4a_setvar">
<field name="NAME">username</field>
<field name="VALUE">john@example.com</field>
</block>
<block type="c4a_eval">
<field name="CODE">console.log('Hello')</field>
</block>
</category>
<category name="Procedures" colour="${BlockColors.PROCEDURES}">
<block type="c4a_proc_def">
<field name="NAME">login</field>
</block>
<block type="c4a_proc_call">
<field name="NAME">login</field>
</block>
</category>
<category name="Comments" colour="#9E9E9E">
<block type="c4a_comment">
<field name="TEXT">Add comment here</field>
</block>
</category>
</xml>`;
}
initializeWorkspace() {
const blocklyDiv = document.getElementById('blockly-view');
// Dark theme configuration
const theme = Blockly.Theme.defineTheme('c4a-dark', {
'base': Blockly.Themes.Classic,
'componentStyles': {
'workspaceBackgroundColour': '#0e0e10',
'toolboxBackgroundColour': '#1a1a1b',
'toolboxForegroundColour': '#e0e0e0',
'flyoutBackgroundColour': '#1a1a1b',
'flyoutForegroundColour': '#e0e0e0',
'flyoutOpacity': 0.9,
'scrollbarColour': '#2a2a2c',
'scrollbarOpacity': 0.5,
'insertionMarkerColour': '#0fbbaa',
'insertionMarkerOpacity': 0.3,
'markerColour': '#0fbbaa',
'cursorColour': '#0fbbaa',
'selectedGlowColour': '#0fbbaa',
'selectedGlowOpacity': 0.4,
'replacementGlowColour': '#0fbbaa',
'replacementGlowOpacity': 0.5
},
'fontStyle': {
'family': 'Dank Mono, Monaco, Consolas, monospace',
'weight': 'normal',
'size': 13
}
});
this.workspace = Blockly.inject(blocklyDiv, {
toolbox: this.toolboxXml,
theme: theme,
grid: {
spacing: 20,
length: 3,
colour: '#2a2a2c',
snap: true
},
zoom: {
controls: true,
wheel: true,
startScale: 1.0,
maxScale: 3,
minScale: 0.3,
scaleSpeed: 1.2
},
trashcan: true,
sounds: false,
media: 'https://unpkg.com/blockly/media/'
});
// Add workspace change listener
this.workspace.addChangeListener((event) => {
if (!this.isUpdating && event.type !== Blockly.Events.UI) {
this.syncBlocksToCode();
}
});
}
setupEventHandlers() {
// Add blockly toggle button
const headerActions = document.querySelector('.editor-panel .header-actions');
const blocklyBtn = document.createElement('button');
blocklyBtn.id = 'blockly-btn';
blocklyBtn.className = 'action-btn';
blocklyBtn.title = 'Toggle Blockly Mode';
blocklyBtn.innerHTML = '<span class="icon">🧩</span>';
// Insert before the Run button
const runBtn = document.getElementById('run-btn');
headerActions.insertBefore(blocklyBtn, runBtn);
blocklyBtn.addEventListener('click', () => this.toggleBlocklyView());
}
setupSynchronization() {
// Listen to CodeMirror changes
this.app.editor.on('change', (instance, changeObj) => {
if (!this.isUpdating && this.blocklyVisible && changeObj.origin !== 'setValue') {
this.syncCodeToBlocks();
}
});
}
toggleBlocklyView() {
const editorView = document.getElementById('editor-view');
const blocklyView = document.getElementById('blockly-view');
const timelineView = document.getElementById('timeline-view');
const blocklyBtn = document.getElementById('blockly-btn');
this.blocklyVisible = !this.blocklyVisible;
if (this.blocklyVisible) {
// Show Blockly
editorView.classList.add('hidden');
timelineView.classList.add('hidden');
blocklyView.classList.remove('hidden');
blocklyBtn.classList.add('active');
// Resize workspace
Blockly.svgResize(this.workspace);
// Sync current code to blocks
this.syncCodeToBlocks();
} else {
// Show editor
blocklyView.classList.add('hidden');
editorView.classList.remove('hidden');
blocklyBtn.classList.remove('active');
// Refresh CodeMirror
setTimeout(() => this.app.editor.refresh(), 100);
}
}
syncBlocksToCode() {
if (this.isUpdating) return;
try {
this.isUpdating = true;
// Generate C4A-Script from blocks using our custom generator
if (typeof c4aGenerator !== 'undefined') {
const code = c4aGenerator.workspaceToCode(this.workspace);
// Process the code to maintain proper formatting
const lines = code.split('\n');
const formattedLines = [];
let lastWasComment = false;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
if (!line) continue;
const isComment = line.startsWith('#');
// Add blank line when transitioning between comments and commands
if (formattedLines.length > 0 && lastWasComment !== isComment) {
formattedLines.push('');
}
formattedLines.push(line);
lastWasComment = isComment;
}
const cleanCode = formattedLines.join('\n');
// Update CodeMirror
this.app.editor.setValue(cleanCode);
}
} catch (error) {
console.error('Error syncing blocks to code:', error);
} finally {
this.isUpdating = false;
}
}
syncCodeToBlocks() {
if (this.isUpdating) return;
try {
this.isUpdating = true;
// Clear workspace
this.workspace.clear();
// Parse C4A-Script and generate blocks
const code = this.app.editor.getValue();
const blocks = this.parseC4AToBlocks(code);
if (blocks) {
Blockly.Xml.domToWorkspace(blocks, this.workspace);
}
} catch (error) {
console.error('Error syncing code to blocks:', error);
// Show error in console
this.app.addConsoleMessage(`Blockly sync error: ${error.message}`, 'warning');
} finally {
this.isUpdating = false;
}
}
parseC4AToBlocks(code) {
const lines = code.split('\n');
const xml = document.createElement('xml');
let yPos = 20;
let previousBlock = null;
let rootBlock = null;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
// Skip empty lines
if (!line) continue;
// Handle comments
if (line.startsWith('#')) {
const commentBlock = this.parseLineToBlock(line, i, lines);
if (commentBlock) {
if (previousBlock) {
// Connect to previous block
const next = document.createElement('next');
next.appendChild(commentBlock);
previousBlock.appendChild(next);
} else {
// First block - set position
commentBlock.setAttribute('x', 20);
commentBlock.setAttribute('y', yPos);
xml.appendChild(commentBlock);
rootBlock = commentBlock;
yPos += 60;
}
previousBlock = commentBlock;
}
continue;
}
const block = this.parseLineToBlock(line, i, lines);
if (block) {
if (previousBlock) {
// Connect to previous block using <next>
const next = document.createElement('next');
next.appendChild(block);
previousBlock.appendChild(next);
} else {
// First block - set position
block.setAttribute('x', 20);
block.setAttribute('y', yPos);
xml.appendChild(block);
rootBlock = block;
yPos += 60;
}
previousBlock = block;
}
}
return xml;
}
parseLineToBlock(line, index, allLines) {
// Navigation commands
if (line.startsWith('GO ')) {
const url = line.substring(3).trim();
return this.createBlock('c4a_go', { 'URL': url });
}
if (line === 'RELOAD') {
return this.createBlock('c4a_reload');
}
if (line === 'BACK') {
return this.createBlock('c4a_back');
}
if (line === 'FORWARD') {
return this.createBlock('c4a_forward');
}
// Wait commands
if (line.startsWith('WAIT ')) {
const parts = line.substring(5).trim();
// Check if it's just a number (wait time)
if (/^\d+(\.\d+)?$/.test(parts)) {
return this.createBlock('c4a_wait_time', { 'SECONDS': parts });
}
// Check for selector wait
const selectorMatch = parts.match(/^`([^`]+)`\s+(\d+)$/);
if (selectorMatch) {
return this.createBlock('c4a_wait_selector', {
'SELECTOR': selectorMatch[1],
'TIMEOUT': selectorMatch[2]
});
}
// Check for text wait
const textMatch = parts.match(/^"([^"]+)"\s+(\d+)$/);
if (textMatch) {
return this.createBlock('c4a_wait_text', {
'TEXT': textMatch[1],
'TIMEOUT': textMatch[2]
});
}
}
// Click commands
if (line.startsWith('CLICK ')) {
const target = line.substring(6).trim();
// Check for coordinates
const coordMatch = target.match(/^(\d+)\s+(\d+)$/);
if (coordMatch) {
return this.createBlock('c4a_click_xy', {
'X': coordMatch[1],
'Y': coordMatch[2]
});
}
// Selector click
const selectorMatch = target.match(/^`([^`]+)`$/);
if (selectorMatch) {
return this.createBlock('c4a_click', {
'SELECTOR': selectorMatch[1]
});
}
}
// Other mouse actions
if (line.startsWith('DOUBLE_CLICK ')) {
const selector = line.substring(13).trim().match(/^`([^`]+)`$/);
if (selector) {
return this.createBlock('c4a_double_click', {
'SELECTOR': selector[1]
});
}
}
if (line.startsWith('RIGHT_CLICK ')) {
const selector = line.substring(12).trim().match(/^`([^`]+)`$/);
if (selector) {
return this.createBlock('c4a_right_click', {
'SELECTOR': selector[1]
});
}
}
// Scroll
if (line.startsWith('SCROLL ')) {
const match = line.match(/^SCROLL\s+(UP|DOWN|LEFT|RIGHT)(?:\s+(\d+))?$/);
if (match) {
return this.createBlock('c4a_scroll', {
'DIRECTION': match[1],
'AMOUNT': match[2] || '500'
});
}
}
// Type commands
if (line.startsWith('TYPE ')) {
const content = line.substring(5).trim();
// Variable type
if (content.startsWith('$')) {
return this.createBlock('c4a_type_var', {
'VAR': content.substring(1)
});
}
// Text type
const textMatch = content.match(/^"([^"]*)"$/);
if (textMatch) {
return this.createBlock('c4a_type', {
'TEXT': textMatch[1]
});
}
}
// SET command
if (line.startsWith('SET ')) {
const match = line.match(/^SET\s+`([^`]+)`\s+"([^"]*)"$/);
if (match) {
return this.createBlock('c4a_set', {
'SELECTOR': match[1],
'VALUE': match[2]
});
}
}
// CLEAR command
if (line.startsWith('CLEAR ')) {
const match = line.match(/^CLEAR\s+`([^`]+)`$/);
if (match) {
return this.createBlock('c4a_clear', {
'SELECTOR': match[1]
});
}
}
// SETVAR command
if (line.startsWith('SETVAR ')) {
const match = line.match(/^SETVAR\s+(\w+)\s*=\s*"([^"]*)"$/);
if (match) {
return this.createBlock('c4a_setvar', {
'NAME': match[1],
'VALUE': match[2]
});
}
}
// IF commands (simplified - only single line)
if (line.startsWith('IF ')) {
// IF EXISTS
const existsMatch = line.match(/^IF\s+\(EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+?)(?:\s+ELSE\s+(.+))?$/);
if (existsMatch) {
if (existsMatch[3]) {
// Has ELSE
const block = this.createBlock('c4a_if_exists_else', {
'SELECTOR': existsMatch[1]
});
// Parse then and else commands - simplified for now
return block;
} else {
// No ELSE
const block = this.createBlock('c4a_if_exists', {
'SELECTOR': existsMatch[1]
});
return block;
}
}
// IF NOT EXISTS
const notExistsMatch = line.match(/^IF\s+\(NOT\s+EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+)$/);
if (notExistsMatch) {
const block = this.createBlock('c4a_if_not_exists', {
'SELECTOR': notExistsMatch[1]
});
return block;
}
}
// Comments
if (line.startsWith('#')) {
return this.createBlock('c4a_comment', {
'TEXT': line.substring(1).trim()
});
}
// If we can't parse it, return null
return null;
}
createBlock(type, fields = {}) {
const block = document.createElement('block');
block.setAttribute('type', type);
// Add fields
for (const [name, value] of Object.entries(fields)) {
const field = document.createElement('field');
field.setAttribute('name', name);
field.textContent = value;
block.appendChild(field);
}
return block;
}
}

View File

@@ -0,0 +1,238 @@
/* Blockly Theme CSS for C4A-Script */
/* Blockly workspace container */
.blockly-workspace {
position: relative;
width: 100%;
height: 100%;
background: var(--bg-primary);
}
/* Blockly button active state */
#blockly-btn.active {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
#blockly-btn.active:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
/* Override Blockly's default styles for dark theme */
.blocklyToolboxDiv {
background-color: var(--bg-tertiary) !important;
border-right: 1px solid var(--border-color) !important;
}
.blocklyFlyout {
background-color: var(--bg-secondary) !important;
}
.blocklyFlyoutBackground {
fill: var(--bg-secondary) !important;
}
.blocklyMainBackground {
stroke: none !important;
}
.blocklyTreeRow {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
padding: 4px 16px !important;
margin: 2px 0 !important;
}
.blocklyTreeRow:hover {
background-color: var(--bg-secondary) !important;
}
.blocklyTreeSelected {
background-color: var(--primary-dim) !important;
}
.blocklyTreeLabel {
cursor: pointer;
}
/* Blockly scrollbars */
.blocklyScrollbarHorizontal,
.blocklyScrollbarVertical {
background-color: transparent !important;
}
.blocklyScrollbarHandle {
fill: var(--border-color) !important;
opacity: 0.5 !important;
}
.blocklyScrollbarHandle:hover {
fill: var(--border-hover) !important;
opacity: 0.8 !important;
}
/* Blockly zoom controls */
.blocklyZoom > image {
opacity: 0.6;
}
.blocklyZoom > image:hover {
opacity: 1;
}
/* Blockly trash can */
.blocklyTrash {
opacity: 0.6;
}
.blocklyTrash:hover {
opacity: 1;
}
/* Blockly context menus */
.blocklyContextMenu {
background-color: var(--bg-tertiary) !important;
border: 1px solid var(--border-color) !important;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}
.blocklyMenuItem {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
}
.blocklyMenuItemDisabled {
color: var(--text-muted) !important;
}
.blocklyMenuItem:hover {
background-color: var(--bg-secondary) !important;
}
/* Blockly text inputs */
.blocklyHtmlInput {
background-color: var(--bg-tertiary) !important;
color: var(--text-primary) !important;
border: 1px solid var(--border-color) !important;
font-family: 'Dank Mono', monospace !important;
font-size: 13px !important;
padding: 4px 8px !important;
}
.blocklyHtmlInput:focus {
border-color: var(--primary-color) !important;
outline: none !important;
}
/* Blockly dropdowns */
.blocklyDropDownDiv {
background-color: var(--bg-tertiary) !important;
border: 1px solid var(--border-color) !important;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}
.blocklyDropDownContent {
color: var(--text-primary) !important;
}
.blocklyDropDownDiv .goog-menuitem {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
padding: 4px 16px !important;
}
.blocklyDropDownDiv .goog-menuitem-highlight,
.blocklyDropDownDiv .goog-menuitem-hover {
background-color: var(--bg-secondary) !important;
}
/* Custom block colors are defined in the block definitions */
/* Block text styling */
.blocklyText {
fill: #ffffff !important;
font-family: 'Dank Mono', monospace !important;
font-size: 13px !important;
}
.blocklyEditableText > .blocklyText {
fill: #ffffff !important;
}
.blocklyEditableText:hover > rect {
stroke: var(--primary-color) !important;
stroke-width: 2px !important;
}
/* Improve visibility of connection highlights */
.blocklyHighlightedConnectionPath {
stroke: var(--primary-color) !important;
stroke-width: 4px !important;
}
.blocklyInsertionMarker > .blocklyPath {
fill-opacity: 0.3 !important;
stroke-opacity: 0.6 !important;
}
/* Workspace grid pattern */
.blocklyWorkspace > .blocklyBlockCanvas > .blocklyGridCanvas {
opacity: 0.1;
}
/* Smooth transitions */
.blocklyDraggable {
transition: transform 0.1s ease;
}
/* Field labels */
.blocklyFieldLabel {
font-weight: normal !important;
}
/* Comment blocks styling */
.blocklyCommentText {
font-style: italic !important;
}
/* Make comment blocks slightly transparent */
g[data-category="Comments"] .blocklyPath {
fill-opacity: 0.8 !important;
}
/* Better visibility for disabled blocks */
.blocklyDisabled > .blocklyPath {
fill-opacity: 0.3 !important;
}
.blocklyDisabled > .blocklyText {
fill-opacity: 0.5 !important;
}
/* Warning and error text */
.blocklyWarningText,
.blocklyErrorText {
font-family: 'Dank Mono', monospace !important;
font-size: 12px !important;
}
/* Workspace scrollbar improvement for dark theme */
::-webkit-scrollbar {
width: 10px;
height: 10px;
}
::-webkit-scrollbar-track {
background: var(--bg-secondary);
}
::-webkit-scrollbar-thumb {
background: var(--border-color);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--border-hover);
}

View File

@@ -0,0 +1,549 @@
// C4A-Script Blockly Block Definitions
// This file defines all custom blocks for C4A-Script commands
// Color scheme for different block categories
const BlockColors = {
NAVIGATION: '#1E88E5', // Blue
ACTIONS: '#43A047', // Green
CONTROL: '#FB8C00', // Orange
VARIABLES: '#8E24AA', // Purple
WAIT: '#E53935', // Red
KEYBOARD: '#00ACC1', // Cyan
PROCEDURES: '#6A1B9A' // Deep Purple
};
// Helper to create selector input with backticks
Blockly.Blocks['c4a_selector_input'] = {
init: function() {
this.appendDummyInput()
.appendField("`")
.appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
.appendField("`");
this.setOutput(true, "Selector");
this.setColour(BlockColors.ACTIONS);
this.setTooltip("CSS selector for element");
}
};
// ============================================
// NAVIGATION BLOCKS
// ============================================
Blockly.Blocks['c4a_go'] = {
init: function() {
this.appendDummyInput()
.appendField("GO")
.appendField(new Blockly.FieldTextInput("https://example.com"), "URL");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Navigate to URL");
}
};
Blockly.Blocks['c4a_reload'] = {
init: function() {
this.appendDummyInput()
.appendField("RELOAD");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Reload current page");
}
};
Blockly.Blocks['c4a_back'] = {
init: function() {
this.appendDummyInput()
.appendField("BACK");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Go back in browser history");
}
};
Blockly.Blocks['c4a_forward'] = {
init: function() {
this.appendDummyInput()
.appendField("FORWARD");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Go forward in browser history");
}
};
// ============================================
// WAIT BLOCKS
// ============================================
Blockly.Blocks['c4a_wait_time'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT")
.appendField(new Blockly.FieldNumber(1, 0), "SECONDS")
.appendField("seconds");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for specified seconds");
}
};
Blockly.Blocks['c4a_wait_selector'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT for")
.appendField("`")
.appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
.appendField("`")
.appendField("max")
.appendField(new Blockly.FieldNumber(10, 1), "TIMEOUT")
.appendField("sec");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for element to appear");
}
};
Blockly.Blocks['c4a_wait_text'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT for text")
.appendField(new Blockly.FieldTextInput("Loading complete"), "TEXT")
.appendField("max")
.appendField(new Blockly.FieldNumber(5, 1), "TIMEOUT")
.appendField("sec");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for text to appear on page");
}
};
// ============================================
// MOUSE ACTION BLOCKS
// ============================================
Blockly.Blocks['c4a_click'] = {
init: function() {
this.appendDummyInput()
.appendField("CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput("button"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Click on element");
}
};
Blockly.Blocks['c4a_click_xy'] = {
init: function() {
this.appendDummyInput()
.appendField("CLICK at")
.appendField("X:")
.appendField(new Blockly.FieldNumber(100, 0), "X")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(100, 0), "Y");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Click at coordinates");
}
};
Blockly.Blocks['c4a_double_click'] = {
init: function() {
this.appendDummyInput()
.appendField("DOUBLE_CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".item"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Double click on element");
}
};
Blockly.Blocks['c4a_right_click'] = {
init: function() {
this.appendDummyInput()
.appendField("RIGHT_CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput("#menu"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Right click on element");
}
};
Blockly.Blocks['c4a_move'] = {
init: function() {
this.appendDummyInput()
.appendField("MOVE to")
.appendField("X:")
.appendField(new Blockly.FieldNumber(500, 0), "X")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(300, 0), "Y");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Move mouse to position");
}
};
Blockly.Blocks['c4a_drag'] = {
init: function() {
this.appendDummyInput()
.appendField("DRAG from")
.appendField("X:")
.appendField(new Blockly.FieldNumber(100, 0), "X1")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(100, 0), "Y1");
this.appendDummyInput()
.appendField("to")
.appendField("X:")
.appendField(new Blockly.FieldNumber(500, 0), "X2")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(300, 0), "Y2");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Drag from one point to another");
}
};
Blockly.Blocks['c4a_scroll'] = {
init: function() {
this.appendDummyInput()
.appendField("SCROLL")
.appendField(new Blockly.FieldDropdown([
["DOWN", "DOWN"],
["UP", "UP"],
["LEFT", "LEFT"],
["RIGHT", "RIGHT"]
]), "DIRECTION")
.appendField(new Blockly.FieldNumber(500, 0), "AMOUNT")
.appendField("pixels");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Scroll in direction");
}
};
// ============================================
// KEYBOARD BLOCKS
// ============================================
Blockly.Blocks['c4a_type'] = {
init: function() {
this.appendDummyInput()
.appendField("TYPE")
.appendField(new Blockly.FieldTextInput("text to type"), "TEXT");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Type text");
}
};
Blockly.Blocks['c4a_type_var'] = {
init: function() {
this.appendDummyInput()
.appendField("TYPE")
.appendField("$")
.appendField(new Blockly.FieldTextInput("variable"), "VAR");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Type variable value");
}
};
Blockly.Blocks['c4a_clear'] = {
init: function() {
this.appendDummyInput()
.appendField("CLEAR")
.appendField("`")
.appendField(new Blockly.FieldTextInput("input"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Clear input field");
}
};
Blockly.Blocks['c4a_set'] = {
init: function() {
this.appendDummyInput()
.appendField("SET")
.appendField("`")
.appendField(new Blockly.FieldTextInput("#input"), "SELECTOR")
.appendField("`")
.appendField("to")
.appendField(new Blockly.FieldTextInput("value"), "VALUE");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Set input field value");
}
};
Blockly.Blocks['c4a_press'] = {
init: function() {
this.appendDummyInput()
.appendField("PRESS")
.appendField(new Blockly.FieldDropdown([
["Tab", "Tab"],
["Enter", "Enter"],
["Escape", "Escape"],
["Space", "Space"],
["ArrowUp", "ArrowUp"],
["ArrowDown", "ArrowDown"],
["ArrowLeft", "ArrowLeft"],
["ArrowRight", "ArrowRight"],
["Delete", "Delete"],
["Backspace", "Backspace"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Press and release key");
}
};
Blockly.Blocks['c4a_key_down'] = {
init: function() {
this.appendDummyInput()
.appendField("KEY_DOWN")
.appendField(new Blockly.FieldDropdown([
["Shift", "Shift"],
["Control", "Control"],
["Alt", "Alt"],
["Meta", "Meta"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Hold key down");
}
};
Blockly.Blocks['c4a_key_up'] = {
init: function() {
this.appendDummyInput()
.appendField("KEY_UP")
.appendField(new Blockly.FieldDropdown([
["Shift", "Shift"],
["Control", "Control"],
["Alt", "Alt"],
["Meta", "Meta"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Release key");
}
};
// ============================================
// CONTROL FLOW BLOCKS
// ============================================
Blockly.Blocks['c4a_if_exists'] = {
init: function() {
this.appendDummyInput()
.appendField("IF EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element exists, then do something");
}
};
Blockly.Blocks['c4a_if_exists_else'] = {
init: function() {
this.appendDummyInput()
.appendField("IF EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.appendDummyInput()
.appendField("ELSE");
this.appendStatementInput("ELSE")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element exists, then do something, else do something else");
}
};
Blockly.Blocks['c4a_if_not_exists'] = {
init: function() {
this.appendDummyInput()
.appendField("IF NOT EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element does not exist, then do something");
}
};
Blockly.Blocks['c4a_if_js'] = {
init: function() {
this.appendDummyInput()
.appendField("IF")
.appendField("`")
.appendField(new Blockly.FieldTextInput("window.innerWidth < 768"), "CONDITION")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If JavaScript condition is true");
}
};
Blockly.Blocks['c4a_repeat_times'] = {
init: function() {
this.appendDummyInput()
.appendField("REPEAT")
.appendField(new Blockly.FieldNumber(5, 1), "TIMES")
.appendField("times");
this.appendStatementInput("DO")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("Repeat commands N times");
}
};
Blockly.Blocks['c4a_repeat_while'] = {
init: function() {
this.appendDummyInput()
.appendField("REPEAT WHILE")
.appendField("`")
.appendField(new Blockly.FieldTextInput("document.querySelector('.load-more')"), "CONDITION")
.appendField("`");
this.appendStatementInput("DO")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("Repeat while condition is true");
}
};
// ============================================
// VARIABLE BLOCKS
// ============================================
Blockly.Blocks['c4a_setvar'] = {
init: function() {
this.appendDummyInput()
.appendField("SETVAR")
.appendField(new Blockly.FieldTextInput("username"), "NAME")
.appendField("=")
.appendField(new Blockly.FieldTextInput("value"), "VALUE");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.VARIABLES);
this.setTooltip("Set variable value");
}
};
// ============================================
// ADVANCED BLOCKS
// ============================================
Blockly.Blocks['c4a_eval'] = {
init: function() {
this.appendDummyInput()
.appendField("EVAL")
.appendField("`")
.appendField(new Blockly.FieldTextInput("console.log('Hello')"), "CODE")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.VARIABLES);
this.setTooltip("Execute JavaScript code");
}
};
Blockly.Blocks['c4a_comment'] = {
init: function() {
this.appendDummyInput()
.appendField("#")
.appendField(new Blockly.FieldTextInput("Comment", null, {
spellcheck: false,
class: 'blocklyCommentText'
}), "TEXT");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour("#616161");
this.setTooltip("Add a comment");
this.setStyle('comment_blocks');
}
};
// ============================================
// PROCEDURE BLOCKS
// ============================================
Blockly.Blocks['c4a_proc_def'] = {
init: function() {
this.appendDummyInput()
.appendField("PROC")
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
this.appendStatementInput("BODY")
.setCheck(null);
this.appendDummyInput()
.appendField("ENDPROC");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.PROCEDURES);
this.setTooltip("Define a procedure");
}
};
Blockly.Blocks['c4a_proc_call'] = {
init: function() {
this.appendDummyInput()
.appendField("Call")
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.PROCEDURES);
this.setTooltip("Call a procedure");
}
};
// Code generators have been moved to c4a-generator.js
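// Illustrative sketch only (not part of the app wiring): one minimal way to
// surface these blocks in a workspace. Assumes a host page with a
// <div id="blocklyDiv"> container; the actual toolbox and injection for this
// app are handled elsewhere (see blockly-manager.js) and may differ.
function exampleInjectC4AWorkspace() {
const toolbox = {
kind: 'flyoutToolbox',
contents: [
{ kind: 'block', type: 'c4a_go' },
{ kind: 'block', type: 'c4a_wait_selector' },
{ kind: 'block', type: 'c4a_click' },
{ kind: 'block', type: 'c4a_if_exists' }
]
};
return Blockly.inject('blocklyDiv', { toolbox: toolbox });
}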

View File

@@ -0,0 +1,261 @@
// C4A-Script Code Generator for Blockly
// Compatible with latest Blockly API
// Create a custom code generator for C4A-Script
const c4aGenerator = new Blockly.Generator('C4A');
// Helper to get a field's raw value (note: no escaping is applied here)
c4aGenerator.getFieldValue = function(block, fieldName) {
return block.getFieldValue(fieldName);
};
// Navigation generators
c4aGenerator.forBlock['c4a_go'] = function(block, generator) {
const url = generator.getFieldValue(block, 'URL');
return `GO ${url}\n`;
};
c4aGenerator.forBlock['c4a_reload'] = function(block, generator) {
return 'RELOAD\n';
};
c4aGenerator.forBlock['c4a_back'] = function(block, generator) {
return 'BACK\n';
};
c4aGenerator.forBlock['c4a_forward'] = function(block, generator) {
return 'FORWARD\n';
};
// Wait generators
c4aGenerator.forBlock['c4a_wait_time'] = function(block, generator) {
const seconds = generator.getFieldValue(block, 'SECONDS');
return `WAIT ${seconds}\n`;
};
c4aGenerator.forBlock['c4a_wait_selector'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const timeout = generator.getFieldValue(block, 'TIMEOUT');
return `WAIT \`${selector}\` ${timeout}\n`;
};
c4aGenerator.forBlock['c4a_wait_text'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
const timeout = generator.getFieldValue(block, 'TIMEOUT');
return `WAIT "${text}" ${timeout}\n`;
};
// Mouse action generators
c4aGenerator.forBlock['c4a_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_click_xy'] = function(block, generator) {
const x = generator.getFieldValue(block, 'X');
const y = generator.getFieldValue(block, 'Y');
return `CLICK ${x} ${y}\n`;
};
c4aGenerator.forBlock['c4a_double_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `DOUBLE_CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_right_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `RIGHT_CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_move'] = function(block, generator) {
const x = generator.getFieldValue(block, 'X');
const y = generator.getFieldValue(block, 'Y');
return `MOVE ${x} ${y}\n`;
};
c4aGenerator.forBlock['c4a_drag'] = function(block, generator) {
const x1 = generator.getFieldValue(block, 'X1');
const y1 = generator.getFieldValue(block, 'Y1');
const x2 = generator.getFieldValue(block, 'X2');
const y2 = generator.getFieldValue(block, 'Y2');
return `DRAG ${x1} ${y1} ${x2} ${y2}\n`;
};
c4aGenerator.forBlock['c4a_scroll'] = function(block, generator) {
const direction = generator.getFieldValue(block, 'DIRECTION');
const amount = generator.getFieldValue(block, 'AMOUNT');
return `SCROLL ${direction} ${amount}\n`;
};
// Keyboard generators
c4aGenerator.forBlock['c4a_type'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
return `TYPE "${text}"\n`;
};
c4aGenerator.forBlock['c4a_type_var'] = function(block, generator) {
const varName = generator.getFieldValue(block, 'VAR');
return `TYPE $${varName}\n`;
};
c4aGenerator.forBlock['c4a_clear'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `CLEAR \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_set'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const value = generator.getFieldValue(block, 'VALUE');
return `SET \`${selector}\` "${value}"\n`;
};
c4aGenerator.forBlock['c4a_press'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `PRESS ${key}\n`;
};
c4aGenerator.forBlock['c4a_key_down'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `KEY_DOWN ${key}\n`;
};
c4aGenerator.forBlock['c4a_key_up'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `KEY_UP ${key}\n`;
};
// Control flow generators
c4aGenerator.forBlock['c4a_if_exists'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
// Multi-line then block
const lines = thenCode.split('\n').filter(line => line.trim());
return lines.map(line => `IF (EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
// Single line
return `IF (EXISTS \`${selector}\`) THEN ${thenCode}\n`;
}
return '';
};
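// For reference, the C4A emitted by the handler above for a THEN branch with
// two statements (CLICK `.accept` and SCROLL DOWN 500) under the selector
// `.cookie-banner` is one guarded line per statement:
//   IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
//   IF (EXISTS `.cookie-banner`) THEN SCROLL DOWN 500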
c4aGenerator.forBlock['c4a_if_exists_else'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
const elseCode = generator.statementToCode(block, 'ELSE').trim();
// For simplicity, only handle single-line then/else
const thenLine = thenCode.split('\n')[0];
const elseLine = elseCode.split('\n')[0];
if (thenLine && elseLine) {
return `IF (EXISTS \`${selector}\`) THEN ${thenLine} ELSE ${elseLine}\n`;
} else if (thenLine) {
return `IF (EXISTS \`${selector}\`) THEN ${thenLine}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_if_not_exists'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
const lines = thenCode.split('\n').filter(line => line.trim());
return lines.map(line => `IF (NOT EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
return `IF (NOT EXISTS \`${selector}\`) THEN ${thenCode}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_if_js'] = function(block, generator) {
const condition = generator.getFieldValue(block, 'CONDITION');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
const lines = thenCode.split('\n').filter(line => line.trim());
return lines.map(line => `IF (\`${condition}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
return `IF (\`${condition}\`) THEN ${thenCode}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_repeat_times'] = function(block, generator) {
const times = generator.getFieldValue(block, 'TIMES');
const doCode = generator.statementToCode(block, 'DO').trim();
if (doCode) {
// C4A REPEAT takes a single command, so only the first line is kept; any extra lines are dropped
const firstLine = doCode.split('\n')[0];
return `REPEAT (${firstLine}, ${times})\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_repeat_while'] = function(block, generator) {
const condition = generator.getFieldValue(block, 'CONDITION');
const doCode = generator.statementToCode(block, 'DO').trim();
if (doCode) {
// C4A REPEAT takes a single command, so only the first line is kept; any extra lines are dropped
const firstLine = doCode.split('\n')[0];
return `REPEAT (${firstLine}, \`${condition}\`)\n`;
}
return '';
};
// Variable generators
c4aGenerator.forBlock['c4a_setvar'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
const value = generator.getFieldValue(block, 'VALUE');
return `SETVAR ${name} = "${value}"\n`;
};
// Advanced generators
c4aGenerator.forBlock['c4a_eval'] = function(block, generator) {
const code = generator.getFieldValue(block, 'CODE');
return `EVAL \`${code}\`\n`;
};
c4aGenerator.forBlock['c4a_comment'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
return `# ${text}\n`;
};
// Procedure generators
c4aGenerator.forBlock['c4a_proc_def'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
const body = generator.statementToCode(block, 'BODY');
return `PROC ${name}\n${body}ENDPROC\n`;
};
c4aGenerator.forBlock['c4a_proc_call'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
return `${name}\n`;
};
// Override scrub_ to handle our custom format
c4aGenerator.scrub_ = function(block, code, opt_thisOnly) {
const nextBlock = block.nextConnection && block.nextConnection.targetBlock();
let nextCode = '';
if (nextBlock) {
if (!opt_thisOnly) {
nextCode = c4aGenerator.blockToCode(nextBlock);
// Add blank line between comment and non-comment blocks
const currentIsComment = block.type === 'c4a_comment';
const nextIsComment = nextBlock.type === 'c4a_comment';
// Add blank line when transitioning from command to comment or vice versa
if (currentIsComment !== nextIsComment && code.trim() && nextCode.trim()) {
nextCode = '\n' + nextCode;
}
}
}
return code + nextCode;
};
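// Illustrative sketch only: converting the current workspace into a C4A
// script string. Assumes `workspace` is the injected Blockly workspace;
// wiring the result into the editor is left to the host app.
function exampleWorkspaceToC4A(workspace) {
// workspaceToCode walks the top-level blocks, dispatches to the forBlock
// handlers defined above, and concatenates their output via scrub_
return c4aGenerator.workspaceToCode(workspace);
}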

View File

@@ -0,0 +1,531 @@
/* DankMono Font Faces */
@font-face {
font-family: 'DankMono';
src: url('DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
}
@font-face {
font-family: 'DankMono';
src: url('DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
}
@font-face {
font-family: 'DankMono';
src: url('DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
}
/* Root Variables - Matching docs theme */
:root {
--global-font-size: 14px;
--global-code-font-size: 13px;
--global-line-height: 1.5em;
--global-space: 10px;
--font-stack: DankMono, Monaco, Courier New, monospace;
--mono-font-stack: DankMono, Monaco, Courier New, monospace;
--background-color: #070708;
--font-color: #e8e9ed;
--invert-font-color: #222225;
--secondary-color: #d5cec0;
--tertiary-color: #a3abba;
--primary-color: #0fbbaa;
--error-color: #ff3c74;
--progress-bar-background: #3f3f44;
--progress-bar-fill: #09b5a5;
--code-bg-color: #3f3f44;
--block-background-color: #202020;
--header-height: 55px;
}
/* Base Styles */
* {
box-sizing: border-box;
}
body {
margin: 0;
padding: 0;
font-family: var(--font-stack);
font-size: var(--global-font-size);
line-height: var(--global-line-height);
color: var(--font-color);
background-color: var(--background-color);
}
/* Terminal Framework */
.terminal {
min-height: 100vh;
}
.container {
width: 100%;
margin: 0 auto;
}
/* Header */
.header-container {
position: fixed;
top: 0;
left: 0;
right: 0;
height: var(--header-height);
background-color: var(--background-color);
border-bottom: 1px solid var(--progress-bar-background);
z-index: 1000;
padding: 0 calc(var(--global-space) * 2);
}
.terminal-nav {
display: flex;
align-items: center;
justify-content: space-between;
height: 100%;
}
.terminal-logo h1 {
margin: 0;
font-size: 1.2em;
color: var(--primary-color);
font-weight: 400;
}
.terminal-menu ul {
list-style: none;
margin: 0;
padding: 0;
display: flex;
gap: 2em;
}
.terminal-menu a {
color: var(--secondary-color);
text-decoration: none;
transition: color 0.2s;
}
.terminal-menu a:hover,
.terminal-menu a.active {
color: var(--primary-color);
}
/* Main Container */
.main-container {
padding-top: calc(var(--header-height) + 2em);
padding-left: 2em;
padding-right: 2em;
max-width: 1400px;
margin: 0 auto;
}
/* Tutorial Grid */
.tutorial-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 2em;
align-items: start;
}
/* Terminal Cards */
.terminal-card {
background-color: var(--block-background-color);
border: 1px solid var(--progress-bar-background);
margin-bottom: 1.5em;
}
.terminal-card header {
background-color: var(--progress-bar-background);
padding: 0.8em 1em;
font-weight: 700;
color: var(--font-color);
display: flex;
justify-content: space-between;
align-items: center;
}
.terminal-card > div {
padding: 1.5em;
}
/* Editor Section */
.editor-controls {
display: flex;
gap: 0.5em;
}
.editor-container {
height: 300px;
overflow: hidden;
}
#c4a-editor {
width: 100%;
height: 100%;
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
background-color: var(--code-bg-color);
color: var(--font-color);
border: none;
padding: 1em;
resize: none;
}
/* JS Output */
.js-output-container {
max-height: 300px;
overflow-y: auto;
}
.js-output-container pre {
margin: 0;
padding: 1em;
background-color: var(--code-bg-color);
}
.js-output-container code {
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
color: var(--font-color);
white-space: pre-wrap;
}
/* Console Output */
.console-output {
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
max-height: 200px;
overflow-y: auto;
padding: 1em;
}
.console-line {
margin-bottom: 0.5em;
}
.console-prompt {
color: var(--primary-color);
margin-right: 0.5em;
}
.console-text {
color: var(--font-color);
}
.console-error {
color: var(--error-color);
}
.console-success {
color: var(--primary-color);
}
/* Playground */
.playground-container {
height: 600px;
background-color: #fff;
border: 1px solid var(--progress-bar-background);
}
#playground-frame {
width: 100%;
height: 100%;
border: none;
}
/* Execution Progress */
.execution-progress {
padding: 1em;
}
.progress-item {
display: flex;
align-items: center;
gap: 0.8em;
margin-bottom: 0.8em;
color: var(--secondary-color);
}
.progress-item.active {
color: var(--primary-color);
}
.progress-item.completed {
color: var(--tertiary-color);
}
.progress-item.error {
color: var(--error-color);
}
.progress-icon {
font-size: 1.2em;
}
/* Buttons */
.btn {
background-color: var(--primary-color);
color: var(--background-color);
border: none;
padding: 0.5em 1em;
font-family: var(--font-stack);
font-size: 0.9em;
cursor: pointer;
transition: all 0.2s;
}
.btn:hover {
background-color: var(--progress-bar-fill);
}
.btn-sm {
padding: 0.3em 0.8em;
font-size: 0.85em;
}
.btn-ghost {
background-color: transparent;
color: var(--secondary-color);
border: 1px solid var(--progress-bar-background);
}
.btn-ghost:hover {
background-color: var(--progress-bar-background);
color: var(--font-color);
}
/* Scrollbars */
::-webkit-scrollbar {
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track {
background: var(--block-background-color);
}
::-webkit-scrollbar-thumb {
background: var(--progress-bar-background);
}
::-webkit-scrollbar-thumb:hover {
background: var(--secondary-color);
}
/* CodeMirror Theme Override */
.CodeMirror {
font-family: var(--mono-font-stack) !important;
font-size: var(--global-code-font-size) !important;
background-color: var(--code-bg-color) !important;
color: var(--font-color) !important;
height: 100% !important;
}
.CodeMirror-gutters {
background-color: var(--progress-bar-background) !important;
border-right: 1px solid var(--progress-bar-background) !important;
}
/* Responsive */
@media (max-width: 1200px) {
.tutorial-grid {
grid-template-columns: 1fr;
}
.playground-section {
order: -1;
}
}
/* Links */
a {
color: var(--primary-color);
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
/* Lists */
ul, ol {
padding-left: 2em;
}
li {
margin-bottom: 0.5em;
}
/* Code */
code {
background-color: var(--code-bg-color);
padding: 0.2em 0.4em;
font-family: var(--mono-font-stack);
font-size: 0.9em;
}
/* Headings */
h1, h2, h3, h4, h5, h6 {
font-weight: 700;
margin-top: 1.5em;
margin-bottom: 0.8em;
}
h3 {
color: var(--primary-color);
font-size: 1.1em;
}
/* Tutorial Panel */
.tutorial-panel {
position: absolute;
top: 60px;
right: 20px;
width: 380px;
background: #1a1a1b;
border: 1px solid #2a2a2c;
border-radius: 8px;
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.5);
z-index: 1000;
transition: all 0.3s ease;
}
.tutorial-panel.hidden {
display: none;
}
.tutorial-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 16px 20px;
border-bottom: 1px solid #2a2a2c;
}
.tutorial-header h3 {
margin: 0;
color: #0fbbaa;
font-size: 18px;
}
.close-btn {
background: none;
border: none;
color: #8b8b8d;
font-size: 24px;
cursor: pointer;
padding: 0;
width: 30px;
height: 30px;
display: flex;
align-items: center;
justify-content: center;
border-radius: 4px;
transition: all 0.2s;
}
.close-btn:hover {
background: #2a2a2c;
color: #e0e0e0;
}
.tutorial-content {
padding: 20px;
}
.tutorial-content p {
margin: 0 0 16px 0;
color: #e0e0e0;
line-height: 1.6;
}
.tutorial-progress {
margin-top: 16px;
}
.tutorial-progress span {
display: block;
margin-bottom: 8px;
color: #8b8b8d;
font-size: 12px;
text-transform: uppercase;
}
.progress-bar {
height: 4px;
background: #2a2a2c;
border-radius: 2px;
overflow: hidden;
}
.progress-fill {
height: 100%;
background: #0fbbaa;
transition: width 0.3s ease;
}
.tutorial-actions {
display: flex;
gap: 12px;
padding: 0 20px 20px;
}
.tutorial-btn {
flex: 1;
padding: 10px 16px;
background: #2a2a2c;
color: #e0e0e0;
border: 1px solid #3a3a3c;
border-radius: 6px;
font-size: 14px;
cursor: pointer;
transition: all 0.2s;
}
.tutorial-btn:hover:not(:disabled) {
background: #3a3a3c;
transform: translateY(-1px);
}
.tutorial-btn:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.tutorial-btn.primary {
background: #0fbbaa;
color: #070708;
border-color: #0fbbaa;
}
.tutorial-btn.primary:hover {
background: #0da89a;
border-color: #0da89a;
}
/* Tutorial Highlights */
.tutorial-highlight {
position: relative;
animation: pulse 2s infinite;
}
@keyframes pulse {
0% {
box-shadow: 0 0 0 0 rgba(15, 187, 170, 0.4);
}
50% {
box-shadow: 0 0 0 10px rgba(15, 187, 170, 0);
}
100% {
box-shadow: 0 0 0 0 rgba(15, 187, 170, 0);
}
}
.editor-card {
position: relative;
}

View File

@@ -0,0 +1,21 @@
# Demo: Login Flow with Blockly
# This script can be created visually using Blockly blocks
GO https://example.com/login
WAIT `#login-form` 5
# Check if already logged in
IF (EXISTS `.user-avatar`) THEN GO https://example.com/dashboard
# Fill login form
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "password123"
# Submit form
CLICK `button[type="submit"]`
WAIT `.dashboard` 10
# Success message
EVAL `console.log('Login successful!')`

View File

@@ -0,0 +1,205 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>C4A-Script Interactive Tutorial | Crawl4AI</title>
<link rel="stylesheet" href="assets/app.css">
<link rel="stylesheet" href="assets/blockly-theme.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/codemirror.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/theme/material-darker.min.css">
</head>
<body>
<!-- Tutorial Intro Modal -->
<div id="tutorial-intro" class="tutorial-intro-modal">
<div class="intro-content">
<h2>Welcome to C4A-Script Tutorial!</h2>
<p>C4A-Script is a simple language for web automation. This interactive tutorial will teach you:</p>
<ul>
<li>How to handle popups and banners</li>
<li>Form filling and navigation</li>
<li>Advanced automation techniques</li>
</ul>
<div class="intro-actions">
<button id="start-tutorial-btn" class="intro-btn primary">Start Tutorial</button>
<button id="skip-tutorial-btn" class="intro-btn">Skip</button>
</div>
</div>
</div>
<!-- Event Editor Modal -->
<div id="event-editor-overlay" class="modal-overlay hidden"></div>
<div id="event-editor-modal" class="event-editor-modal hidden">
<h4>Edit Event</h4>
<div class="editor-field">
<label>Command Type</label>
<select id="edit-command-type" disabled>
<option value="CLICK">CLICK</option>
<option value="DOUBLE_CLICK">DOUBLE_CLICK</option>
<option value="RIGHT_CLICK">RIGHT_CLICK</option>
<option value="TYPE">TYPE</option>
<option value="SET">SET</option>
<option value="SCROLL">SCROLL</option>
<option value="WAIT">WAIT</option>
</select>
</div>
<div id="edit-selector-field" class="editor-field">
<label>Selector</label>
<input type="text" id="edit-selector" placeholder=".class or #id">
</div>
<div id="edit-value-field" class="editor-field">
<label>Value</label>
<input type="text" id="edit-value" placeholder="Text or number">
</div>
<div id="edit-direction-field" class="editor-field hidden">
<label>Direction</label>
<select id="edit-direction">
<option value="UP">UP</option>
<option value="DOWN">DOWN</option>
<option value="LEFT">LEFT</option>
<option value="RIGHT">RIGHT</option>
</select>
</div>
<div class="editor-actions">
<button id="edit-cancel" class="mini-btn">Cancel</button>
<button id="edit-save" class="mini-btn primary">Save</button>
</div>
</div>
<!-- Main App Layout -->
<div class="app-container">
<!-- Left Panel: Editor -->
<div class="editor-panel">
<div class="panel-header">
<h2>C4A-Script Editor</h2>
<div class="header-actions">
<button id="tutorial-btn" class="action-btn" title="Tutorial">
<span class="icon">📚</span>
</button>
<button id="examples-btn" class="action-btn" title="Examples">
<span class="icon">📋</span>
</button>
<button id="clear-btn" class="action-btn" title="Clear">
<span class="icon">🗑</span>
</button>
<button id="run-btn" class="action-btn primary">
<span class="icon"></span>Run
</button>
<button id="record-btn" class="action-btn record">
<span class="icon"></span>Record
</button>
<button id="timeline-btn" class="action-btn timeline hidden" title="View Timeline">
<span class="icon">📊</span>
</button>
</div>
</div>
<div class="editor-container">
<div id="editor-view" class="editor-wrapper">
<textarea id="c4a-editor" placeholder="# Write your C4A script here..."></textarea>
</div>
<!-- Recording Timeline -->
<div id="timeline-view" class="recording-timeline hidden">
<div class="timeline-header">
<h3>Recording Timeline</h3>
<div class="timeline-actions">
<button id="back-to-editor" class="mini-btn">← Back</button>
<button id="select-all-events" class="mini-btn">Select All</button>
<button id="clear-events" class="mini-btn">Clear</button>
<button id="generate-script" class="mini-btn primary">Generate Script</button>
</div>
</div>
<div id="timeline-events" class="timeline-events">
<!-- Events will be added here dynamically -->
</div>
</div>
</div>
<!-- Bottom: Output Tabs -->
<div class="output-section">
<div class="tabs">
<button class="tab active" data-tab="console">Console</button>
<button class="tab" data-tab="javascript">Generated JS</button>
</div>
<div class="tab-content">
<div id="console-tab" class="tab-pane active">
<div id="console-output" class="console">
<div class="console-line">
<span class="console-prompt">$</span>
<span class="console-text">Ready to run C4A scripts...</span>
</div>
</div>
</div>
<div id="javascript-tab" class="tab-pane">
<div class="js-output-header">
<div class="js-actions">
<button id="copy-js-btn" class="mini-btn" title="Copy">
<span>📋</span>
</button>
<button id="edit-js-btn" class="mini-btn" title="Edit">
<span>✏️</span>
</button>
</div>
</div>
<pre id="js-output" class="js-output">// JavaScript will appear here...</pre>
</div>
</div>
</div>
</div>
<!-- Right Panel: Playground -->
<div class="playground-panel">
<div class="panel-header">
<h2>Playground</h2>
<div class="header-actions">
<button id="reset-playground" class="action-btn" title="Reset">
<span class="icon">🔄</span>
</button>
<button id="fullscreen-btn" class="action-btn" title="Fullscreen">
<span class="icon"></span>
</button>
</div>
</div>
<div class="playground-wrapper">
<iframe id="playground-frame" src="playground/" title="Playground"></iframe>
</div>
</div>
</div>
<!-- Tutorial Navigation Bar -->
<div id="tutorial-nav" class="tutorial-nav hidden">
<div class="tutorial-nav-content">
<div class="tutorial-left">
<div class="tutorial-step-title">
<span id="tutorial-step-info">Step 1 of 9</span>
<span id="tutorial-title">Welcome</span>
</div>
<p id="tutorial-description" class="tutorial-description">Let's start by waiting for the page to load.</p>
</div>
<div class="tutorial-right">
<div class="tutorial-controls">
<button id="tutorial-prev" class="nav-btn" disabled>← Previous</button>
<button id="tutorial-next" class="nav-btn primary">Next →</button>
</div>
<button id="tutorial-exit" class="exit-btn" title="Exit Tutorial">×</button>
</div>
</div>
<div class="tutorial-progress-bar">
<div id="tutorial-progress-fill" class="progress-fill"></div>
</div>
</div>
<!-- Scripts -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/codemirror.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/mode/javascript/javascript.min.js"></script>
<!-- Blockly -->
<script src="https://unpkg.com/blockly/blockly.min.js"></script>
<script src="assets/c4a-blocks.js"></script>
<script src="assets/c4a-generator.js"></script>
<script src="assets/blockly-manager.js"></script>
<script src="assets/app.js"></script>
</body>
</html>

View File

@@ -0,0 +1,604 @@
// Playground App JavaScript
class PlaygroundApp {
constructor() {
this.isLoggedIn = false;
this.currentSection = 'home';
this.productsLoaded = 0;
this.maxProducts = 100;
this.tableRowsLoaded = 10;
this.inspectorMode = false;
this.tooltip = null;
this.init();
}
init() {
this.setupCookieBanner();
this.setupNewsletterPopup();
this.setupNavigation();
this.setupAuth();
this.setupProductCatalog();
this.setupForms();
this.setupTabs();
this.setupDataTable();
this.setupInspector();
this.loadInitialData();
}
// Cookie Banner
setupCookieBanner() {
const banner = document.getElementById('cookie-banner');
const acceptBtn = banner.querySelector('.accept');
const declineBtn = banner.querySelector('.decline');
acceptBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('✅ Cookies accepted');
});
declineBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('❌ Cookies declined');
});
}
// Newsletter Popup
setupNewsletterPopup() {
const popup = document.getElementById('newsletter-popup');
const closeBtn = popup.querySelector('.close');
const subscribeBtn = popup.querySelector('.subscribe');
// Show popup after 3 seconds
setTimeout(() => {
popup.style.display = 'flex';
}, 3000);
closeBtn.addEventListener('click', () => {
popup.style.display = 'none';
});
subscribeBtn.addEventListener('click', () => {
const email = popup.querySelector('input').value;
if (email) {
console.log(`📧 Subscribed: ${email}`);
popup.style.display = 'none';
}
});
// Close on outside click
popup.addEventListener('click', (e) => {
if (e.target === popup) {
popup.style.display = 'none';
}
});
}
// Navigation
setupNavigation() {
const navLinks = document.querySelectorAll('.nav-link');
const sections = document.querySelectorAll('.section');
navLinks.forEach(link => {
link.addEventListener('click', (e) => {
e.preventDefault();
const targetId = link.getAttribute('href').substring(1);
// Update active states
navLinks.forEach(l => l.classList.remove('active'));
link.classList.add('active');
// Show target section
sections.forEach(s => s.classList.remove('active'));
const targetSection = document.getElementById(targetId);
if (targetSection) {
targetSection.classList.add('active');
this.currentSection = targetId;
// Load content for specific sections
this.loadSectionContent(targetId);
}
});
});
// Start tutorial button
const startBtn = document.getElementById('start-tutorial');
if (startBtn) {
startBtn.addEventListener('click', () => {
console.log('🚀 Tutorial started!');
alert('Tutorial started! Check the console for progress.');
});
}
}
// Authentication
setupAuth() {
const loginBtn = document.getElementById('login-btn');
const logoutBtn = document.getElementById('logout-btn');
const loginModal = document.getElementById('login-modal');
const loginForm = document.getElementById('login-form');
const closeBtn = loginModal.querySelector('.close');
loginBtn.addEventListener('click', () => {
loginModal.style.display = 'flex';
});
closeBtn.addEventListener('click', () => {
loginModal.style.display = 'none';
});
loginForm.addEventListener('submit', (e) => {
e.preventDefault();
const email = document.getElementById('email').value;
const password = document.getElementById('password').value;
const rememberMe = document.getElementById('remember-me').checked;
const messageEl = document.getElementById('login-message');
// Simple validation
if (email === 'demo@example.com' && password === 'demo123') {
this.isLoggedIn = true;
messageEl.textContent = '✅ Login successful!';
messageEl.className = 'form-message success';
setTimeout(() => {
loginModal.style.display = 'none';
document.getElementById('login-btn').style.display = 'none';
document.getElementById('user-info').style.display = 'flex';
document.getElementById('username-display').textContent = 'Demo User';
console.log(`✅ Logged in${rememberMe ? ' (remembered)' : ''}`);
}, 1000);
} else {
messageEl.textContent = '❌ Invalid credentials. Try demo@example.com / demo123';
messageEl.className = 'form-message error';
}
});
logoutBtn.addEventListener('click', () => {
this.isLoggedIn = false;
document.getElementById('login-btn').style.display = 'block';
document.getElementById('user-info').style.display = 'none';
console.log('👋 Logged out');
});
// Close modal on outside click
loginModal.addEventListener('click', (e) => {
if (e.target === loginModal) {
loginModal.style.display = 'none';
}
});
}
// Product Catalog
setupProductCatalog() {
// View toggle
const infiniteBtn = document.getElementById('infinite-scroll-btn');
const paginationBtn = document.getElementById('pagination-btn');
const infiniteView = document.getElementById('infinite-scroll-view');
const paginationView = document.getElementById('pagination-view');
infiniteBtn.addEventListener('click', () => {
infiniteBtn.classList.add('active');
paginationBtn.classList.remove('active');
infiniteView.style.display = 'block';
paginationView.style.display = 'none';
this.setupInfiniteScroll();
});
paginationBtn.addEventListener('click', () => {
paginationBtn.classList.add('active');
infiniteBtn.classList.remove('active');
paginationView.style.display = 'block';
infiniteView.style.display = 'none';
});
// Load more button
const loadMoreBtn = paginationView.querySelector('.load-more');
loadMoreBtn.addEventListener('click', () => {
this.loadMoreProducts();
});
// Collapsible filters
const collapsibles = document.querySelectorAll('.collapsible');
collapsibles.forEach(header => {
header.addEventListener('click', () => {
const content = header.nextElementSibling;
const toggle = header.querySelector('.toggle');
content.style.display = content.style.display === 'none' ? 'block' : 'none';
toggle.textContent = content.style.display === 'none' ? '▶' : '▼';
});
});
}
setupInfiniteScroll() {
const container = document.querySelector('.products-container');
const loadingIndicator = document.getElementById('loading-indicator');
container.addEventListener('scroll', () => {
if (container.scrollTop + container.clientHeight >= container.scrollHeight - 100) {
if (this.productsLoaded < this.maxProducts) {
loadingIndicator.style.display = 'block';
setTimeout(() => {
this.loadMoreProducts();
loadingIndicator.style.display = 'none';
}, 1000);
}
}
});
}
loadMoreProducts() {
const grid = document.getElementById('product-grid');
const batch = 10;
let added = 0;
// Stop at the catalog cap so the logged count stays accurate
for (let i = 0; i < batch && this.productsLoaded < this.maxProducts; i++) {
const product = this.createProductCard(this.productsLoaded + 1);
grid.appendChild(product);
this.productsLoaded++;
added++;
}
console.log(`📦 Loaded ${added} more products. Total: ${this.productsLoaded}`);
}
createProductCard(id) {
const card = document.createElement('div');
card.className = 'product-card';
card.innerHTML = `
<div class="product-image">📦</div>
<div class="product-name">Product ${id}</div>
<div class="product-price">$${(Math.random() * 100 + 10).toFixed(2)}</div>
<button class="btn btn-sm">Quick View</button>
`;
// Quick view functionality
const quickViewBtn = card.querySelector('button');
quickViewBtn.addEventListener('click', () => {
alert(`Quick view for Product ${id}`);
});
return card;
}
// Forms
setupForms() {
// Contact Form
const contactForm = document.getElementById('contact-form');
const subjectSelect = document.getElementById('contact-subject');
const departmentGroup = document.getElementById('department-group');
const departmentSelect = document.getElementById('department');
subjectSelect.addEventListener('change', () => {
if (subjectSelect.value === 'support') {
departmentGroup.style.display = 'block';
departmentSelect.innerHTML = `
<option value="">Select department</option>
<option value="technical">Technical Support</option>
<option value="billing">Billing Support</option>
<option value="general">General Support</option>
`;
} else {
departmentGroup.style.display = 'none';
}
});
contactForm.addEventListener('submit', (e) => {
e.preventDefault();
const messageDisplay = document.getElementById('contact-message-display');
messageDisplay.textContent = '✅ Message sent successfully!';
messageDisplay.className = 'form-message success';
console.log('📧 Contact form submitted');
});
// Multi-step Form
const surveyForm = document.getElementById('survey-form');
const steps = surveyForm.querySelectorAll('.form-step');
const progressFill = document.getElementById('progress-fill');
let currentStep = 1;
surveyForm.addEventListener('click', (e) => {
if (e.target.classList.contains('next-step')) {
if (currentStep < 3) {
steps[currentStep - 1].style.display = 'none';
currentStep++;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
} else if (e.target.classList.contains('prev-step')) {
if (currentStep > 1) {
steps[currentStep - 1].style.display = 'none';
currentStep--;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
}
});
surveyForm.addEventListener('submit', (e) => {
e.preventDefault();
document.getElementById('survey-success').style.display = 'block';
console.log('📋 Survey submitted successfully!');
});
}
// Tabs
setupTabs() {
const tabBtns = document.querySelectorAll('.tab-btn');
const tabPanes = document.querySelectorAll('.tab-pane');
tabBtns.forEach(btn => {
btn.addEventListener('click', () => {
const targetTab = btn.getAttribute('data-tab');
// Update active states
tabBtns.forEach(b => b.classList.remove('active'));
btn.classList.add('active');
// Show target pane
tabPanes.forEach(pane => {
pane.style.display = pane.id === targetTab ? 'block' : 'none';
});
});
});
// Show more functionality
const showMoreBtn = document.querySelector('.show-more');
const hiddenText = document.querySelector('.hidden-text');
if (showMoreBtn) {
showMoreBtn.addEventListener('click', () => {
if (hiddenText.style.display === 'none') {
hiddenText.style.display = 'block';
showMoreBtn.textContent = 'Show Less';
} else {
hiddenText.style.display = 'none';
showMoreBtn.textContent = 'Show More';
}
});
}
// Load comments
const loadCommentsBtn = document.querySelector('.load-comments');
const commentsSection = document.querySelector('.comments-section');
if (loadCommentsBtn) {
loadCommentsBtn.addEventListener('click', () => {
commentsSection.style.display = 'block';
commentsSection.innerHTML = `
<div class="comment">
<div class="comment-author">John Doe</div>
<div class="comment-text">Great product! Highly recommended.</div>
</div>
<div class="comment">
<div class="comment-author">Jane Smith</div>
<div class="comment-text">Excellent quality and fast shipping.</div>
</div>
`;
loadCommentsBtn.style.display = 'none';
console.log('💬 Comments loaded');
});
}
}
// Data Table
setupDataTable() {
const loadMoreBtn = document.querySelector('.load-more-rows');
const searchInput = document.querySelector('.search-input');
const exportBtn = document.getElementById('export-btn');
const sortableHeaders = document.querySelectorAll('.sortable');
// Load more rows
loadMoreBtn.addEventListener('click', () => {
this.loadMoreTableRows();
});
// Search functionality
searchInput.addEventListener('input', (e) => {
const searchTerm = e.target.value.toLowerCase();
const rows = document.querySelectorAll('#table-body tr');
rows.forEach(row => {
const text = row.textContent.toLowerCase();
row.style.display = text.includes(searchTerm) ? '' : 'none';
});
});
// Export functionality
exportBtn.addEventListener('click', () => {
console.log('📊 Exporting table data...');
alert('Table data exported! (Check console)');
});
// Sorting
sortableHeaders.forEach(header => {
header.addEventListener('click', () => {
console.log(`🔄 Sorting by ${header.getAttribute('data-sort')}`);
});
});
}
loadMoreTableRows() {
const tbody = document.getElementById('table-body');
const batch = 10;
for (let i = 0; i < batch; i++) {
const row = document.createElement('tr');
const id = this.tableRowsLoaded + i + 1;
row.innerHTML = `
<td>User ${id}</td>
<td>user${id}@example.com</td>
<td>${new Date().toLocaleDateString()}</td>
<td><button class="btn btn-sm">Edit</button></td>
`;
tbody.appendChild(row);
}
this.tableRowsLoaded += batch;
console.log(`📄 Loaded ${batch} more rows. Total: ${this.tableRowsLoaded}`);
}
// Load initial data
loadInitialData() {
// Load initial products
this.loadMoreProducts();
// Load initial table rows
this.loadMoreTableRows();
}
// Load content when navigating to sections
loadSectionContent(sectionId) {
switch(sectionId) {
case 'catalog':
// Ensure products are loaded in catalog
if (this.productsLoaded === 0) {
this.loadMoreProducts();
}
break;
case 'data-tables':
// Ensure table rows are loaded
if (this.tableRowsLoaded === 0) {
this.loadMoreTableRows();
}
break;
case 'forms':
// Forms are already set up
break;
case 'tabs':
// Tabs content is static
break;
}
}
// Inspector Mode
setupInspector() {
const inspectorBtn = document.getElementById('inspector-btn');
// Create tooltip element
this.tooltip = document.createElement('div');
this.tooltip.className = 'inspector-tooltip';
this.tooltip.style.cssText = `
position: fixed;
background: rgba(0, 0, 0, 0.9);
color: white;
padding: 8px 12px;
border-radius: 4px;
font-size: 12px;
font-family: monospace;
pointer-events: none;
z-index: 10000;
display: none;
max-width: 300px;
`;
document.body.appendChild(this.tooltip);
inspectorBtn.addEventListener('click', () => {
this.toggleInspector();
});
// Add mouse event listeners
document.addEventListener('mousemove', this.handleMouseMove.bind(this));
document.addEventListener('mouseout', this.handleMouseOut.bind(this));
}
toggleInspector() {
this.inspectorMode = !this.inspectorMode;
const inspectorBtn = document.getElementById('inspector-btn');
if (this.inspectorMode) {
inspectorBtn.classList.add('active');
inspectorBtn.style.background = '#0fbbaa';
document.body.style.cursor = 'crosshair';
} else {
inspectorBtn.classList.remove('active');
inspectorBtn.style.background = '';
document.body.style.cursor = '';
this.tooltip.style.display = 'none';
this.removeHighlight();
}
}
handleMouseMove(e) {
if (!this.inspectorMode) return;
const element = e.target;
if (element === this.tooltip) return;
// Highlight element
this.highlightElement(element);
// Show tooltip with element info
const info = this.getElementInfo(element);
this.tooltip.innerHTML = info;
this.tooltip.style.display = 'block';
// Position tooltip
const x = e.clientX + 15;
const y = e.clientY + 15;
// Adjust position if tooltip would go off screen
const rect = this.tooltip.getBoundingClientRect();
const adjustedX = x + rect.width > window.innerWidth ? x - rect.width - 30 : x;
const adjustedY = y + rect.height > window.innerHeight ? y - rect.height - 30 : y;
this.tooltip.style.left = adjustedX + 'px';
this.tooltip.style.top = adjustedY + 'px';
}
handleMouseOut(e) {
if (!this.inspectorMode) return;
if (e.target === document.body) {
this.removeHighlight();
this.tooltip.style.display = 'none';
}
}
highlightElement(element) {
this.removeHighlight();
element.style.outline = '2px solid #0fbbaa';
element.style.outlineOffset = '1px';
element.setAttribute('data-inspector-highlighted', 'true');
}
removeHighlight() {
const highlighted = document.querySelector('[data-inspector-highlighted]');
if (highlighted) {
highlighted.style.outline = '';
highlighted.style.outlineOffset = '';
highlighted.removeAttribute('data-inspector-highlighted');
}
}
getElementInfo(element) {
const tagName = element.tagName.toLowerCase();
const id = element.id ? `#${element.id}` : '';
const classes = element.className ?
`.${element.className.split(' ').filter(c => c).join('.')}` : '';
let selector = tagName;
if (id) {
selector = id;
} else if (classes) {
selector = `${tagName}${classes}`;
}
// Build info HTML
let info = `<strong>${selector}</strong>`;
// Add additional attributes
const attrs = [];
if (element.name) attrs.push(`name="${element.name}"`);
if (element.type) attrs.push(`type="${element.type}"`);
if (element.href) attrs.push(`href="${element.href}"`);
if (element.value && element.tagName === 'INPUT') attrs.push(`value="${element.value}"`);
if (attrs.length > 0) {
info += `<br><span style="color: #888;">${attrs.join(' ')}</span>`;
}
return info;
}
}
// Initialize app when DOM is ready
document.addEventListener('DOMContentLoaded', () => {
window.playgroundApp = new PlaygroundApp();
console.log('🎮 Playground app initialized!');
});
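// Illustrative only: since the app instance is exposed on window above, the
// element inspector can also be toggled from the browser console:
//   window.playgroundApp.toggleInspector();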

View File

@@ -0,0 +1,328 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>C4A-Script Playground</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<!-- Cookie Banner -->
<div class="cookie-banner" id="cookie-banner">
<div class="cookie-content">
<p>🍪 We use cookies to enhance your experience. By continuing, you agree to our cookie policy.</p>
<div class="cookie-actions">
<button class="btn accept">Accept All</button>
<button class="btn btn-secondary decline">Decline</button>
</div>
</div>
</div>
<!-- Newsletter Popup (appears after 3 seconds) -->
<div class="modal" id="newsletter-popup" style="display: none;">
<div class="modal-content">
<span class="close">&times;</span>
<h2>📬 Subscribe to Our Newsletter</h2>
<p>Get the latest updates on web automation!</p>
<input type="email" placeholder="Enter your email" class="input">
<button class="btn subscribe">Subscribe</button>
</div>
</div>
<!-- Header -->
<header class="site-header">
<nav class="nav-menu">
<a href="#home" class="nav-link active">Home</a>
<a href="#catalog" class="nav-link" id="catalog-link">Products</a>
<a href="#forms" class="nav-link">Forms</a>
<a href="#data-tables" class="nav-link">Data Tables</a>
<div class="dropdown">
<a href="#" class="nav-link dropdown-toggle">More ▼</a>
<div class="dropdown-content">
<a href="#tabs">Tabs Demo</a>
<a href="#accordion">FAQ</a>
<a href="#gallery">Gallery</a>
</div>
</div>
</nav>
<div class="auth-section">
<button class="btn btn-sm" id="inspector-btn" title="Toggle Inspector">🔍</button>
<button class="btn btn-sm" id="login-btn">Login</button>
<div class="user-info" id="user-info" style="display: none;">
<span class="user-avatar">👤</span>
<span class="welcome-message">Welcome, <span id="username-display">User</span>!</span>
<button class="btn btn-sm btn-secondary" id="logout-btn">Logout</button>
</div>
</div>
</header>
<!-- Main Content -->
<main class="main-content">
<!-- Home Section -->
<section id="home" class="section active">
<h1>Welcome to C4A-Script Playground</h1>
<p>This is an interactive demo for testing C4A-Script commands. Each section contains different challenges for web automation.</p>
<button class="btn btn-primary" id="start-tutorial">Start Tutorial</button>
<div class="feature-grid">
<div class="feature-card">
<h3>🔐 Authentication</h3>
<p>Test login forms and user sessions</p>
</div>
<div class="feature-card">
<h3>📜 Dynamic Content</h3>
<p>Infinite scroll and pagination</p>
</div>
<div class="feature-card">
<h3>📝 Forms</h3>
<p>Complex form interactions</p>
</div>
<div class="feature-card">
<h3>📊 Data Tables</h3>
<p>Sortable and filterable data</p>
</div>
</div>
</section>
<!-- Login Modal -->
<div class="modal" id="login-modal" style="display: none;">
<div class="modal-content login-form">
<span class="close">&times;</span>
<h2>Login</h2>
<form id="login-form">
<div class="form-group">
<label>Email</label>
<input type="email" id="email" class="input" placeholder="demo@example.com">
</div>
<div class="form-group">
<label>Password</label>
<input type="password" id="password" class="input" placeholder="demo123">
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="remember-me">
Remember me
</label>
</div>
<button type="submit" class="btn btn-primary">Login</button>
<div class="form-message" id="login-message"></div>
</form>
</div>
</div>
<!-- Product Catalog Section -->
<section id="catalog" class="section">
<h1>Product Catalog</h1>
<div class="view-toggle">
<button class="btn btn-sm active" id="infinite-scroll-btn">Infinite Scroll</button>
<button class="btn btn-sm" id="pagination-btn">Pagination</button>
</div>
<!-- Filters Sidebar -->
<div class="catalog-layout">
<aside class="filters-sidebar">
<h3>Filters</h3>
<div class="filter-group">
<h4 class="collapsible">Category <span class="toggle"></span></h4>
<div class="filter-content">
<label><input type="checkbox"> Electronics</label>
<label><input type="checkbox"> Clothing</label>
<label><input type="checkbox"> Books</label>
</div>
</div>
<div class="filter-group">
<h4 class="collapsible">Price Range <span class="toggle"></span></h4>
<div class="filter-content">
<input type="range" min="0" max="1000" value="500">
<span>$0 - $500</span>
</div>
</div>
</aside>
<!-- Products Grid -->
<div class="products-container">
<div class="product-grid" id="product-grid">
<!-- Products will be loaded here -->
</div>
<!-- Infinite Scroll View -->
<div id="infinite-scroll-view" class="view-mode">
<div class="loading-indicator" id="loading-indicator" style="display: none;">
<div class="spinner"></div>
<p>Loading more products...</p>
</div>
</div>
<!-- Pagination View -->
<div id="pagination-view" class="view-mode" style="display: none;">
<button class="btn load-more">Load More</button>
<div class="pagination">
<button class="page-btn">1</button>
<button class="page-btn">2</button>
<button class="page-btn">3</button>
</div>
</div>
</div>
</div>
</section>
<!-- Forms Section -->
<section id="forms" class="section">
<h1>Form Examples</h1>
<!-- Contact Form -->
<div class="form-card">
<h2>Contact Form</h2>
<form id="contact-form">
<div class="form-group">
<label>Name</label>
<input type="text" class="input" id="contact-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="contact-email">
</div>
<div class="form-group">
<label>Subject</label>
<select class="input" id="contact-subject">
<option value="">Select a subject</option>
<option value="support">Support</option>
<option value="sales">Sales</option>
<option value="feedback">Feedback</option>
</select>
</div>
<div class="form-group" id="department-group" style="display: none;">
<label>Department</label>
<select class="input" id="department">
<option value="">Select department</option>
</select>
</div>
<div class="form-group">
<label>Message</label>
<textarea class="input" id="contact-message" rows="4"></textarea>
</div>
<button type="submit" class="btn btn-primary">Send Message</button>
<div class="form-message" id="contact-message-display"></div>
</form>
</div>
<!-- Multi-step Form -->
<div class="form-card">
<h2>Multi-step Survey</h2>
<div class="progress-bar">
<div class="progress-fill" id="progress-fill" style="width: 33%"></div>
</div>
<form id="survey-form">
<!-- Step 1 -->
<div class="form-step active" data-step="1">
<h3>Step 1: Basic Information</h3>
<div class="form-group">
<label>Full Name</label>
<input type="text" class="input" id="full-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="survey-email">
</div>
<button type="button" class="btn next-step">Next</button>
</div>
<!-- Step 2 -->
<div class="form-step" data-step="2" style="display: none;">
<h3>Step 2: Preferences</h3>
<div class="form-group">
<label>Interests (select multiple)</label>
<select multiple class="input" id="interests">
<option value="tech">Technology</option>
<option value="sports">Sports</option>
<option value="music">Music</option>
<option value="travel">Travel</option>
</select>
</div>
<button type="button" class="btn prev-step">Previous</button>
<button type="button" class="btn next-step">Next</button>
</div>
<!-- Step 3 -->
<div class="form-step" data-step="3" style="display: none;">
<h3>Step 3: Confirmation</h3>
<p>Please review your information and submit.</p>
<button type="button" class="btn prev-step">Previous</button>
<button type="submit" class="btn btn-primary" id="submit-survey">Submit Survey</button>
</div>
</form>
<div class="form-message success-message" id="survey-success" style="display: none;">
✅ Survey submitted successfully!
</div>
</div>
</section>
<!-- Tabs Section -->
<section id="tabs" class="section">
<h1>Tabs Demo</h1>
<div class="tabs-container">
<div class="tabs-header">
<button class="tab-btn active" data-tab="description">Description</button>
<button class="tab-btn" data-tab="reviews">Reviews</button>
<button class="tab-btn" data-tab="specs">Specifications</button>
</div>
<div class="tabs-content">
<div class="tab-pane active" id="description">
<h3>Product Description</h3>
<p>This is a detailed description of the product...</p>
<div class="expandable-text">
<p class="text-preview">Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
<button class="btn btn-sm show-more">Show More</button>
<div class="hidden-text" style="display: none;">
<p>This is the hidden text that appears when you click "Show More". It contains additional details about the product that weren't visible initially.</p>
</div>
</div>
</div>
<div class="tab-pane" id="reviews" style="display: none;">
<h3>Customer Reviews</h3>
<button class="btn btn-sm load-comments">Load Comments</button>
<div class="comments-section" style="display: none;">
<!-- Comments will be loaded here -->
</div>
</div>
<div class="tab-pane" id="specs" style="display: none;">
<h3>Technical Specifications</h3>
<table class="specs-table">
<tr><td>Model</td><td>XYZ-2000</td></tr>
<tr><td>Weight</td><td>2.5 kg</td></tr>
<tr><td>Dimensions</td><td>30 x 20 x 10 cm</td></tr>
</table>
</div>
</div>
</div>
</section>
<!-- Data Tables Section -->
<section id="data-tables" class="section">
<h1>Data Tables</h1>
<div class="table-controls">
<input type="text" class="input search-input" placeholder="Search...">
<button class="btn btn-sm" id="export-btn">Export</button>
</div>
<table class="data-table" id="data-table">
<thead>
<tr>
<th class="sortable" data-sort="name">Name ↕</th>
<th class="sortable" data-sort="email">Email ↕</th>
<th class="sortable" data-sort="date">Date ↕</th>
<th>Actions</th>
</tr>
</thead>
<tbody id="table-body">
<!-- Table rows will be loaded here -->
</tbody>
</table>
<button class="btn load-more-rows">Load More Rows</button>
</section>
</main>
<script src="app.js"></script>
</body>
</html>

View File

@@ -0,0 +1,627 @@
/* Playground Styles - Modern Web App Theme */
:root {
--primary-color: #0fbbaa;
--secondary-color: #3f3f44;
--background-color: #ffffff;
--text-color: #333333;
--border-color: #e0e0e0;
--error-color: #ff3c74;
--success-color: #0fbbaa;
--warning-color: #ffa500;
}
* {
box-sizing: border-box;
}
body {
margin: 0;
padding: 0;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
font-size: 16px;
line-height: 1.6;
color: var(--text-color);
background-color: var(--background-color);
}
/* Cookie Banner */
.cookie-banner {
position: fixed;
bottom: 0;
left: 0;
right: 0;
background-color: #2c3e50;
color: white;
padding: 1rem;
z-index: 1000;
box-shadow: 0 -2px 10px rgba(0,0,0,0.1);
}
.cookie-content {
max-width: 1200px;
margin: 0 auto;
display: flex;
align-items: center;
justify-content: space-between;
flex-wrap: wrap;
gap: 1rem;
}
.cookie-actions {
display: flex;
gap: 0.5rem;
}
/* Header */
.site-header {
background-color: #fff;
border-bottom: 1px solid var(--border-color);
padding: 1rem 2rem;
position: sticky;
top: 0;
z-index: 100;
display: flex;
justify-content: space-between;
align-items: center;
}
.nav-menu {
display: flex;
gap: 2rem;
align-items: center;
}
.nav-link {
text-decoration: none;
color: var(--text-color);
font-weight: 500;
transition: color 0.2s;
}
.nav-link:hover,
.nav-link.active {
color: var(--primary-color);
}
/* Dropdown */
.dropdown {
position: relative;
}
.dropdown-content {
display: none;
position: absolute;
background-color: white;
min-width: 160px;
box-shadow: 0 8px 16px rgba(0,0,0,0.1);
z-index: 1;
border-radius: 4px;
top: 100%;
margin-top: 0.5rem;
}
.dropdown:hover .dropdown-content {
display: block;
}
.dropdown-content a {
color: var(--text-color);
padding: 0.75rem 1rem;
text-decoration: none;
display: block;
}
.dropdown-content a:hover {
background-color: #f5f5f5;
}
/* Auth Section */
.auth-section {
display: flex;
align-items: center;
gap: 1rem;
}
.user-info {
display: flex;
align-items: center;
gap: 0.5rem;
}
.user-avatar {
font-size: 1.5rem;
}
/* Main Content */
.main-content {
padding: 2rem;
max-width: 1200px;
margin: 0 auto;
}
.section {
display: none;
}
.section.active {
display: block;
}
/* Buttons */
.btn {
background-color: var(--primary-color);
color: white;
border: none;
padding: 0.5rem 1rem;
border-radius: 4px;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
transition: all 0.2s;
}
.btn:hover {
background-color: #0aa599;
transform: translateY(-1px);
}
.btn-sm {
padding: 0.25rem 0.75rem;
font-size: 0.875rem;
}
.btn-secondary {
background-color: var(--secondary-color);
}
.btn-secondary:hover {
background-color: #333;
}
.btn-primary {
background-color: var(--primary-color);
}
/* Feature Grid */
.feature-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 1.5rem;
margin-top: 2rem;
}
.feature-card {
background-color: #f8f9fa;
padding: 1.5rem;
border-radius: 8px;
text-align: center;
transition: transform 0.2s;
}
.feature-card:hover {
transform: translateY(-4px);
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}
.feature-card h3 {
margin-top: 0;
}
/* Modal */
.modal {
position: fixed;
z-index: 1000;
left: 0;
top: 0;
width: 100%;
height: 100%;
background-color: rgba(0,0,0,0.5);
display: flex;
align-items: center;
justify-content: center;
}
.modal-content {
background-color: white;
padding: 2rem;
border-radius: 8px;
max-width: 500px;
width: 90%;
position: relative;
animation: modalFadeIn 0.3s;
}
@keyframes modalFadeIn {
from { opacity: 0; transform: translateY(-20px); }
to { opacity: 1; transform: translateY(0); }
}
.close {
position: absolute;
right: 1rem;
top: 1rem;
font-size: 1.5rem;
cursor: pointer;
color: #999;
}
.close:hover {
color: #333;
}
/* Forms */
.form-group {
margin-bottom: 1rem;
}
.form-group label {
display: block;
margin-bottom: 0.5rem;
font-weight: 500;
}
.input {
width: 100%;
padding: 0.5rem;
border: 1px solid var(--border-color);
border-radius: 4px;
font-size: 1rem;
}
.input:focus {
outline: none;
border-color: var(--primary-color);
}
.checkbox-label {
display: flex;
align-items: center;
gap: 0.5rem;
}
.form-message {
margin-top: 1rem;
padding: 0.75rem;
border-radius: 4px;
display: none;
}
.form-message.error {
background-color: #ffe6e6;
color: var(--error-color);
display: block;
}
.form-message.success {
background-color: #e6fff6;
color: var(--success-color);
display: block;
}
/* Product Catalog */
.view-toggle {
margin-bottom: 1rem;
}
.catalog-layout {
display: grid;
grid-template-columns: 250px 1fr;
gap: 2rem;
}
.filters-sidebar {
background-color: #f8f9fa;
padding: 1rem;
border-radius: 8px;
}
.filter-group {
margin-bottom: 1.5rem;
}
.collapsible {
cursor: pointer;
display: flex;
justify-content: space-between;
align-items: center;
}
.filter-content {
margin-top: 0.5rem;
}
.filter-content label {
display: block;
margin-bottom: 0.5rem;
}
/* Product Grid */
.product-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
gap: 1.5rem;
}
.product-card {
background-color: white;
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 1rem;
text-align: center;
transition: transform 0.2s;
}
.product-card:hover {
transform: translateY(-4px);
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}
.product-image {
width: 100%;
height: 150px;
background-color: #f0f0f0;
margin-bottom: 1rem;
display: flex;
align-items: center;
justify-content: center;
font-size: 3rem;
}
.product-name {
font-weight: 600;
margin-bottom: 0.5rem;
}
.product-price {
color: var(--primary-color);
font-size: 1.2rem;
font-weight: 700;
}
/* Loading Indicator */
.loading-indicator {
text-align: center;
padding: 2rem;
}
.spinner {
border: 3px solid #f3f3f3;
border-top: 3px solid var(--primary-color);
border-radius: 50%;
width: 40px;
height: 40px;
animation: spin 1s linear infinite;
margin: 0 auto;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
/* Pagination */
.pagination {
display: flex;
gap: 0.5rem;
justify-content: center;
margin-top: 2rem;
}
.page-btn {
padding: 0.5rem 1rem;
border: 1px solid var(--border-color);
background-color: white;
cursor: pointer;
border-radius: 4px;
}
.page-btn:hover,
.page-btn.active {
background-color: var(--primary-color);
color: white;
}
/* Multi-step Form */
.progress-bar {
width: 100%;
height: 8px;
background-color: #e0e0e0;
border-radius: 4px;
margin-bottom: 2rem;
}
.progress-fill {
height: 100%;
background-color: var(--primary-color);
border-radius: 4px;
transition: width 0.3s;
}
.form-step {
display: none;
}
.form-step.active {
display: block;
}
/* Tabs */
.tabs-container {
margin-top: 2rem;
}
.tabs-header {
display: flex;
border-bottom: 2px solid var(--border-color);
}
.tab-btn {
background: none;
border: none;
padding: 1rem 2rem;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
color: var(--text-color);
position: relative;
}
.tab-btn:hover {
color: var(--primary-color);
}
.tab-btn.active {
color: var(--primary-color);
}
.tab-btn.active::after {
content: '';
position: absolute;
bottom: -2px;
left: 0;
right: 0;
height: 2px;
background-color: var(--primary-color);
}
.tabs-content {
padding: 2rem 0;
}
.tab-pane {
display: none;
}
.tab-pane.active {
display: block;
}
/* Expandable Text */
.expandable-text {
margin-top: 1rem;
}
.text-preview {
margin-bottom: 0.5rem;
}
.show-more {
margin-top: 0.5rem;
}
/* Comments Section */
.comments-section {
margin-top: 1rem;
}
.comment {
background-color: #f8f9fa;
padding: 1rem;
border-radius: 4px;
margin-bottom: 1rem;
}
.comment-author {
font-weight: 600;
margin-bottom: 0.5rem;
}
/* Data Table */
.table-controls {
display: flex;
gap: 1rem;
margin-bottom: 1rem;
}
.search-input {
flex: 1;
max-width: 300px;
}
.data-table {
width: 100%;
border-collapse: collapse;
background-color: white;
}
.data-table th,
.data-table td {
padding: 0.75rem;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
.data-table th {
background-color: #f8f9fa;
font-weight: 600;
}
.sortable {
cursor: pointer;
}
.sortable:hover {
color: var(--primary-color);
}
/* Form Cards */
.form-card {
background-color: white;
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 2rem;
margin-bottom: 2rem;
}
.form-card h2 {
margin-top: 0;
}
/* Success Message */
.success-message {
background-color: #e6fff6;
color: var(--success-color);
padding: 1rem;
border-radius: 4px;
text-align: center;
font-weight: 500;
}
/* Load More Button */
.load-more,
.load-more-rows {
display: block;
margin: 2rem auto;
}
/* Responsive */
@media (max-width: 768px) {
.catalog-layout {
grid-template-columns: 1fr;
}
.feature-grid {
grid-template-columns: 1fr;
}
.nav-menu {
flex-wrap: wrap;
gap: 1rem;
}
.cookie-content {
flex-direction: column;
text-align: center;
}
}
/* Inspector Mode */
#inspector-btn.active {
background: var(--primary-color) !important;
color: var(--background-color) !important;
}
.inspector-tooltip {
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3);
border: 1px solid rgba(255, 255, 255, 0.1);
}

View File

@@ -0,0 +1,2 @@
flask>=2.3.0
flask-cors>=4.0.0

View File

@@ -0,0 +1,18 @@
# Basic Page Interaction
# This script demonstrates basic C4A commands
# Navigate to the playground
GO http://127.0.0.1:8080/playground/
# Wait for page to load
WAIT `body` 2
# Handle cookie banner if present
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
# Close newsletter popup if it appears
WAIT 3
IF (EXISTS `#newsletter-popup`) THEN CLICK `.close`
# Click the start tutorial button
CLICK `#start-tutorial`

View File

@@ -0,0 +1,27 @@
# Complete Login Flow
# Demonstrates form interaction and authentication
# Click login button
CLICK `#login-btn`
# Wait for login modal
WAIT `.login-form` 3
# Fill in credentials
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
# Check remember me
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`
# Submit form
CLICK `button[type="submit"]`
# Wait for success
WAIT `.welcome-message` 5
# Verify login succeeded
IF (EXISTS `.user-info`) THEN EVAL `console.log('✅ Login successful!')`

View File

@@ -0,0 +1,32 @@
# Infinite Scroll Product Loading
# Load all products using scroll automation
# Navigate to catalog
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Switch to infinite scroll mode
CLICK `#infinite-scroll-btn`
# Define scroll procedure
PROC load_more_products
# Get current product count
EVAL `window.initialCount = document.querySelectorAll('.product-card').length`
# Scroll down
SCROLL DOWN 1000
WAIT 2
# Check if more products loaded
EVAL `
const newCount = document.querySelectorAll('.product-card').length;
console.log('Products loaded: ' + newCount);
window.moreLoaded = newCount > window.initialCount;
`
ENDPROC
# Load products until no more
REPEAT (load_more_products, `window.moreLoaded !== false`)
# Final count
EVAL `console.log('✅ Total products: ' + document.querySelectorAll('.product-card').length)`

View File

@@ -0,0 +1,41 @@
# Multi-step Form Wizard
# Complete a complex form with multiple steps
# Navigate to forms section
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2
# Step 1: Basic Information
CLICK `#full-name`
TYPE "John Doe"
CLICK `#survey-email`
TYPE "john.doe@example.com"
# Go to next step
CLICK `.next-step`
WAIT 1
# Step 2: Select Interests
# Select multiple options
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `option[value="travel"]`
# Continue to final step
CLICK `.next-step`
WAIT 1
# Step 3: Review and Submit
# Verify we're on the last step
IF (EXISTS `#submit-survey`) THEN EVAL `console.log('📋 On final step')`
# Submit the form
CLICK `#submit-survey`
# Wait for success message
WAIT `.success-message` 5
# Verify submission
IF (EXISTS `.success-message`) THEN EVAL `console.log('✅ Survey submitted successfully!')`

View File

@@ -0,0 +1,82 @@
# Complete E-commerce Workflow
# Login, browse products, and interact with various elements
# Define reusable procedures
PROC handle_popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `#newsletter-popup`) THEN CLICK `.close`
ENDPROC
PROC login_user
CLICK `#login-btn`
WAIT `.login-form` 2
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
CLICK `button[type="submit"]`
WAIT `.welcome-message` 5
ENDPROC
PROC browse_products
# Go to catalog
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Apply filters
CLICK `.collapsible`
WAIT 0.5
CLICK `input[type="checkbox"]`
# Load some products
SCROLL DOWN 500
WAIT 1
SCROLL DOWN 500
WAIT 1
ENDPROC
# Main workflow
GO http://127.0.0.1:8080/playground/
WAIT `body` 2
# Handle initial popups
handle_popups
# Login if not already
IF (NOT EXISTS `.user-info`) THEN login_user
# Browse products
browse_products
# Navigate to tabs demo
CLICK `a[href="#tabs"]`
WAIT `.tabs-container` 2
# Interact with tabs
CLICK `button[data-tab="reviews"]`
WAIT 1
# Load comments
IF (EXISTS `.load-comments`) THEN CLICK `.load-comments`
WAIT `.comments-section` 2
# Check specifications
CLICK `button[data-tab="specs"]`
WAIT 1
# Final navigation to data tables
CLICK `a[href="#data"]`
WAIT `.data-table` 2
# Search in table
CLICK `.search-input`
TYPE "User"
# Load more rows
CLICK `.load-more-rows`
WAIT 1
# Export data
CLICK `#export-btn`
EVAL `console.log('✅ Workflow completed successfully!')`

View File

@@ -0,0 +1,304 @@
#!/usr/bin/env python3
"""
C4A-Script Tutorial Server
Serves the tutorial app and provides C4A compilation API
"""
import sys
import os
from pathlib import Path
from flask import Flask, render_template_string, request, jsonify, send_from_directory
from flask_cors import CORS
# Add parent directories to path to import crawl4ai
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent.parent))
try:
from crawl4ai.script import compile as c4a_compile
C4A_AVAILABLE = True
except ImportError:
print("⚠️ C4A compiler not available. Using mock compiler.")
C4A_AVAILABLE = False
app = Flask(__name__)
CORS(app)
# Serve static files
@app.route('/')
def index():
return send_from_directory('.', 'index.html')
@app.route('/assets/<path:path>')
def serve_assets(path):
return send_from_directory('assets', path)
@app.route('/playground/')
def playground():
return send_from_directory('playground', 'index.html')
@app.route('/playground/<path:path>')
def serve_playground(path):
return send_from_directory('playground', path)
# API endpoint for C4A compilation
@app.route('/api/compile', methods=['POST'])
def compile_endpoint():
try:
data = request.get_json()
script = data.get('script', '')
if not script:
return jsonify({
'success': False,
'error': {
'line': 1,
'column': 1,
'message': 'No script provided',
'suggestion': 'Write some C4A commands'
}
})
if C4A_AVAILABLE:
# Use real C4A compiler
result = c4a_compile(script)
if result.success:
return jsonify({
'success': True,
'jsCode': result.js_code,
'metadata': {
'lineCount': len(result.js_code),
'sourceLines': len(script.split('\n'))
}
})
else:
error = result.first_error
return jsonify({
'success': False,
'error': {
'line': error.line,
'column': error.column,
'message': error.message,
'suggestion': error.suggestions[0].message if error.suggestions else None,
'code': error.code,
'sourceLine': error.source_line
}
})
else:
# Use mock compiler for demo
result = mock_compile(script)
return jsonify(result)
except Exception as e:
return jsonify({
'success': False,
'error': {
'line': 1,
'column': 1,
'message': f'Server error: {str(e)}',
'suggestion': 'Check server logs'
}
}), 500
def mock_compile(script):
"""Simple mock compiler for demo when C4A is not available"""
lines = [line for line in script.split('\n') if line.strip() and not line.strip().startswith('#')]
js_code = []
for i, line in enumerate(lines):
line = line.strip()
try:
if line.startswith('GO '):
url = line[3:].strip()
# Handle relative URLs
if not url.startswith(('http://', 'https://')):
url = '/' + url.lstrip('/')
js_code.append(f"await page.goto('{url}');")
elif line.startswith('WAIT '):
    parts = line[5:].strip().split(' ')
    if parts[0].startswith('`'):
        selector = parts[0].strip('`')
        # Convert the timeout (seconds) to milliseconds
        timeout_ms = int(float(parts[1]) * 1000) if len(parts) > 1 else 5000
        js_code.append(f"await page.waitForSelector('{selector}', {{ timeout: {timeout_ms} }});")
    else:
        # Convert seconds to milliseconds so fractional waits like WAIT 0.5 work
        ms = int(float(parts[0]) * 1000)
        js_code.append(f"await page.waitForTimeout({ms});")
elif line.startswith('CLICK '):
selector = line[6:].strip().strip('`')
js_code.append(f"await page.click('{selector}');")
elif line.startswith('TYPE '):
text = line[5:].strip().strip('"')
js_code.append(f"await page.keyboard.type('{text}');")
elif line.startswith('SCROLL '):
parts = line[7:].strip().split(' ')
direction = parts[0]
amount = parts[1] if len(parts) > 1 else '500'
if direction == 'DOWN':
js_code.append(f"await page.evaluate(() => window.scrollBy(0, {amount}));")
elif direction == 'UP':
js_code.append(f"await page.evaluate(() => window.scrollBy(0, -{amount}));")
elif line.startswith('IF '):
if 'THEN' not in line:
return {
'success': False,
'error': {
'line': i + 1,
'column': len(line),
'message': "Missing 'THEN' keyword after IF condition",
'suggestion': "Add 'THEN' after the condition",
'sourceLine': line
}
}
condition = line[3:line.index('THEN')].strip()
action = line[line.index('THEN') + 4:].strip()
if 'EXISTS' in condition:
selector_match = condition.split('`')
if len(selector_match) >= 2:
selector = selector_match[1]
action_selector = action.split('`')[1] if '`' in action else ''
js_code.append(
    f"if ((await page.$$('{selector}')).length > 0) {{ "
    f"await page.click('{action_selector}'); }}"
)
elif line.startswith('PRESS '):
key = line[6:].strip()
js_code.append(f"await page.keyboard.press('{key}');")
else:
# Unknown command
return {
'success': False,
'error': {
'line': i + 1,
'column': 1,
'message': f"Unknown command: {line.split()[0]}",
'suggestion': "Check command syntax",
'sourceLine': line
}
}
except Exception as e:
return {
'success': False,
'error': {
'line': i + 1,
'column': 1,
'message': f"Failed to parse: {str(e)}",
'suggestion': "Check syntax",
'sourceLine': line
}
}
return {
'success': True,
'jsCode': js_code,
'metadata': {
'lineCount': len(js_code),
'sourceLines': len(lines)
}
}
# Example scripts endpoint
@app.route('/api/examples')
def get_examples():
examples = [
{
'id': 'cookie-banner',
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `#newsletter-popup`) THEN CLICK `.close`'''
},
{
'id': 'login',
'name': 'Login Flow',
'description': 'Complete login with credentials',
'script': '''# Login to the site
CLICK `#login-btn`
WAIT `.login-form` 2
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`
CLICK `button[type="submit"]`
WAIT `.welcome-message` 5'''
},
{
'id': 'infinite-scroll',
'name': 'Infinite Scroll',
'description': 'Load products with scrolling',
'script': '''# Navigate to catalog and scroll
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Scroll multiple times to load products
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000'''
},
{
'id': 'form-wizard',
'name': 'Multi-step Form',
'description': 'Complete a multi-step survey',
'script': '''# Navigate to forms
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2
# Step 1: Basic info
CLICK `#full-name`
TYPE "John Doe"
CLICK `#survey-email`
TYPE "john@example.com"
CLICK `.next-step`
WAIT 1
# Step 2: Preferences
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `.next-step`
WAIT 1
# Step 3: Submit
CLICK `#submit-survey`
WAIT `.success-message` 5'''
}
]
return jsonify(examples)
if __name__ == '__main__':
port = int(os.environ.get('PORT', 8080))
print(f"""
╔══════════════════════════════════════════════════════════╗
║ C4A-Script Interactive Tutorial Server ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ Server running at: http://localhost:{port:<6}
║ ║
║ Features: ║
║ • C4A-Script compilation API ║
║ • Interactive playground ║
║ • Real-time execution visualization ║
║ ║
║ C4A Compiler: {'✓ Available' if C4A_AVAILABLE else '✗ Using mock compiler':<30}
║ ║
╚══════════════════════════════════════════════════════════╝
""")
app.run(host='0.0.0.0', port=port, debug=True)

View File

@@ -0,0 +1,69 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Blockly Test</title>
<style>
body {
margin: 0;
padding: 20px;
background: #0e0e10;
color: #e0e0e0;
font-family: monospace;
}
#blocklyDiv {
height: 600px;
width: 100%;
border: 1px solid #2a2a2c;
}
#output {
margin-top: 20px;
padding: 15px;
background: #1a1a1b;
border: 1px solid #2a2a2c;
white-space: pre-wrap;
}
</style>
</head>
<body>
<h1>C4A-Script Blockly Test</h1>
<div id="blocklyDiv"></div>
<div id="output">
<h3>Generated C4A-Script:</h3>
<pre id="code-output"></pre>
</div>
<script src="https://unpkg.com/blockly/blockly.min.js"></script>
<script src="assets/c4a-blocks.js"></script>
<script>
// Simple test
const workspace = Blockly.inject('blocklyDiv', {
toolbox: `
<xml>
<category name="Test" colour="#1E88E5">
<block type="c4a_go"></block>
<block type="c4a_wait_time"></block>
<block type="c4a_click"></block>
</category>
</xml>
`,
theme: Blockly.Theme.defineTheme('dark', {
'base': Blockly.Themes.Classic,
'componentStyles': {
'workspaceBackgroundColour': '#0e0e10',
'toolboxBackgroundColour': '#1a1a1b',
'toolboxForegroundColour': '#e0e0e0',
'flyoutBackgroundColour': '#1a1a1b',
'flyoutForegroundColour': '#e0e0e0',
}
})
});
workspace.addChangeListener((event) => {
const code = Blockly.JavaScript.workspaceToCode(workspace);
document.getElementById('code-output').textContent = code;
});
</script>
</body>
</html>

View File

@@ -0,0 +1,124 @@
# Crawl4AI Chrome Extension
Visual extraction tools for Crawl4AI - Click to extract data and content from any webpage!
## 🚀 Features
- **Click2Crawl**: Click on elements to build data extraction schemas instantly
- **Markdown Extraction**: Select elements and export as clean markdown
- **Script Builder (Alpha)**: Record browser actions to create automation scripts
- **Smart Element Selection**: Container and field selection with visual feedback
- **Code Generation**: Generates complete Python code for Crawl4AI
- **Beautiful Dark UI**: Consistent with Crawl4AI's design language
## 📦 Installation
### Method 1: Load Unpacked Extension (Recommended for Development)
1. Open Chrome and navigate to `chrome://extensions/`
2. Enable "Developer mode" in the top right corner
3. Click "Load unpacked"
4. Select the `crawl4ai-assistant` folder
5. The extension icon (🚀🤖) will appear in your toolbar
### Method 2: Generate Icons First
If you want proper icons:
1. Open `icons/generate_icons.html` in your browser
2. Right-click each canvas and save as:
- `icon-16.png`
- `icon-48.png`
- `icon-128.png`
3. Then follow Method 1 above
## 🎯 How to Use
### Using Click2Crawl
1. **Navigate to any website** you want to extract data from
2. **Click the Crawl4AI extension icon** in your toolbar
3. **Click "Click2Crawl"** to start the capture mode
4. **Select a container element**:
- Hover over elements (they'll highlight in blue)
- Click on a repeating container (e.g., product card, article block)
5. **Select fields within the container**:
- Elements will now highlight in green
- Click on each piece of data you want to extract
- Name each field (e.g., "title", "price", "description")
6. **Test and Export**:
- Click "Test Schema" to see extracted data instantly
- Export as Python code, JSON schema, or markdown
### Running the Generated Code
The downloaded Python file contains:
```python
# 1. The HTML snippet of your selected container
HTML_SNIPPET = """..."""
# 2. The extraction query based on your selections
EXTRACTION_QUERY = """..."""
# 3. Functions to generate and test the schema
async def generate_schema():
    """Generate the extraction schema using an LLM."""
    ...

async def test_extraction():
    """Test the schema on the actual website."""
    ...
```
To use it:
1. Install Crawl4AI: `pip install crawl4ai`
2. Run the script: `python crawl4ai_schema_*.py`
3. The script will generate a `generated_schema.json` file
4. Use this schema in your Crawl4AI projects!
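Here is a minimal usage sketch, assuming the generated schema follows Crawl4AI's `JsonCssExtractionStrategy` format; the URL below is a placeholder for the page you captured:
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Load the schema produced by the downloaded script
    with open("generated_schema.json") as f:
        schema = json.load(f)

    strategy = JsonCssExtractionStrategy(schema)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",  # placeholder: use the page you captured
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(json.loads(result.extracted_content))

asyncio.run(main())
```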
## 🎨 Visual Feedback
- **Blue dashed outline**: Container selection mode
- **Green dashed outline**: Field selection mode
- **Solid blue outline**: Selected container
- **Solid green outline**: Selected fields
- **Floating toolbar**: Shows current mode and selection status
## ⌨️ Keyboard Shortcuts
- **ESC**: Cancel current capture session
## 🔧 Technical Details
- Built with Manifest V3 for security and performance
- Pure client-side - no data sent to external servers
- Generates code that uses Crawl4AI's LLM integration
- Smart selector generation prioritizes stable attributes
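To illustrate that last point, here is a hypothetical sketch of stable-attribute selector generation (shown in Python for brevity; the extension implements this logic in JavaScript, and its exact heuristics may differ):
```python
def build_selector(tag, attrs):
    """Prefer stable attributes (id, data-*) over volatile class names."""
    if attrs.get("id"):
        return f"#{attrs['id']}"
    # data-* attributes rarely change between deployments
    for name, value in attrs.items():
        if name.startswith("data-"):
            return f'{tag}[{name}="{value}"]'
    # Fall back to classes, skipping framework-generated hashes
    stable = [c for c in attrs.get("class", "").split() if not any(ch.isdigit() for ch in c)]
    if stable:
        return tag + "." + ".".join(stable)
    return tag

print(build_selector("div", {"class": "product-card css-1x2y3z", "data-sku": "A42"}))
# -> div[data-sku="A42"]
```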
## 🐛 Troubleshooting
### Extension doesn't load
- Make sure you're in Developer Mode
- Check the console for any errors
- Ensure all files are in the correct directories
### Can't select elements
- Some websites may block extensions
- Try refreshing the page
- Make sure you clicked "Click2Crawl" first
### Generated code doesn't work
- Ensure you have Crawl4AI installed
- Check that you have an LLM API key configured
- Make sure the website structure hasn't changed
## 🤝 Contributing
This extension is part of the Crawl4AI project. Contributions are welcome!
- Report issues: [GitHub Issues](https://github.com/unclecode/crawl4ai/issues)
- Join discussion: [Discord](https://discord.gg/crawl4ai)
## 📄 License
Same as Crawl4AI - see main project for details.

File diff suppressed because it is too large

View File

@@ -0,0 +1,39 @@
// Service worker for Crawl4AI Assistant
// Handle messages from content script
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.action === 'downloadCode' || message.action === 'downloadScript') {
try {
// Create a data URL for the Python code
const dataUrl = 'data:text/plain;charset=utf-8,' + encodeURIComponent(message.code);
// Download the file
chrome.downloads.download({
url: dataUrl,
filename: message.filename || 'crawl4ai_schema.py',
saveAs: true
}, (downloadId) => {
if (chrome.runtime.lastError) {
console.error('Download failed:', chrome.runtime.lastError);
sendResponse({ success: false, error: chrome.runtime.lastError.message });
} else {
console.log('Download started with ID:', downloadId);
sendResponse({ success: true, downloadId: downloadId });
}
});
} catch (error) {
console.error('Error creating download:', error);
sendResponse({ success: false, error: error.message });
}
return true; // Keep the message channel open for async response
}
return false;
});
// Clean up on extension install/update
chrome.runtime.onInstalled.addListener(() => {
// Clear any stored state
chrome.storage.local.clear();
});

File diff suppressed because it is too large

View File

@@ -0,0 +1,78 @@
// Main content script for Crawl4AI Assistant
// Coordinates between Click2Crawl, ScriptBuilder, and MarkdownExtraction
let activeBuilder = null;
// Listen for messages from popup
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
if (request.action === 'startCapture') {
if (activeBuilder) {
console.log('Stopping existing capture session');
activeBuilder.stop();
activeBuilder = null;
}
if (request.mode === 'schema') {
console.log('Starting Click2Crawl');
activeBuilder = new Click2Crawl();
activeBuilder.start();
} else if (request.mode === 'script') {
console.log('Starting Script Builder');
activeBuilder = new ScriptBuilder();
activeBuilder.start();
}
sendResponse({ success: true });
} else if (request.action === 'stopCapture') {
if (activeBuilder) {
activeBuilder.stop();
activeBuilder = null;
}
sendResponse({ success: true });
} else if (request.action === 'startSchemaCapture') {
if (activeBuilder) {
activeBuilder.deactivate?.();
activeBuilder = null;
}
console.log('Starting Click2Crawl');
activeBuilder = new Click2Crawl();
activeBuilder.start();
sendResponse({ success: true });
} else if (request.action === 'startScriptCapture') {
if (activeBuilder) {
activeBuilder.deactivate?.();
activeBuilder = null;
}
console.log('Starting Script Builder');
activeBuilder = new ScriptBuilder();
activeBuilder.start();
sendResponse({ success: true });
} else if (request.action === 'startClick2Crawl') {
if (activeBuilder) {
activeBuilder.deactivate?.();
activeBuilder = null;
}
console.log('Starting Markdown Extraction');
activeBuilder = new MarkdownExtraction();
sendResponse({ success: true });
} else if (request.action === 'generateCode') {
if (activeBuilder && activeBuilder.generateCode) {
activeBuilder.generateCode();
}
sendResponse({ success: true });
}
});
// Cleanup on page unload
window.addEventListener('beforeunload', () => {
if (activeBuilder) {
if (activeBuilder.deactivate) {
activeBuilder.deactivate();
} else if (activeBuilder.stop) {
activeBuilder.stop();
}
activeBuilder = null;
}
});
console.log('Crawl4AI Assistant content script loaded');

View File

@@ -0,0 +1,623 @@
class ContentAnalyzer {
constructor() {
this.patterns = {
article: ['article', 'main', 'content', 'post', 'entry'],
navigation: ['nav', 'menu', 'navigation', 'breadcrumb'],
sidebar: ['sidebar', 'aside', 'widget'],
header: ['header', 'masthead', 'banner'],
footer: ['footer', 'copyright', 'contact'],
list: ['list', 'items', 'results', 'products', 'cards'],
table: ['table', 'grid', 'data'],
media: ['gallery', 'carousel', 'slideshow', 'video', 'media']
};
}
async analyze(elements) {
const analysis = {
structure: await this.analyzeStructure(elements),
contentType: this.identifyContentType(elements),
hierarchy: this.buildHierarchy(elements),
mediaAssets: this.collectMediaAssets(elements),
textDensity: this.calculateTextDensity(elements),
semanticRegions: this.identifySemanticRegions(elements),
relationships: this.analyzeRelationships(elements),
metadata: this.extractMetadata(elements)
};
return analysis;
}
analyzeStructure(elements) {
const structure = {
hasHeadings: false,
hasLists: false,
hasTables: false,
hasMedia: false,
hasCode: false,
hasLinks: false,
layout: 'linear', // linear, grid, mixed
depth: 0,
elementTypes: new Map()
};
// Analyze each element
for (const element of elements) {
this.analyzeElementStructure(element, structure);
}
// Determine layout type
structure.layout = this.determineLayout(elements);
// Calculate max depth
structure.depth = this.calculateMaxDepth(elements);
return structure;
}
analyzeElementStructure(element, structure, visited = new Set()) {
if (visited.has(element)) return;
visited.add(element);
const tagName = element.tagName;
// Update element type count
structure.elementTypes.set(
tagName,
(structure.elementTypes.get(tagName) || 0) + 1
);
// Check for specific structures
if (/^H[1-6]$/.test(tagName)) {
structure.hasHeadings = true;
} else if (['UL', 'OL', 'DL'].includes(tagName)) {
structure.hasLists = true;
} else if (tagName === 'TABLE') {
structure.hasTables = true;
} else if (['IMG', 'VIDEO', 'IFRAME', 'PICTURE'].includes(tagName)) {
structure.hasMedia = true;
} else if (['CODE', 'PRE'].includes(tagName)) {
structure.hasCode = true;
} else if (tagName === 'A') {
structure.hasLinks = true;
}
// Analyze children
for (const child of element.children) {
this.analyzeElementStructure(child, structure, visited);
}
}
identifyContentType(elements) {
const scores = {
article: 0,
list: 0,
table: 0,
form: 0,
media: 0,
mixed: 0
};
for (const element of elements) {
// Score based on element types and classes
const tagName = element.tagName;
const className = element.className.toLowerCase();
const id = element.id.toLowerCase();
// Check for article patterns
if (tagName === 'ARTICLE' ||
this.matchesPattern(className + ' ' + id, this.patterns.article)) {
scores.article += 10;
}
// Check for list patterns
if (['UL', 'OL'].includes(tagName) ||
this.matchesPattern(className, this.patterns.list)) {
scores.list += 5;
}
// Check for table
if (tagName === 'TABLE') {
scores.table += 10;
}
// Check for form
if (tagName === 'FORM' || element.querySelector('input, select, textarea')) {
scores.form += 5;
}
// Check for media gallery
if (this.matchesPattern(className, this.patterns.media) ||
element.querySelectorAll('img, video').length > 3) {
scores.media += 5;
}
}
// Determine primary content type
const maxScore = Math.max(...Object.values(scores));
if (maxScore === 0) return 'unknown';
for (const [type, score] of Object.entries(scores)) {
if (score === maxScore) {
return type;
}
}
return 'mixed';
}
buildHierarchy(elements) {
const hierarchy = {
root: null,
levels: [],
headingStructure: []
};
// Find common ancestor
if (elements.length > 0) {
hierarchy.root = this.findCommonAncestor(elements);
}
// Build heading hierarchy
const headings = [];
for (const element of elements) {
const foundHeadings = element.querySelectorAll('h1, h2, h3, h4, h5, h6');
headings.push(...Array.from(foundHeadings));
}
// Sort headings by document position
headings.sort((a, b) => {
const position = a.compareDocumentPosition(b);
if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
return -1;
} else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
return 1;
}
return 0;
});
// Build heading structure
let currentLevel = 0;
const stack = [];
for (const heading of headings) {
const level = parseInt(heading.tagName.substring(1));
const item = {
level,
text: heading.textContent.trim(),
element: heading,
children: []
};
// Find parent in stack
while (stack.length > 0 && stack[stack.length - 1].level >= level) {
stack.pop();
}
if (stack.length > 0) {
stack[stack.length - 1].children.push(item);
} else {
hierarchy.headingStructure.push(item);
}
stack.push(item);
}
return hierarchy;
}
collectMediaAssets(elements) {
const media = {
images: [],
videos: [],
iframes: [],
audio: []
};
for (const element of elements) {
// Collect images
const images = element.querySelectorAll('img');
for (const img of images) {
media.images.push({
src: img.src,
alt: img.alt,
title: img.title,
width: img.width,
height: img.height,
element: img
});
}
// Collect videos
const videos = element.querySelectorAll('video');
for (const video of videos) {
media.videos.push({
src: video.src,
poster: video.poster,
width: video.width,
height: video.height,
element: video
});
}
// Collect iframes
const iframes = element.querySelectorAll('iframe');
for (const iframe of iframes) {
media.iframes.push({
src: iframe.src,
width: iframe.width,
height: iframe.height,
title: iframe.title,
element: iframe
});
}
// Collect audio
const audios = element.querySelectorAll('audio');
for (const audio of audios) {
media.audio.push({
src: audio.src,
element: audio
});
}
}
return media;
}
calculateTextDensity(elements) {
let totalText = 0;
let totalElements = 0;
let linkText = 0;
let codeText = 0;
for (const element of elements) {
const stats = this.getTextStats(element);
totalText += stats.textLength;
totalElements += stats.elementCount;
linkText += stats.linkTextLength;
codeText += stats.codeTextLength;
}
return {
textLength: totalText,
elementCount: totalElements,
averageTextPerElement: totalElements > 0 ? totalText / totalElements : 0,
linkDensity: totalText > 0 ? linkText / totalText : 0,
codeDensity: totalText > 0 ? codeText / totalText : 0
};
}
getTextStats(element, visited = new Set()) {
if (visited.has(element)) {
return { textLength: 0, elementCount: 0, linkTextLength: 0, codeTextLength: 0 };
}
visited.add(element);
let stats = {
textLength: 0,
elementCount: 1,
linkTextLength: 0,
codeTextLength: 0
};
// Get direct text content
for (const node of element.childNodes) {
if (node.nodeType === Node.TEXT_NODE) {
const text = node.textContent.trim();
stats.textLength += text.length;
// Check if this text is within a link
if (element.tagName === 'A') {
stats.linkTextLength += text.length;
}
// Check if this text is within code
if (['CODE', 'PRE'].includes(element.tagName)) {
stats.codeTextLength += text.length;
}
}
}
// Process children
for (const child of element.children) {
const childStats = this.getTextStats(child, visited);
stats.textLength += childStats.textLength;
stats.elementCount += childStats.elementCount;
stats.linkTextLength += childStats.linkTextLength;
stats.codeTextLength += childStats.codeTextLength;
}
return stats;
}
identifySemanticRegions(elements) {
const regions = {
headers: [],
navigation: [],
main: [],
sidebars: [],
footers: [],
articles: []
};
for (const element of elements) {
// Check element and its ancestors for semantic regions
let current = element;
while (current) {
const tagName = current.tagName;
const className = current.className.toLowerCase();
const role = current.getAttribute('role');
// Check semantic HTML5 elements
if (tagName === 'HEADER' || role === 'banner') {
regions.headers.push(current);
} else if (tagName === 'NAV' || role === 'navigation') {
regions.navigation.push(current);
} else if (tagName === 'MAIN' || role === 'main') {
regions.main.push(current);
} else if (tagName === 'ASIDE' || role === 'complementary') {
regions.sidebars.push(current);
} else if (tagName === 'FOOTER' || role === 'contentinfo') {
regions.footers.push(current);
} else if (tagName === 'ARTICLE' || role === 'article') {
regions.articles.push(current);
}
// Check class patterns
if (this.matchesPattern(className, this.patterns.header)) {
regions.headers.push(current);
} else if (this.matchesPattern(className, this.patterns.navigation)) {
regions.navigation.push(current);
} else if (this.matchesPattern(className, this.patterns.sidebar)) {
regions.sidebars.push(current);
} else if (this.matchesPattern(className, this.patterns.footer)) {
regions.footers.push(current);
}
current = current.parentElement;
}
}
// Deduplicate
for (const key of Object.keys(regions)) {
regions[key] = Array.from(new Set(regions[key]));
}
return regions;
}
analyzeRelationships(elements) {
const relationships = {
siblings: [],
parents: [],
children: [],
relatedByClass: new Map(),
relatedByStructure: []
};
// Find sibling relationships
for (let i = 0; i < elements.length; i++) {
for (let j = i + 1; j < elements.length; j++) {
if (elements[i].parentElement === elements[j].parentElement) {
relationships.siblings.push([elements[i], elements[j]]);
}
}
}
// Find parent-child relationships
for (const element of elements) {
for (const other of elements) {
if (element !== other) {
if (element.contains(other)) {
relationships.parents.push({ parent: element, child: other });
} else if (other.contains(element)) {
relationships.children.push({ parent: other, child: element });
}
}
}
}
// Group by similar classes
for (const element of elements) {
const classes = Array.from(element.classList);
for (const className of classes) {
if (!relationships.relatedByClass.has(className)) {
relationships.relatedByClass.set(className, []);
}
relationships.relatedByClass.get(className).push(element);
}
}
// Find structurally similar elements
for (let i = 0; i < elements.length; i++) {
for (let j = i + 1; j < elements.length; j++) {
if (this.areStructurallySimilar(elements[i], elements[j])) {
relationships.relatedByStructure.push([elements[i], elements[j]]);
}
}
}
return relationships;
}
areStructurallySimilar(element1, element2) {
// Same tag name
if (element1.tagName !== element2.tagName) {
return false;
}
// Similar class structure
const classes1 = Array.from(element1.classList).sort();
const classes2 = Array.from(element2.classList).sort();
// At least 50% overlap in classes
const intersection = classes1.filter(c => classes2.includes(c));
const union = Array.from(new Set([...classes1, ...classes2]));
if (union.length > 0 && intersection.length / union.length >= 0.5) {
return true;
}
// Similar child structure
if (element1.children.length === element2.children.length) {
const childTags1 = Array.from(element1.children).map(c => c.tagName).sort();
const childTags2 = Array.from(element2.children).map(c => c.tagName).sort();
if (JSON.stringify(childTags1) === JSON.stringify(childTags2)) {
return true;
}
}
return false;
}
extractMetadata(elements) {
const metadata = {
title: null,
description: null,
author: null,
date: null,
tags: [],
microdata: []
};
for (const element of elements) {
// Look for title
const h1 = element.querySelector('h1');
if (h1 && !metadata.title) {
metadata.title = h1.textContent.trim();
}
// Look for meta information
const metaElements = element.querySelectorAll('[itemprop], [property], [name]');
for (const meta of metaElements) {
const prop = meta.getAttribute('itemprop') ||
meta.getAttribute('property') ||
meta.getAttribute('name');
const content = meta.getAttribute('content') || meta.textContent.trim();
if (prop && content) {
if (prop.includes('author')) {
metadata.author = content;
} else if (prop.includes('date') || prop.includes('time')) {
metadata.date = content;
} else if (prop.includes('description')) {
metadata.description = content;
} else if (prop.includes('tag') || prop.includes('keyword')) {
metadata.tags.push(content);
}
metadata.microdata.push({ property: prop, value: content });
}
}
// Look for time elements
const timeElements = element.querySelectorAll('time');
for (const time of timeElements) {
if (!metadata.date && time.dateTime) {
metadata.date = time.dateTime;
}
}
}
return metadata;
}
determineLayout(elements) {
// Check if elements form a grid
const positions = elements.map(el => {
const rect = el.getBoundingClientRect();
return { x: rect.left, y: rect.top, width: rect.width, height: rect.height };
});
// Check for grid layout (multiple elements on same row)
const rows = new Map();
for (const pos of positions) {
const row = Math.round(pos.y / 10) * 10; // Round to nearest 10px
if (!rows.has(row)) {
rows.set(row, []);
}
rows.get(row).push(pos);
}
// If multiple elements share rows, it's likely a grid
const hasGrid = Array.from(rows.values()).some(row => row.length > 1);
if (hasGrid) {
return 'grid';
}
// Check for mixed layout (significant variation in widths)
const widths = positions.map(p => p.width);
const avgWidth = widths.reduce((a, b) => a + b, 0) / widths.length;
const variance = widths.reduce((sum, w) => sum + Math.pow(w - avgWidth, 2), 0) / widths.length;
const stdDev = Math.sqrt(variance);
if (stdDev / avgWidth > 0.3) {
return 'mixed';
}
return 'linear';
}
calculateMaxDepth(elements) {
let maxDepth = 0;
for (const element of elements) {
const depth = this.getElementDepth(element);
maxDepth = Math.max(maxDepth, depth);
}
return maxDepth;
}
getElementDepth(element, depth = 0) {
if (element.children.length === 0) {
return depth;
}
let maxChildDepth = depth;
for (const child of element.children) {
const childDepth = this.getElementDepth(child, depth + 1);
maxChildDepth = Math.max(maxChildDepth, childDepth);
}
return maxChildDepth;
}
findCommonAncestor(elements) {
if (elements.length === 0) return null;
if (elements.length === 1) return elements[0].parentElement;
// Start with the first element's ancestors
let ancestor = elements[0];
const ancestors = [];
while (ancestor) {
ancestors.push(ancestor);
ancestor = ancestor.parentElement;
}
// Find the deepest common ancestor
for (const ancestorCandidate of ancestors) {
let isCommon = true;
for (const element of elements) {
if (!ancestorCandidate.contains(element)) {
isCommon = false;
break;
}
}
if (isCommon) {
return ancestorCandidate;
}
}
return document.body;
}
matchesPattern(text, patterns) {
return patterns.some(pattern => text.includes(pattern));
}
}

View File

@@ -0,0 +1,718 @@
class MarkdownConverter {
constructor() {
// Conversion handlers for different element types
this.converters = {
'H1': async (el, ctx) => await this.convertHeading(el, 1, ctx),
'H2': async (el, ctx) => await this.convertHeading(el, 2, ctx),
'H3': async (el, ctx) => await this.convertHeading(el, 3, ctx),
'H4': async (el, ctx) => await this.convertHeading(el, 4, ctx),
'H5': async (el, ctx) => await this.convertHeading(el, 5, ctx),
'H6': async (el, ctx) => await this.convertHeading(el, 6, ctx),
'P': async (el, ctx) => await this.convertParagraph(el, ctx),
'A': async (el, ctx) => await this.convertLink(el, ctx),
'IMG': async (el, ctx) => await this.convertImage(el, ctx),
'UL': async (el, ctx) => await this.convertList(el, 'ul', ctx),
'OL': async (el, ctx) => await this.convertList(el, 'ol', ctx),
'LI': async (el, ctx) => await this.convertListItem(el, ctx),
'TABLE': async (el, ctx) => await this.convertTable(el, ctx),
'BLOCKQUOTE': async (el, ctx) => await this.convertBlockquote(el, ctx),
'PRE': async (el, ctx) => await this.convertPreformatted(el, ctx),
'CODE': async (el, ctx) => await this.convertCode(el, ctx),
'HR': async (el, ctx) => '\n---\n',
'BR': async (el, ctx) => ' \n',
'STRONG': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
'B': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
'EM': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
'I': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
'DEL': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
'S': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
'DIV': async (el, ctx) => await this.convertDiv(el, ctx),
'SPAN': async (el, ctx) => await this.convertSpan(el, ctx),
'ARTICLE': async (el, ctx) => await this.convertArticle(el, ctx),
'SECTION': async (el, ctx) => await this.convertSection(el, ctx),
'FIGURE': async (el, ctx) => await this.convertFigure(el, ctx),
'FIGCAPTION': async (el, ctx) => await this.convertFigCaption(el, ctx),
'VIDEO': async (el, ctx) => await this.convertVideo(el, ctx),
'IFRAME': async (el, ctx) => await this.convertIframe(el, ctx),
'DL': async (el, ctx) => await this.convertDefinitionList(el, ctx),
'DT': async (el, ctx) => await this.convertDefinitionTerm(el, ctx),
'DD': async (el, ctx) => await this.convertDefinitionDescription(el, ctx),
'TR': async (el, ctx) => await this.convertTableRow(el, ctx)
};
// Maintain context during conversion
this.conversionContext = {
listDepth: 0,
inTable: false,
inCode: false,
preserveWhitespace: false,
references: [],
imageCount: 0,
linkCount: 0
};
}
async convert(elements, options = {}) {
// Reset context
this.resetContext();
// Apply options
this.options = {
includeImages: true,
preserveTables: true,
keepCodeFormatting: true,
simplifyLayout: false,
preserveLinks: true,
...options
};
// Convert elements
const markdownParts = [];
for (const element of elements) {
const markdown = await this.convertElement(element, this.conversionContext);
if (markdown.trim()) {
markdownParts.push(markdown);
}
}
// Join parts with appropriate spacing
let result = markdownParts.join('\n\n');
// Add references if using reference-style links
if (this.conversionContext.references.length > 0) {
result += '\n\n' + this.generateReferences();
}
// Post-process to clean up
result = this.postProcess(result);
return result;
}
resetContext() {
this.conversionContext = {
listDepth: 0,
inTable: false,
inCode: false,
preserveWhitespace: false,
references: [],
imageCount: 0,
linkCount: 0
};
}
async convertElement(element, context) {
// Skip hidden elements
if (this.isHidden(element)) {
return '';
}
// Skip script and style elements
if (['SCRIPT', 'STYLE', 'NOSCRIPT'].includes(element.tagName)) {
return '';
}
// Get converter for this element type
const converter = this.converters[element.tagName];
if (converter) {
return await converter(element, context);
} else {
// For unknown elements, process children
return await this.processChildren(element, context);
}
}
async processChildren(element, context) {
const parts = [];
for (const child of element.childNodes) {
if (child.nodeType === Node.TEXT_NODE) {
const text = this.processTextNode(child, context);
if (text) {
parts.push(text);
}
} else if (child.nodeType === Node.ELEMENT_NODE) {
const markdown = await this.convertElement(child, context);
if (markdown) {
parts.push(markdown);
}
}
}
return parts.join('');
}
processTextNode(node, context) {
let text = node.textContent;
// Preserve whitespace in code blocks
if (!context.preserveWhitespace && !context.inCode) {
// Normalize whitespace
text = text.replace(/\s+/g, ' ');
// Trim if at block boundaries
if (this.isBlockBoundary(node.previousSibling)) {
text = text.trimStart();
}
if (this.isBlockBoundary(node.nextSibling)) {
text = text.trimEnd();
}
}
// Escape markdown characters
if (!context.inCode) {
text = this.escapeMarkdown(text);
}
return text;
}
isBlockBoundary(node) {
if (!node || node.nodeType !== Node.ELEMENT_NODE) {
return true;
}
const blockElements = [
'DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6',
'UL', 'OL', 'LI', 'BLOCKQUOTE', 'PRE', 'TABLE',
'HR', 'ARTICLE', 'SECTION', 'HEADER', 'FOOTER',
'NAV', 'ASIDE', 'MAIN'
];
return blockElements.includes(node.tagName);
}
escapeMarkdown(text) {
// In text-only mode, don't escape characters
if (this.options.textOnly) {
return text;
}
// Escape special markdown characters
return text
.replace(/\\/g, '\\\\')
.replace(/\*/g, '\\*')
.replace(/_/g, '\\_')
.replace(/\[/g, '\\[')
.replace(/\]/g, '\\]')
.replace(/\(/g, '\\(')
.replace(/\)/g, '\\)')
.replace(/\#/g, '\\#')
.replace(/\+/g, '\\+')
.replace(/\-/g, '\\-')
.replace(/\./g, '\\.')
.replace(/\!/g, '\\!')
.replace(/\|/g, '\\|');
}
async convertHeading(element, level, context) {
const text = await this.getTextContent(element, context);
return '#'.repeat(level) + ' ' + text + '\n';
}
async convertParagraph(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertLink(element, context) {
if (!this.options.preserveLinks || this.options.textOnly) {
return await this.getTextContent(element, context);
}
const text = await this.getTextContent(element, context);
const href = element.getAttribute('href');
const title = element.getAttribute('title');
if (!href) {
return text;
}
// Convert relative URLs to absolute
const absoluteUrl = this.makeAbsoluteUrl(href);
// Use reference-style links for cleaner markdown
if (text && absoluteUrl) {
if (title) {
return `[${text}](${absoluteUrl} "${title}")`;
} else {
return `[${text}](${absoluteUrl})`;
}
}
return text;
}
async convertImage(element, context) {
if (!this.options.includeImages || this.options.textOnly) {
// In text-only mode, return alt text if available
if (this.options.textOnly) {
const alt = element.getAttribute('alt');
return alt ? `[Image: ${alt}]` : '';
}
return '';
}
const src = element.getAttribute('src');
const alt = element.getAttribute('alt') || '';
const title = element.getAttribute('title');
if (!src) {
return '';
}
// Convert relative URLs to absolute
const absoluteUrl = this.makeAbsoluteUrl(src);
if (title) {
return `![${alt}](${absoluteUrl} "${title}")`;
} else {
return `![${alt}](${absoluteUrl})`;
}
}
async convertList(element, type, context) {
const oldDepth = context.listDepth;
context.listDepth++;
const items = [];
for (const child of element.children) {
if (child.tagName === 'LI') {
const markdown = await this.convertListItem(child, { ...context, listType: type });
if (markdown) {
items.push(markdown);
}
}
}
context.listDepth = oldDepth;
return items.join('\n') + (context.listDepth === 0 ? '\n' : '');
}
async convertListItem(element, context) {
const indent = ' '.repeat(Math.max(0, context.listDepth - 1));
const bullet = context.listType === 'ol' ? '1.' : '-';
const content = (await this.processChildren(element, context)).trim();
return `${indent}${bullet} ${content}`;
}
async convertTable(element, context) {
if (!this.options.preserveTables || this.options.textOnly) {
// Fallback to simple text representation
return await this.convertTableToText(element, context);
}
const rows = [];
const headerRows = [];
let maxCols = 0;
// Process table rows
for (const child of element.children) {
if (child.tagName === 'THEAD') {
for (const row of child.children) {
if (row.tagName === 'TR') {
const cells = await this.processTableRow(row, context);
headerRows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
} else if (child.tagName === 'TBODY') {
for (const row of child.children) {
if (row.tagName === 'TR') {
const cells = await this.processTableRow(row, context);
rows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
} else if (child.tagName === 'TR') {
const cells = await this.processTableRow(child, context);
rows.push(cells);
maxCols = Math.max(maxCols, cells.length);
}
}
// Build markdown table
const markdownRows = [];
// Add headers
if (headerRows.length > 0) {
for (const headerRow of headerRows) {
const paddedRow = this.padTableRow(headerRow, maxCols);
markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
}
// Add separator
const separator = Array(maxCols).fill('---');
markdownRows.push('| ' + separator.join(' | ') + ' |');
}
// Add body rows
for (const row of rows) {
const paddedRow = this.padTableRow(row, maxCols);
markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
}
return markdownRows.join('\n') + '\n';
}
async processTableRow(row, context) {
const cells = [];
for (const cell of row.children) {
if (cell.tagName === 'TD' || cell.tagName === 'TH') {
const content = (await this.getTextContent(cell, context)).trim();
cells.push(content);
}
}
return cells;
}
async convertTableRow(element, context) {
// Convert a single table row to markdown
if (this.options.textOnly) {
const cells = await this.processTableRow(element, context);
return cells.join(' ');
}
// For non-text-only mode, create a simple table representation
const cells = await this.processTableRow(element, context);
return '| ' + cells.join(' | ') + ' |';
}
padTableRow(row, targetLength) {
const padded = [...row];
while (padded.length < targetLength) {
padded.push('');
}
return padded;
}
async convertTableToText(element, context) {
// Convert table to clean text representation
const lines = [];
const rows = element.querySelectorAll('tr');
for (const row of rows) {
const cells = row.querySelectorAll('td, th');
const cellTexts = [];
for (const cell of cells) {
const text = (await this.getTextContent(cell, context)).trim();
if (text) {
cellTexts.push(text);
}
}
if (cellTexts.length > 0) {
// Join cells with space, handling common patterns
lines.push(cellTexts.join(' '));
}
}
return lines.join('\n');
}
async convertBlockquote(element, context) {
const lines = (await this.processChildren(element, context)).trim().split('\n');
return lines.map(line => '> ' + line).join('\n') + '\n';
}
async convertPreformatted(element, context) {
const oldInCode = context.inCode;
const oldPreserveWhitespace = context.preserveWhitespace;
context.inCode = true;
context.preserveWhitespace = true;
let content = '';
let language = '';
// Check if this is a code block with language
const codeElement = element.querySelector('code');
if (codeElement) {
// Try to detect language from class
const className = codeElement.className;
const langMatch = className.match(/language-(\w+)/);
if (langMatch) {
language = langMatch[1];
}
content = codeElement.textContent;
} else {
content = element.textContent;
}
context.inCode = oldInCode;
context.preserveWhitespace = oldPreserveWhitespace;
// Use fenced code blocks
return '```' + language + '\n' + content + '\n```\n';
}
async convertCode(element, context) {
if (element.parentElement && element.parentElement.tagName === 'PRE') {
// Already handled by convertPreformatted
return element.textContent;
}
const content = element.textContent;
return '`' + content + '`';
}
async convertDiv(element, context) {
// Check for special div types
if (element.className.includes('code-block') ||
element.className.includes('highlight')) {
return await this.convertPreformatted(element, context);
}
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertSpan(element, context) {
// Check for special span types
if (element.className.includes('code') ||
element.className.includes('inline-code')) {
return this.convertCode(element, context);
}
return await this.processChildren(element, context);
}
async convertArticle(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertSection(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertFigure(element, context) {
const content = await this.processChildren(element, context);
return content.trim() ? content + '\n' : '';
}
async convertFigCaption(element, context) {
const caption = await this.getTextContent(element, context);
return caption ? '\n*' + caption + '*\n' : '';
}
async convertVideo(element, context) {
const title = element.getAttribute('title') || 'Video';
if (this.options.textOnly) {
return `[Video: ${title}]`;
}
const src = element.getAttribute('src');
const poster = element.getAttribute('poster');
if (!src) {
return '';
}
// Convert to markdown with poster image if available
if (poster) {
const absolutePoster = this.makeAbsoluteUrl(poster);
const absoluteSrc = this.makeAbsoluteUrl(src);
return `[![${title}](${absolutePoster})](${absoluteSrc})`;
} else {
const absoluteSrc = this.makeAbsoluteUrl(src);
return `[${title}](${absoluteSrc})`;
}
}
async convertIframe(element, context) {
const title = element.getAttribute('title') || 'Embedded content';
if (this.options.textOnly) {
const src = element.getAttribute('src') || '';
// YouTube and Vimeo embeds collapse to the same text-only label
if (src.includes('youtube.com') || src.includes('youtu.be') || src.includes('vimeo.com')) {
return `[Video: ${title}]`;
} else {
return `[Embedded: ${title}]`;
}
}
const src = element.getAttribute('src');
if (!src) {
return '';
}
// Check for common embeds
if (src.includes('youtube.com') || src.includes('youtu.be')) {
return `[▶️ ${title}](${src})`;
} else if (src.includes('vimeo.com')) {
return `[▶️ ${title}](${src})`;
} else {
return `[${title}](${src})`;
}
}
async convertDefinitionList(element, context) {
return await this.processChildren(element, context) + '\n';
}
async convertDefinitionTerm(element, context) {
const term = await this.getTextContent(element, context);
return '**' + term + '**\n';
}
async convertDefinitionDescription(element, context) {
const description = await this.processChildren(element, context);
return ': ' + description + '\n';
}
async getTextContent(element, context) {
// Special handling for elements that might contain other markdown
if (context.inCode) {
return element.textContent;
}
return await this.processChildren(element, context);
}
makeAbsoluteUrl(url) {
if (!url) return '';
try {
// Check if already absolute
if (url.startsWith('http://') || url.startsWith('https://')) {
return url;
}
// Handle protocol-relative URLs
if (url.startsWith('//')) {
return window.location.protocol + url;
}
// Convert relative to absolute
const base = window.location.origin;
const path = window.location.pathname;
if (url.startsWith('/')) {
return base + url;
} else {
// Relative to current path
const pathDir = path.substring(0, path.lastIndexOf('/') + 1);
return base + pathDir + url;
}
} catch (e) {
return url;
}
}
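// A briefer alternative (an assumption, not part of this implementation)
// would delegate to the built-in URL constructor, which resolves
// protocol-relative and path-relative URLs the same way:
//
//   makeAbsoluteUrlViaURL(url) {
//     try {
//       return new URL(url, window.location.href).href;
//     } catch (e) {
//       return url; // leave malformed URLs untouched
//     }
//   }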
isHidden(element) {
const style = window.getComputedStyle(element);
return style.display === 'none' ||
style.visibility === 'hidden' ||
style.opacity === '0';
}
generateReferences() {
return this.conversionContext.references
.map((ref, index) => `[${index + 1}]: ${ref.url}`)
.join('\n');
}
postProcess(markdown) {
// Apply text-only specific processing
if (this.options.textOnly) {
markdown = this.postProcessTextOnly(markdown);
}
// Clean up excessive newlines
markdown = markdown.replace(/\n{3,}/g, '\n\n');
// Clean up spaces before punctuation
markdown = markdown.replace(/ +([.,;:!?])/g, '$1');
// Ensure proper spacing around headers
markdown = markdown.replace(/\n(#{1,6} )/g, '\n\n$1');
markdown = markdown.replace(/(#{1,6} .+)\n(?![\n#])/g, '$1\n\n');
// Clean up list spacing
markdown = markdown.replace(/\n\n(-|\d+\.) /g, '\n$1 ');
// Trim final result
return markdown.trim();
}
postProcessTextOnly(markdown) {
// Smart pattern recognition for common formats
const lines = markdown.split('\n');
const processedLines = [];
let inMetadata = false;
let currentItem = null;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
if (!line) {
processedLines.push('');
continue;
}
// Detect numbered list items (common in HN, Reddit, etc.)
const numberPattern = /^(\d+)\.\s*(.+)$/;
const numberMatch = line.match(numberPattern);
if (numberMatch) {
// Start of a new numbered item
inMetadata = false;
currentItem = numberMatch[1];
const content = numberMatch[2];
// Check if content has domain in parentheses
const domainPattern = /^(.+?)\s*\(([^)]+)\)\s*(.*)$/;
const domainMatch = content.match(domainPattern);
if (domainMatch) {
const [, title, domain, rest] = domainMatch;
processedLines.push(`${currentItem}. **${title.trim()}** (${domain})`);
if (rest.trim()) {
processedLines.push(` ${rest.trim()}`);
inMetadata = true;
}
} else {
processedLines.push(`${currentItem}. **${content}**`);
}
} else if (line.match(/\b(points?|by|ago|hide|comments?)\b/i) && currentItem) {
// This looks like metadata for the current item
const cleanedLine = line
.replace(/\s+/g, ' ')
.replace(/\s*\|\s*/g, ' | ')
.trim();
processedLines.push(` ${cleanedLine}`);
inMetadata = true;
} else if (inMetadata && line.length < 100) {
// Continue metadata if we're in metadata mode and line is short
processedLines.push(` ${line}`);
} else {
// Regular content
inMetadata = false;
processedLines.push(line);
}
}
// Clean up the output
let result = processedLines.join('\n');
// Remove excessive blank lines
result = result.replace(/\n{3,}/g, '\n\n');
// Ensure proper spacing after numbered items
result = result.replace(/^(\d+\..+)$\n^(?!\s)/gm, '$1\n\n');
return result;
}
}
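// Usage sketch (hedged): how a caller might drive this converter. The
// convert(elements, options) entry point and the option names are inferred
// from the MarkdownExtraction class below; treat them as assumptions.
//
//   const converter = new MarkdownConverter();
//   const markdown = await converter.convert(
//     [document.querySelector('article')],
//     { preserveLinks: true, includeImages: true, textOnly: false }
//   );
//   console.log(markdown);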

View File

@@ -0,0 +1,701 @@
class MarkdownExtraction {
constructor() {
this.selectedElements = new Set();
this.highlightBoxes = new Map();
this.selectionMode = false;
this.toolbar = null;
this.markdownPreviewModal = null;
this.selectionCounter = 0;
this.markdownConverter = null;
this.contentAnalyzer = null;
this.init();
}
async init() {
// Initialize dependencies
this.markdownConverter = new MarkdownConverter();
this.contentAnalyzer = new ContentAnalyzer();
this.createToolbar();
this.setupEventListeners();
}
createToolbar() {
// Create floating toolbar
this.toolbar = document.createElement('div');
this.toolbar.className = 'c4ai-c2c-toolbar';
this.toolbar.innerHTML = `
<div class="c4ai-toolbar-header">
<div class="c4ai-toolbar-dots">
<span class="c4ai-dot c4ai-dot-red"></span>
<span class="c4ai-dot c4ai-dot-yellow"></span>
<span class="c4ai-dot c4ai-dot-green"></span>
</div>
<span class="c4ai-toolbar-title">Markdown Extraction</span>
<button class="c4ai-close-btn" title="Close">×</button>
</div>
<div class="c4ai-toolbar-content">
<div class="c4ai-selection-info">
<span class="c4ai-selection-count">0 elements selected</span>
<button class="c4ai-clear-btn" title="Clear selection" disabled>Clear</button>
</div>
<div class="c4ai-toolbar-actions">
<button class="c4ai-preview-btn" disabled>Preview Markdown</button>
<button class="c4ai-copy-btn" disabled>Copy to Clipboard</button>
</div>
<div class="c4ai-toolbar-instructions">
<p>💡 <strong>Ctrl/Cmd + Click</strong> to select multiple elements</p>
<p>📝 Selected elements will be converted to clean markdown</p>
<p>⌨️ Press <strong>ESC</strong> to exit</p>
</div>
</div>
`;
document.body.appendChild(this.toolbar);
makeDraggableByHeader(this.toolbar);
// Position toolbar
this.toolbar.style.position = 'fixed';
this.toolbar.style.top = '20px';
this.toolbar.style.right = '20px';
this.toolbar.style.zIndex = '999999';
}
setupEventListeners() {
// Close button
this.toolbar.querySelector('.c4ai-close-btn').addEventListener('click', () => {
this.deactivate();
});
// Clear selection button
this.toolbar.querySelector('.c4ai-clear-btn').addEventListener('click', () => {
this.clearSelection();
});
// Preview button
this.toolbar.querySelector('.c4ai-preview-btn').addEventListener('click', () => {
this.showPreview();
});
// Copy button
this.toolbar.querySelector('.c4ai-copy-btn').addEventListener('click', () => {
this.copyToClipboard();
});
// Document click handler for element selection
this.documentClickHandler = (event) => this.handleElementClick(event);
document.addEventListener('click', this.documentClickHandler, true);
// Prevent default link behavior during selection mode
this.linkClickHandler = (event) => {
if (event.ctrlKey || event.metaKey) {
event.preventDefault();
event.stopPropagation();
}
};
document.addEventListener('click', this.linkClickHandler, true);
// Hover effect
this.documentHoverHandler = (event) => this.handleElementHover(event);
document.addEventListener('mouseover', this.documentHoverHandler, true);
// Remove hover on mouseout
this.documentMouseOutHandler = (event) => this.handleElementMouseOut(event);
document.addEventListener('mouseout', this.documentMouseOutHandler, true);
// Keyboard shortcuts
this.keyboardHandler = (event) => this.handleKeyboard(event);
document.addEventListener('keydown', this.keyboardHandler);
}
handleElementClick(event) {
// Check if Ctrl/Cmd is pressed
if (!event.ctrlKey && !event.metaKey) return;
// Prevent default behavior
event.preventDefault();
event.stopPropagation();
const element = event.target;
// Don't select our own UI elements
if (element.closest('.c4ai-c2c-toolbar') ||
element.closest('.c4ai-c2c-preview') ||
element.closest('.c4ai-highlight-box')) {
return;
}
// Toggle element selection
if (this.selectedElements.has(element)) {
this.deselectElement(element);
} else {
this.selectElement(element);
}
this.updateUI();
}
handleElementHover(event) {
const element = event.target;
// Don't hover our own UI elements
if (element.closest('.c4ai-c2c-toolbar') ||
element.closest('.c4ai-c2c-preview') ||
element.closest('.c4ai-highlight-box') ||
element.hasAttribute('data-c4ai-badge')) {
return;
}
// Add hover class
element.classList.add('c4ai-hover-candidate');
}
handleElementMouseOut(event) {
const element = event.target;
element.classList.remove('c4ai-hover-candidate');
}
handleKeyboard(event) {
// ESC to deactivate
if (event.key === 'Escape') {
this.deactivate();
}
// Ctrl/Cmd + A to select all visible elements
else if ((event.ctrlKey || event.metaKey) && event.key === 'a') {
event.preventDefault();
// Select all visible text-containing elements
const elements = document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, th, div, span, article, section');
elements.forEach(el => {
if (el.textContent.trim() && this.isVisible(el) && !this.selectedElements.has(el)) {
this.selectElement(el);
}
});
this.updateUI();
}
}
isVisible(element) {
const rect = element.getBoundingClientRect();
const style = window.getComputedStyle(element);
return rect.width > 0 &&
rect.height > 0 &&
style.display !== 'none' &&
style.visibility !== 'hidden' &&
style.opacity !== '0';
}
selectElement(element) {
this.selectedElements.add(element);
// Create highlight box
const box = this.createHighlightBox(element);
this.highlightBoxes.set(element, box);
// Add selected class
element.classList.add('c4ai-selected');
this.selectionCounter++;
}
deselectElement(element) {
this.selectedElements.delete(element);
// Remove highlight box (badge)
const badge = this.highlightBoxes.get(element);
if (badge) {
// Remove scroll/resize listeners
if (badge._updatePosition) {
window.removeEventListener('scroll', badge._updatePosition, true);
window.removeEventListener('resize', badge._updatePosition);
}
badge.remove();
this.highlightBoxes.delete(element);
}
// Remove outline
element.style.outline = '';
element.style.outlineOffset = '';
// Remove attributes
element.removeAttribute('data-c4ai-selection-order');
element.classList.remove('c4ai-selected');
this.selectionCounter--;
}
createHighlightBox(element) {
// Add a data attribute to track selection order
element.setAttribute('data-c4ai-selection-order', this.selectionCounter + 1);
// Add selection outline directly to the element
element.style.outline = '2px solid #0fbbaa';
element.style.outlineOffset = '2px';
// Create badge with fixed positioning
const badge = document.createElement('div');
badge.className = 'c4ai-selection-badge-fixed';
badge.textContent = this.selectionCounter + 1;
badge.setAttribute('data-c4ai-badge', 'true');
badge.title = 'Click to deselect';
// Get element position and set badge position
const rect = element.getBoundingClientRect();
badge.style.cssText = `
position: fixed !important;
top: ${rect.top - 12}px !important;
left: ${rect.left - 12}px !important;
width: 24px !important;
height: 24px !important;
background: #0fbbaa !important;
color: #070708 !important;
border-radius: 50% !important;
display: flex !important;
align-items: center !important;
justify-content: center !important;
font-size: 12px !important;
font-weight: bold !important;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif !important;
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3) !important;
z-index: 999998 !important;
cursor: pointer !important;
transition: all 0.2s ease !important;
pointer-events: auto !important;
border: none !important;
padding: 0 !important;
margin: 0 !important;
line-height: 1 !important;
text-align: center !important;
text-decoration: none !important;
box-sizing: border-box !important;
`;
// Add hover styles dynamically
badge.addEventListener('mouseenter', () => {
badge.style.setProperty('background', '#ff3c74', 'important');
badge.style.setProperty('transform', 'scale(1.1)', 'important');
});
badge.addEventListener('mouseleave', () => {
badge.style.setProperty('background', '#0fbbaa', 'important');
badge.style.setProperty('transform', 'scale(1)', 'important');
});
// Add click handler to badge for deselection
badge.addEventListener('click', (e) => {
e.stopPropagation();
e.preventDefault();
this.deselectElement(element);
this.updateUI();
});
// Add scroll listener to update position
const updatePosition = () => {
const newRect = element.getBoundingClientRect();
badge.style.top = `${newRect.top - 12}px`;
badge.style.left = `${newRect.left - 12}px`;
};
// Store the update function so we can remove it later
badge._updatePosition = updatePosition;
window.addEventListener('scroll', updatePosition, true);
window.addEventListener('resize', updatePosition);
document.body.appendChild(badge);
return badge;
}
clearSelection() {
// Clear all selections
this.selectedElements.forEach(element => {
// Remove badge
const badge = this.highlightBoxes.get(element);
if (badge) {
// Remove scroll/resize listeners
if (badge._updatePosition) {
window.removeEventListener('scroll', badge._updatePosition, true);
window.removeEventListener('resize', badge._updatePosition);
}
badge.remove();
}
// Remove outline
element.style.outline = '';
element.style.outlineOffset = '';
// Remove attributes
element.removeAttribute('data-c4ai-selection-order');
element.classList.remove('c4ai-selected');
});
this.selectedElements.clear();
this.highlightBoxes.clear();
this.selectionCounter = 0;
this.updateUI();
}
updateUI() {
const count = this.selectedElements.size;
// Update selection count
this.toolbar.querySelector('.c4ai-selection-count').textContent =
`${count} element${count !== 1 ? 's' : ''} selected`;
// Enable/disable buttons
const hasSelection = count > 0;
this.toolbar.querySelector('.c4ai-preview-btn').disabled = !hasSelection;
this.toolbar.querySelector('.c4ai-copy-btn').disabled = !hasSelection;
this.toolbar.querySelector('.c4ai-clear-btn').disabled = !hasSelection;
}
async showPreview() {
// Initialize markdown preview modal if not already done
if (!this.markdownPreviewModal) {
this.markdownPreviewModal = new MarkdownPreviewModal();
}
// Show modal with callback to generate markdown
this.markdownPreviewModal.show(async (options) => {
return await this.generateMarkdown(options);
});
}
/* createPreviewPanel() {
this.previewPanel = document.createElement('div');
this.previewPanel.className = 'c4ai-c2c-preview';
this.previewPanel.innerHTML = `
<div class="c4ai-preview-header">
<div class="c4ai-toolbar-dots">
<span class="c4ai-dot c4ai-dot-red"></span>
<span class="c4ai-dot c4ai-dot-yellow"></span>
<span class="c4ai-dot c4ai-dot-green"></span>
</div>
<span class="c4ai-preview-title">Markdown Preview</span>
<button class="c4ai-preview-close">×</button>
</div>
<div class="c4ai-preview-options">
<label><input type="checkbox" name="textOnly"> 👁️ Visual Text Mode (As You See) TRY THIS!!!</label>
<label><input type="checkbox" name="includeImages" checked> Include Images</label>
<label><input type="checkbox" name="preserveTables" checked> Preserve Tables</label>
<label><input type="checkbox" name="preserveLinks" checked> Preserve Links</label>
<label><input type="checkbox" name="keepCodeFormatting" checked> Keep Code Formatting</label>
<label><input type="checkbox" name="simplifyLayout"> Simplify Layout</label>
<label><input type="checkbox" name="addSeparators" checked> Add Separators</label>
<label><input type="checkbox" name="includeXPath"> Include XPath Headers</label>
</div>
<div class="c4ai-preview-content">
<div class="c4ai-preview-tabs">
<button class="c4ai-tab active" data-tab="preview">Preview</button>
<button class="c4ai-tab" data-tab="markdown">Markdown</button>
<button class="c4ai-wrap-toggle" title="Toggle word wrap">↔️ Wrap</button>
</div>
<div class="c4ai-preview-pane active" data-pane="preview"></div>
<div class="c4ai-preview-pane" data-pane="markdown"></div>
</div>
<div class="c4ai-preview-actions">
<button class="c4ai-download-btn">Download .md</button>
<button class="c4ai-copy-markdown-btn">Copy Markdown</button>
<button class="c4ai-cloud-btn" disabled>Send to Cloud (Coming Soon)</button>
</div>
`;
document.body.appendChild(this.previewPanel);
makeDraggableByHeader(this.previewPanel);
// Position preview panel
this.previewPanel.style.position = 'fixed';
this.previewPanel.style.top = '50%';
this.previewPanel.style.left = '50%';
this.previewPanel.style.transform = 'translate(-50%, -50%)';
this.previewPanel.style.zIndex = '999999';
this.setupPreviewEventListeners();
} */
/* setupPreviewEventListeners() {
// Close button
this.previewPanel.querySelector('.c4ai-preview-close').addEventListener('click', () => {
this.previewPanel.style.display = 'none';
});
// Tab switching
this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.addEventListener('click', (e) => {
const tabName = e.target.dataset.tab;
this.switchPreviewTab(tabName);
});
});
// Wrap toggle
const wrapToggle = this.previewPanel.querySelector('.c4ai-wrap-toggle');
wrapToggle.addEventListener('click', () => {
const panes = this.previewPanel.querySelectorAll('.c4ai-preview-pane');
panes.forEach(pane => {
pane.classList.toggle('wrap');
});
wrapToggle.classList.toggle('active');
});
// Options change
this.previewPanel.querySelectorAll('input[type="checkbox"]').forEach(checkbox => {
checkbox.addEventListener('change', async (e) => {
this.options[e.target.name] = e.target.checked;
// If text-only is enabled, automatically disable certain options
if (e.target.name === 'textOnly' && e.target.checked) {
// Update UI checkboxes
const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.checked = false;
preserveLinksCheckbox.disabled = true;
}
// Optionally disable images in text-only mode
const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = true;
}
} else if (e.target.name === 'textOnly' && !e.target.checked) {
// Re-enable options when text-only is disabled
const preserveLinksCheckbox = this.previewPanel.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.disabled = false;
}
const includeImagesCheckbox = this.previewPanel.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = false;
}
}
const markdown = await this.generateMarkdown();
await this.updatePreviewContent(markdown);
});
});
// Action buttons
this.previewPanel.querySelector('.c4ai-copy-markdown-btn').addEventListener('click', () => {
this.copyToClipboard();
});
this.previewPanel.querySelector('.c4ai-download-btn').addEventListener('click', () => {
this.downloadMarkdown();
});
} */
/* switchPreviewTab(tabName) {
// Update active tab
this.previewPanel.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.classList.toggle('active', tab.dataset.tab === tabName);
});
// Update active pane
this.previewPanel.querySelectorAll('.c4ai-preview-pane').forEach(pane => {
pane.classList.toggle('active', pane.dataset.pane === tabName);
});
} */
/* async updatePreviewContent(markdown) {
// Update markdown pane
const markdownPane = this.previewPanel.querySelector('[data-pane="markdown"]');
markdownPane.innerHTML = `<pre><code>${this.escapeHtml(markdown)}</code></pre>`;
// Update preview pane using marked.js
const previewPane = this.previewPanel.querySelector('[data-pane="preview"]');
// Configure marked options (marked.js is already loaded via manifest)
if (window.marked) {
marked.setOptions({
gfm: true,
breaks: true,
tables: true,
headerIds: false,
mangle: false
});
// Render markdown to HTML
const html = marked.parse(markdown);
previewPane.innerHTML = `<div class="c4ai-markdown-preview">${html}</div>`;
} else {
// Fallback if marked.js is not available
previewPane.innerHTML = `<div class="c4ai-markdown-preview"><pre>${this.escapeHtml(markdown)}</pre></div>`;
}
} */
/* escapeHtml(unsafe) {
return unsafe
.replace(/&/g, "&amp;")
.replace(/</g, "&lt;")
.replace(/>/g, "&gt;")
.replace(/"/g, "&quot;")
.replace(/'/g, "&#039;");
} */
async generateMarkdown(options) {
// Get selected elements as array
const elements = Array.from(this.selectedElements);
// Sort elements by their selection order
const sortedElements = elements.sort((a, b) => {
const orderA = parseInt(a.getAttribute('data-c4ai-selection-order') || '0');
const orderB = parseInt(b.getAttribute('data-c4ai-selection-order') || '0');
return orderA - orderB;
});
// Convert each element separately
const markdownParts = [];
for (let i = 0; i < sortedElements.length; i++) {
const element = sortedElements[i];
// Add XPath header if enabled
if (options.includeXPath) {
const xpath = this.getXPath(element);
markdownParts.push(`### Element ${i + 1} - XPath: \`${xpath}\`\n`);
}
// Check if the element is part of a table structure that needs special handling
let elementsToConvert = [element];
// In text-only mode a selected TR is converted as a single row; the
// enclosing table is only processed when it was itself selected
if (options.textOnly && element.tagName === 'TR') {
const table = element.closest('table');
if (table && !sortedElements.includes(table)) {
// Keep just this row so unselected sibling rows are not duplicated
elementsToConvert = [element];
}
}
// Analyze and convert individual element
const analysis = await this.contentAnalyzer.analyze(elementsToConvert);
const markdown = await this.markdownConverter.convert(elementsToConvert, {
...options,
analysis
});
// Trim the markdown before adding
const trimmedMarkdown = markdown.trim();
markdownParts.push(trimmedMarkdown);
// Add separator if enabled and not last element
if (options.addSeparators && i < sortedElements.length - 1) {
markdownParts.push('\n---\n');
}
}
return markdownParts.join('\n');
}
getXPath(element) {
if (element.id) {
return `//*[@id="${element.id}"]`;
}
const parts = [];
let current = element;
while (current && current.nodeType === Node.ELEMENT_NODE) {
let index = 0;
let sibling = current.previousSibling;
while (sibling) {
if (sibling.nodeType === Node.ELEMENT_NODE && sibling.nodeName === current.nodeName) {
index++;
}
sibling = sibling.previousSibling;
}
const tagName = current.nodeName.toLowerCase();
const part = index > 0 ? `${tagName}[${index + 1}]` : tagName;
parts.unshift(part);
current = current.parentNode;
}
return '/' + parts.join('/');
}
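// Worked example: for the second <div> under <body> (no id on the path)
// this yields '/html/body/div[2]'; an element with an id short-circuits
// to '//*[@id="the-id"]'.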
sortElementsByPosition(elements) {
return elements.sort((a, b) => {
const position = a.compareDocumentPosition(b);
if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
return -1;
} else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
return 1;
}
return 0;
});
}
async copyToClipboard() {
if (this.markdownPreviewModal) {
await this.markdownPreviewModal.copyToClipboard();
}
}
async downloadMarkdown() {
if (this.markdownPreviewModal) {
await this.markdownPreviewModal.downloadMarkdown();
}
}
showNotification(message, type = 'success') {
const notification = document.createElement('div');
notification.className = `c4ai-notification c4ai-notification-${type}`;
notification.textContent = message;
document.body.appendChild(notification);
// Animate in
setTimeout(() => notification.classList.add('show'), 10);
// Remove after 3 seconds
setTimeout(() => {
notification.classList.remove('show');
setTimeout(() => notification.remove(), 300);
}, 3000);
}
deactivate() {
// Remove event listeners
document.removeEventListener('click', this.documentClickHandler, true);
document.removeEventListener('click', this.linkClickHandler, true);
document.removeEventListener('mouseover', this.documentHoverHandler, true);
document.removeEventListener('mouseout', this.documentMouseOutHandler, true);
document.removeEventListener('keydown', this.keyboardHandler);
// Clear selections
this.clearSelection();
// Remove UI elements
if (this.toolbar) {
this.toolbar.remove();
this.toolbar = null;
}
if (this.markdownPreviewModal) {
this.markdownPreviewModal.destroy();
this.markdownPreviewModal = null;
}
// Remove hover styles
document.querySelectorAll('.c4ai-hover-candidate').forEach(el => {
el.classList.remove('c4ai-hover-candidate');
});
// Notify background script (with error handling)
try {
if (chrome.runtime && chrome.runtime.sendMessage) {
chrome.runtime.sendMessage({
action: 'c2cDeactivated'
});
}
} catch (error) {
// Extension context might be invalidated, ignore the error
console.log('Markdown Extraction deactivated (extension context unavailable)');
}
}
}
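// Usage sketch: the tool is self-initializing. Constructing it builds the
// floating toolbar and wires the Ctrl/Cmd+Click selection handlers;
// deactivate() removes the listeners and all injected UI again.
//
//   const extractor = new MarkdownExtraction();
//   // ...user Ctrl/Cmd+Clicks elements, previews, copies markdown...
//   // extractor.deactivate();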

View File

@@ -0,0 +1,300 @@
// Shared Markdown Preview Modal Component for Crawl4AI Assistant
// Used by both SchemaBuilder and Click2CrawlBuilder
class MarkdownPreviewModal {
constructor(options = {}) {
this.modal = null;
this.markdownOptions = {
includeImages: true,
preserveTables: true,
keepCodeFormatting: true,
simplifyLayout: false,
preserveLinks: true,
addSeparators: true,
includeXPath: false,
textOnly: false,
...options
};
this.onGenerateMarkdown = null;
this.currentMarkdown = '';
}
show(generateMarkdownCallback) {
this.onGenerateMarkdown = generateMarkdownCallback;
if (!this.modal) {
this.createModal();
}
// Generate initial markdown
this.updateContent();
this.modal.style.display = 'block';
}
hide() {
if (this.modal) {
this.modal.style.display = 'none';
}
}
createModal() {
this.modal = document.createElement('div');
this.modal.className = 'c4ai-c2c-preview';
this.modal.innerHTML = `
<div class="c4ai-preview-header">
<div class="c4ai-toolbar-dots">
<span class="c4ai-dot c4ai-dot-red"></span>
<span class="c4ai-dot c4ai-dot-yellow"></span>
<span class="c4ai-dot c4ai-dot-green"></span>
</div>
<span class="c4ai-preview-title">Markdown Preview</span>
<button class="c4ai-preview-close">×</button>
</div>
<div class="c4ai-preview-options">
<label><input type="checkbox" name="textOnly"> 👁️ Visual Text Mode (As You See)</label>
<label><input type="checkbox" name="includeImages" checked> Include Images</label>
<label><input type="checkbox" name="preserveTables" checked> Preserve Tables</label>
<label><input type="checkbox" name="preserveLinks" checked> Preserve Links</label>
<label><input type="checkbox" name="keepCodeFormatting" checked> Keep Code Formatting</label>
<label><input type="checkbox" name="simplifyLayout"> Simplify Layout</label>
<label><input type="checkbox" name="addSeparators" checked> Add Separators</label>
<label><input type="checkbox" name="includeXPath"> Include XPath Headers</label>
</div>
<div class="c4ai-preview-content">
<div class="c4ai-preview-tabs">
<button class="c4ai-tab active" data-tab="preview">Preview</button>
<button class="c4ai-tab" data-tab="markdown">Markdown</button>
<button class="c4ai-wrap-toggle" title="Toggle word wrap">↔️ Wrap</button>
</div>
<div class="c4ai-preview-pane active" data-pane="preview"></div>
<div class="c4ai-preview-pane" data-pane="markdown"></div>
</div>
<div class="c4ai-preview-actions">
<button class="c4ai-download-btn">Download .md</button>
<button class="c4ai-copy-markdown-btn">Copy Markdown</button>
<button class="c4ai-cloud-btn" disabled>Send to Cloud (Coming Soon)</button>
</div>
`;
document.body.appendChild(this.modal);
// Make modal draggable
if (window.C4AI_Utils && window.C4AI_Utils.makeDraggable) {
window.C4AI_Utils.makeDraggable(this.modal);
}
// Position preview modal
this.modal.style.position = 'fixed';
this.modal.style.top = '50%';
this.modal.style.left = '50%';
this.modal.style.transform = 'translate(-50%, -50%)';
this.modal.style.zIndex = '999999';
this.setupEventListeners();
}
setupEventListeners() {
// Close button
this.modal.querySelector('.c4ai-preview-close').addEventListener('click', () => {
this.hide();
});
// Tab switching
this.modal.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.addEventListener('click', (e) => {
const tabName = e.target.dataset.tab;
this.switchTab(tabName);
});
});
// Wrap toggle
const wrapToggle = this.modal.querySelector('.c4ai-wrap-toggle');
wrapToggle.addEventListener('click', () => {
const panes = this.modal.querySelectorAll('.c4ai-preview-pane');
panes.forEach(pane => {
pane.classList.toggle('wrap');
});
wrapToggle.classList.toggle('active');
});
// Options change
this.modal.querySelectorAll('input[type="checkbox"]').forEach(checkbox => {
checkbox.addEventListener('change', async (e) => {
this.markdownOptions[e.target.name] = e.target.checked;
// Handle text-only mode dependencies
if (e.target.name === 'textOnly' && e.target.checked) {
const preserveLinksCheckbox = this.modal.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.checked = false;
preserveLinksCheckbox.disabled = true;
this.markdownOptions.preserveLinks = false;
}
const includeImagesCheckbox = this.modal.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = true;
}
} else if (e.target.name === 'textOnly' && !e.target.checked) {
// Re-enable options when text-only is disabled
const preserveLinksCheckbox = this.modal.querySelector('input[name="preserveLinks"]');
if (preserveLinksCheckbox) {
preserveLinksCheckbox.disabled = false;
}
const includeImagesCheckbox = this.modal.querySelector('input[name="includeImages"]');
if (includeImagesCheckbox) {
includeImagesCheckbox.disabled = false;
}
}
// Update markdown content
await this.updateContent();
});
});
// Action buttons
this.modal.querySelector('.c4ai-copy-markdown-btn').addEventListener('click', () => {
this.copyToClipboard();
});
this.modal.querySelector('.c4ai-download-btn').addEventListener('click', () => {
this.downloadMarkdown();
});
}
switchTab(tabName) {
// Update active tab
this.modal.querySelectorAll('.c4ai-tab').forEach(tab => {
tab.classList.toggle('active', tab.dataset.tab === tabName);
});
// Update active pane
this.modal.querySelectorAll('.c4ai-preview-pane').forEach(pane => {
pane.classList.toggle('active', pane.dataset.pane === tabName);
});
}
async updateContent() {
if (!this.onGenerateMarkdown) return;
try {
// Generate markdown with current options
this.currentMarkdown = await this.onGenerateMarkdown(this.markdownOptions);
// Update markdown pane
const markdownPane = this.modal.querySelector('[data-pane="markdown"]');
markdownPane.innerHTML = `<pre><code>${this.escapeHtml(this.currentMarkdown)}</code></pre>`;
// Update preview pane
const previewPane = this.modal.querySelector('[data-pane="preview"]');
// Use marked.js if available
if (window.marked) {
marked.setOptions({
gfm: true,
breaks: true,
tables: true,
headerIds: false,
mangle: false
});
const html = marked.parse(this.currentMarkdown);
previewPane.innerHTML = `<div class="c4ai-markdown-preview">${html}</div>`;
} else {
// Fallback
previewPane.innerHTML = `<div class="c4ai-markdown-preview"><pre>${this.escapeHtml(this.currentMarkdown)}</pre></div>`;
}
} catch (error) {
console.error('Error generating markdown:', error);
this.showNotification('Error generating markdown', 'error');
}
}
async copyToClipboard() {
try {
await navigator.clipboard.writeText(this.currentMarkdown);
this.showNotification('Markdown copied to clipboard!');
} catch (err) {
console.error('Failed to copy:', err);
this.showNotification('Failed to copy. Please try again.', 'error');
}
}
async downloadMarkdown() {
const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
const filename = `crawl4ai-export-${timestamp}.md`;
// Create blob and download
const blob = new Blob([this.currentMarkdown], { type: 'text/markdown' });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = filename;
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
URL.revokeObjectURL(url);
this.showNotification(`Downloaded ${filename}`);
}
showNotification(message, type = 'success') {
const notification = document.createElement('div');
notification.className = `c4ai-notification c4ai-notification-${type}`;
notification.textContent = message;
document.body.appendChild(notification);
// Animate in
setTimeout(() => notification.classList.add('show'), 10);
// Remove after 3 seconds
setTimeout(() => {
notification.classList.remove('show');
setTimeout(() => notification.remove(), 300);
}, 3000);
}
escapeHtml(unsafe) {
return unsafe
.replace(/&/g, "&amp;")
.replace(/</g, "&lt;")
.replace(/>/g, "&gt;")
.replace(/"/g, "&quot;")
.replace(/'/g, "&#039;");
}
// Get current options
getOptions() {
return { ...this.markdownOptions };
}
// Update options programmatically
setOptions(options) {
this.markdownOptions = { ...this.markdownOptions, ...options };
// Update checkboxes to reflect new options
Object.entries(options).forEach(([key, value]) => {
const checkbox = this.modal?.querySelector(`input[name="${key}"]`);
if (checkbox && typeof value === 'boolean') {
checkbox.checked = value;
}
});
}
// Cleanup
destroy() {
if (this.modal) {
this.modal.remove();
this.modal = null;
}
this.onGenerateMarkdown = null;
}
}
// Export for use in other scripts
if (typeof window !== 'undefined') {
window.MarkdownPreviewModal = MarkdownPreviewModal;
}
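// Usage sketch: the modal is driven entirely through a markdown-generating
// callback, which it re-invokes whenever an option checkbox changes.
// someConverter and selectedElements below are hypothetical placeholders.
//
//   const modal = new MarkdownPreviewModal({ textOnly: false });
//   modal.show(async (options) => {
//     return await someConverter.convert(selectedElements, options);
//   });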

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,253 @@
// Shared utilities for Crawl4AI Chrome Extension
// Make element draggable by its titlebar
function makeDraggable(element) {
let isDragging = false;
let startX, startY, initialX, initialY;
// Match the titlebar variants used across the extension UIs (the preview
// modal calls makeDraggable but uses a .c4ai-preview-header titlebar)
const titlebar = element.querySelector('.c4ai-toolbar-titlebar, .c4ai-titlebar, .c4ai-preview-header');
if (!titlebar) return;
titlebar.addEventListener('mousedown', (e) => {
// Don't drag if clicking on buttons
if (e.target.classList.contains('c4ai-dot') || e.target.closest('button')) return;
isDragging = true;
startX = e.clientX;
startY = e.clientY;
const rect = element.getBoundingClientRect();
initialX = rect.left;
initialY = rect.top;
element.style.transition = 'none';
titlebar.style.cursor = 'grabbing';
});
document.addEventListener('mousemove', (e) => {
if (!isDragging) return;
const deltaX = e.clientX - startX;
const deltaY = e.clientY - startY;
element.style.left = `${initialX + deltaX}px`;
element.style.top = `${initialY + deltaY}px`;
element.style.right = 'auto';
});
document.addEventListener('mouseup', () => {
if (isDragging) {
isDragging = false;
element.style.transition = '';
titlebar.style.cursor = 'grab';
}
});
}
// Make element draggable by a specific header element
function makeDraggableByHeader(element) {
let isDragging = false;
let startX, startY, initialX, initialY;
// Match any of the draggable header variants used by the extension UIs;
// the Markdown Extraction toolbar and preview panel use .c4ai-toolbar-header
// and .c4ai-preview-header rather than .c4ai-debugger-header
const header = element.querySelector('.c4ai-debugger-header, .c4ai-toolbar-header, .c4ai-preview-header');
if (!header) return;
header.addEventListener('mousedown', (e) => {
// Don't drag if clicking on close button
if (e.target.id === 'c4ai-close-debugger' || e.target.closest('#c4ai-close-debugger')) return;
isDragging = true;
startX = e.clientX;
startY = e.clientY;
const rect = element.getBoundingClientRect();
initialX = rect.left;
initialY = rect.top;
element.style.transition = 'none';
header.style.cursor = 'grabbing';
});
document.addEventListener('mousemove', (e) => {
if (!isDragging) return;
const deltaX = e.clientX - startX;
const deltaY = e.clientY - startY;
element.style.left = `${initialX + deltaX}px`;
element.style.top = `${initialY + deltaY}px`;
element.style.right = 'auto';
});
document.addEventListener('mouseup', () => {
if (isDragging) {
isDragging = false;
element.style.transition = '';
header.style.cursor = 'grab';
}
});
}
// Escape HTML for safe display
function escapeHtml(text) {
const div = document.createElement('div');
div.textContent = text;
return div.innerHTML;
}
// Apply syntax highlighting to Python code
function applySyntaxHighlighting(codeElement) {
const code = codeElement.textContent;
// Split by lines to handle line-by-line
const lines = code.split('\n');
const highlightedLines = lines.map(line => {
let highlightedLine = escapeHtml(line);
// Skip if line is empty
if (!highlightedLine.trim()) return highlightedLine;
// Comments (lines starting with #)
if (highlightedLine.trim().startsWith('#')) {
return `<span class="c4ai-comment">${highlightedLine}</span>`;
}
// Triple quoted strings
if (highlightedLine.includes('"""')) {
highlightedLine = highlightedLine.replace(/(""".*?""")/g, '<span class="c4ai-string">$1</span>');
}
// Regular strings - single and double quotes
highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');
// Keywords - only highlight if not inside a string
const keywords = ['import', 'from', 'async', 'def', 'await', 'try', 'except', 'with', 'as', 'for', 'if', 'else', 'elif', 'return', 'print', 'open', 'and', 'or', 'not', 'in', 'is', 'class', 'self', 'None', 'True', 'False', '__name__', '__main__'];
keywords.forEach(keyword => {
// Use word boundaries and lookahead/lookbehind to ensure we're not in a string
const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
});
// Functions (word followed by parenthesis)
highlightedLine = highlightedLine.replace(/\b([a-zA-Z_]\w*)\s*\(/g, '<span class="c4ai-function">$1</span>(');
return highlightedLine;
});
codeElement.innerHTML = highlightedLines.join('\n');
}
// Apply syntax highlighting to JavaScript code
function applySyntaxHighlightingJS(codeElement) {
const code = codeElement.textContent;
// Split by lines to handle line-by-line
const lines = code.split('\n');
const highlightedLines = lines.map(line => {
let highlightedLine = escapeHtml(line);
// Skip if line is empty
if (!highlightedLine.trim()) return highlightedLine;
// Comments
if (highlightedLine.trim().startsWith('//')) {
return `<span class="c4ai-comment">${highlightedLine}</span>`;
}
// Multi-line comments
highlightedLine = highlightedLine.replace(/(\/\*.*?\*\/)/g, '<span class="c4ai-comment">$1</span>');
// Template literals
highlightedLine = highlightedLine.replace(/(`[^`]*`)/g, '<span class="c4ai-string">$1</span>');
// Regular strings - single and double quotes
highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');
// Keywords
const keywords = ['const', 'let', 'var', 'function', 'async', 'await', 'if', 'else', 'for', 'while', 'do', 'switch', 'case', 'break', 'continue', 'return', 'try', 'catch', 'finally', 'throw', 'new', 'this', 'class', 'extends', 'import', 'export', 'default', 'from', 'null', 'undefined', 'true', 'false'];
keywords.forEach(keyword => {
const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
});
// Functions and methods
highlightedLine = highlightedLine.replace(/\b([a-zA-Z_$][\w$]*)\s*\(/g, '<span class="c4ai-function">$1</span>(');
// Numbers
highlightedLine = highlightedLine.replace(/\b(\d+)\b/g, '<span class="c4ai-number">$1</span>');
return highlightedLine;
});
codeElement.innerHTML = highlightedLines.join('\n');
}
// Get element selector
function getElementSelector(element) {
// Priority: ID > unique class > tag with position
if (element.id) {
return `#${element.id}`;
}
if (element.className && typeof element.className === 'string') {
const classes = element.className.split(' ').filter(c => c && !c.startsWith('c4ai-'));
if (classes.length > 0) {
const selector = `.${classes[0]}`;
if (document.querySelectorAll(selector).length === 1) {
return selector;
}
}
}
// Build a path selector
const path = [];
let current = element;
while (current && current !== document.body) {
const tagName = current.tagName.toLowerCase();
const parent = current.parentElement;
if (parent) {
const siblings = Array.from(parent.children);
const index = siblings.indexOf(current) + 1;
if (siblings.filter(s => s.tagName === current.tagName).length > 1) {
path.unshift(`${tagName}:nth-child(${index})`);
} else {
path.unshift(tagName);
}
} else {
path.unshift(tagName);
}
current = parent;
}
return path.join(' > ');
}
// Check if element is part of our extension UI
function isOurElement(element) {
return element.classList.contains('c4ai-highlight-box') ||
element.classList.contains('c4ai-toolbar') ||
element.closest('.c4ai-toolbar') ||
element.classList.contains('c4ai-script-toolbar') ||
element.closest('.c4ai-script-toolbar') ||
element.closest('.c4ai-field-dialog') ||
element.closest('.c4ai-code-modal') ||
element.closest('.c4ai-wait-dialog') ||
element.closest('.c4ai-timeline-modal');
}
// Export utilities
window.C4AI_Utils = {
makeDraggable,
makeDraggableByHeader,
escapeHtml,
applySyntaxHighlighting,
applySyntaxHighlightingJS,
getElementSelector,
isOurElement
};
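// Usage sketch: consumers pull helpers off the shared namespace, e.g.
//
//   const { getElementSelector, applySyntaxHighlightingJS } = window.C4AI_Utils;
//   applySyntaxHighlightingJS(document.querySelector('pre code'));
//   console.log(getElementSelector(document.querySelector('.price')));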

Binary file not shown (image added, 3.4 KiB)

Binary file not shown (image added, 1.6 KiB)

Binary file not shown (image added, 1.6 KiB)

Binary file not shown (image added, 1.6 KiB)

View File

@@ -0,0 +1,974 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Crawl4AI Assistant - Chrome Extension for Visual Web Scraping</title>
<link rel="stylesheet" href="assistant.css">
</head>
<body>
<div class="terminal-container">
<div class="header">
<div class="header-content">
<div class="logo-section">
<img src="../../img/favicon-32x32.png" alt="Crawl4AI Logo" class="logo">
<div>
<h1>Crawl4AI Assistant</h1>
<p class="tagline">Chrome Extension for Visual Web Scraping</p>
</div>
</div>
<nav class="nav-links">
<a href="../../" class="nav-link">← Back to Docs</a>
<a href="../" class="nav-link">All Apps</a>
<a href="https://github.com/unclecode/crawl4ai" class="nav-link" target="_blank">GitHub</a>
</nav>
</div>
</div>
<div class="content">
<!-- Video Section -->
<section class="video-section">
<div class="video-wrapper">
<video autoplay loop muted playsinline class="demo-video">
<source src="demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
</section>
<!-- Cloud Announcement Banner -->
<section class="cloud-banner-section">
<div class="cloud-banner">
<div class="cloud-banner-content">
<div class="cloud-banner-text">
<h3>You don't need Puppeteer. You need Crawl4AI Cloud.</h3>
<p>One API call. JS-rendered. No browser cluster to maintain.</p>
</div>
<button class="cloud-banner-btn" id="joinWaitlistBanner">
Get API Key →
</button>
</div>
</div>
</section>
<!-- Introduction -->
<section class="intro-section">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">About Crawl4AI Assistant</span>
</div>
<div class="terminal-content">
<p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides three powerful tools for web scraping and data extraction.</p>
<div style="background: #0fbbaa; color: #070708; padding: 12px 16px; border-radius: 8px; margin: 16px 0; font-weight: 600;">
🎉 NEW: Click2Crawl extracts data INSTANTLY without any LLM! Test your schema and see JSON results immediately in the browser!
</div>
<div class="features-grid">
<div class="feature-card">
<span class="feature-icon">🎯</span>
<h3>Click2Crawl</h3>
<p>Visual data extraction - click elements to build schemas instantly!</p>
</div>
<div class="feature-card">
<span class="feature-icon">🔴</span>
<h3>Script Builder <span style="color: #f380f5; font-size: 0.75rem;">(Alpha)</span></h3>
<p>Record browser actions to create automation scripts</p>
</div>
<div class="feature-card">
<span class="feature-icon">📝</span>
<h3>Markdown Extraction <span style="color: #0fbbaa; font-size: 0.75rem;">(New!)</span></h3>
<p>Convert any webpage content to clean markdown with Visual Text Mode</p>
</div>
<!-- <div class="feature-card">
<span class="feature-icon">🐍</span>
<h3>Python Code</h3>
<p>Get production-ready Crawl4AI code instantly</p>
</div> -->
</div>
</div>
</div>
</section>
<!-- Quick Start -->
<section class="quickstart-section">
<h2>Quick Start</h2>
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">Installation</span>
</div>
<div class="terminal-content">
<div class="installation-steps">
<div class="step">
<span class="step-number">1</span>
<div class="step-content">
<h4>Download the Extension</h4>
<p>Get the latest release from GitHub or use the button below</p>
<a href="crawl4ai-assistant-v1.3.0.zip" class="download-button" download>
<span class="button-icon"></span>
Download Extension (v1.3.0)
</a>
</div>
</div>
<div class="step">
<span class="step-number">2</span>
<div class="step-content">
<h4>Load in Chrome</h4>
<p>Navigate to <code>chrome://extensions/</code> and enable Developer Mode</p>
</div>
</div>
<div class="step">
<span class="step-number">3</span>
<div class="step-content">
<h4>Load Unpacked</h4>
<p>Click "Load unpacked" and select the extracted extension folder</p>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Interactive Tools Section -->
<section class="interactive-tools">
<h2>Explore Our Tools</h2>
<div class="tools-container">
<!-- Left Panel - Tool Selector -->
<div class="tools-panel">
<div class="tool-selector active" data-tool="click2crawl">
<div class="tool-icon">🎯</div>
<div class="tool-info">
<h3>Click2Crawl</h3>
<p>Visual data extraction</p>
</div>
<div class="tool-status">Available</div>
</div>
<div class="tool-selector" data-tool="script-builder">
<div class="tool-icon">🔴</div>
<div class="tool-info">
<h3>Script Builder</h3>
<p>Browser automation</p>
</div>
<div class="tool-status alpha">Alpha</div>
</div>
<div class="tool-selector" data-tool="markdown-extraction">
<div class="tool-icon">📝</div>
<div class="tool-info">
<h3>Markdown Extraction</h3>
<p>Content to markdown</p>
</div>
<div class="tool-status new">New!</div>
</div>
</div>
<!-- Right Panel - Tool Details -->
<div class="tool-details">
<!-- Click2Crawl Details -->
<div class="tool-content active" id="click2crawl">
<div class="tool-header">
<h3>🎯 Click2Crawl</h3>
<span class="tool-tagline">Click elements to build extraction schemas - No LLM needed!</span>
</div>
<div class="tool-steps">
<div class="step-item">
<div class="step-number">1</div>
<div class="step-content">
<h4>Select Container</h4>
<p>Click on any repeating element like product cards or articles. Use up/down navigation to fine-tune selection!</p>
<div class="step-visual">
<span class="highlight-green"></span> Container highlighted in green
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Click Fields to Extract</h4>
<p>Click on data fields inside the container - choose text, links, images, or attributes</p>
<div class="step-visual">
<span class="highlight-pink"></span> Fields highlighted in pink
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Test & Extract Data Instantly!</h4>
<p>🎉 Click "Test Schema" to see extracted JSON immediately - no LLM or coding required!</p>
<div class="step-visual">
<span class="highlight-accent"></span> See extracted JSON immediately
</div>
</div>
</div>
</div>
<div class="tool-features">
<div class="feature-tag">🚀 Zero LLM dependency</div>
<div class="feature-tag">📊 Instant JSON extraction</div>
<div class="feature-tag">🎯 Visual element selection</div>
<div class="feature-tag">🐍 Export Python code</div>
<div class="feature-tag">✨ Live preview</div>
<div class="feature-tag">📥 Download results</div>
<div class="feature-tag">📝 Export to markdown</div>
</div>
</div>
<!-- Script Builder Details -->
<div class="tool-content" id="script-builder">
<div class="tool-header">
<h3>🔴 Script Builder</h3>
<span class="tool-tagline">Record actions, generate automation</span>
</div>
<div class="tool-steps">
<div class="step-item">
<div class="step-number">1</div>
<div class="step-content">
<h4>Hit Record</h4>
<p>Start capturing your browser interactions</p>
<div class="step-visual">
<span class="recording-dot"></span> Recording indicator
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Interact Naturally</h4>
<p>Click, type, scroll - everything is captured</p>
<div class="step-visual">
<span class="action-icon">🖱️</span> <span class="action-icon">⌨️</span> <span class="action-icon">📜</span>
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Export Script</h4>
<p>Get JavaScript for Crawl4AI's js_code parameter</p>
<div class="step-visual">
<span class="highlight-accent">📝</span> Automation ready
</div>
</div>
</div>
</div>
<div class="tool-features">
<div class="feature-tag">Smart action grouping</div>
<div class="feature-tag">Wait detection</div>
<div class="feature-tag">Keyboard shortcuts</div>
<div class="feature-tag alpha-tag">Alpha version</div>
</div>
</div>
<!-- Markdown Extraction Details -->
<div class="tool-content" id="markdown-extraction">
<div class="tool-header">
<h3>📝 Markdown Extraction</h3>
<span class="tool-tagline">Convert webpage content to clean markdown "as you see"</span>
</div>
<div class="tool-steps">
<div class="step-item">
<div class="step-number">1</div>
<div class="step-content">
<h4>Ctrl/Cmd + Click</h4>
<p>Hold Ctrl/Cmd and click multiple elements you want to extract</p>
<div class="step-visual">
<span class="highlight-green">🔢</span> Numbered selection badges
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">2</div>
<div class="step-content">
<h4>Enable Visual Text Mode</h4>
<p>Extract content "as you see" - clean text without complex HTML structures</p>
<div class="step-visual">
<span class="highlight-accent">👁️</span> Visual Text Mode (As You See)
</div>
</div>
</div>
<div class="step-item">
<div class="step-number">3</div>
<div class="step-content">
<h4>Export Clean Markdown</h4>
<p>Get beautifully formatted markdown ready for documentation or LLMs</p>
<div class="step-visual">
<span class="highlight-pink">📄</span> Clean, readable output
</div>
</div>
</div>
</div>
<div class="tool-features">
<div class="feature-tag">Multi-select with Ctrl/Cmd</div>
<div class="feature-tag">Visual Text Mode (As You See)</div>
<div class="feature-tag">Clean markdown output</div>
<div class="feature-tag">Export to Crawl4AI Cloud (soon)</div>
</div>
</div>
</div>
</div>
</section>
<!-- Interactive Code Examples -->
<section class="code-showcase">
<h2>See the Generated Code & Extracted Data</h2>
<div class="code-tabs">
<button class="code-tab active" data-example="schema">🎯 Click2Crawl</button>
<button class="code-tab" data-example="script">🔴 Script Builder</button>
<button class="code-tab" data-example="markdown">📝 Markdown Extraction</button>
</div>
<div class="code-examples">
<!-- Click2Crawl Code -->
<div class="code-example active" id="code-schema">
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
<!-- Python Code -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">click2crawl_extraction.py</span>
<button class="copy-button" data-code="schema-python">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">#!/usr/bin/env python3</span>
<span class="comment">"""
🎉 NO LLM NEEDED! Direct extraction with CSS selectors
Generated by Crawl4AI Chrome Extension - Click2Crawl
"""</span>
<span class="keyword">import</span> asyncio
<span class="keyword">import</span> json
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai.extraction_strategy <span class="keyword">import</span> JsonCssExtractionStrategy
<span class="comment"># The EXACT schema from Click2Crawl - no guessing!</span>
EXTRACTION_SCHEMA = {
<span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
<span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>, <span class="comment"># The container you selected</span>
<span class="string">"fields"</span>: [
{
<span class="string">"name"</span>: <span class="string">"title"</span>,
<span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"price"</span>,
<span class="string">"selector"</span>: <span class="string">"span.price"</span>,
<span class="string">"type"</span>: <span class="string">"text"</span>
},
{
<span class="string">"name"</span>: <span class="string">"image"</span>,
<span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
<span class="string">"type"</span>: <span class="string">"attribute"</span>,
<span class="string">"attribute"</span>: <span class="string">"src"</span>
},
{
<span class="string">"name"</span>: <span class="string">"link"</span>,
<span class="string">"selector"</span>: <span class="string">"a.product-link"</span>,
<span class="string">"type"</span>: <span class="string">"attribute"</span>,
<span class="string">"attribute"</span>: <span class="string">"href"</span>
}
]
}
<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_data</span>(url: str):
<span class="comment"># Direct extraction - no LLM API calls!</span>
extraction_strategy = JsonCssExtractionStrategy(schema=EXTRACTION_SCHEMA)
<span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
result = <span class="keyword">await</span> crawler.arun(
url=url,
config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
)
<span class="keyword">if</span> result.success:
data = json.loads(result.extracted_content)
<span class="keyword">print</span>(<span class="string">f"✅ Extracted {len(data)} items instantly!"</span>)
<span class="comment"># Save to file</span>
<span class="keyword">with</span> open(<span class="string">'products.json'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:
json.dump(data, f, indent=2)
<span class="keyword">return</span> data
<span class="comment"># Run extraction on any similar page!</span>
data = asyncio.run(extract_data(<span class="string">"https://example.com/products"</span>))
<span class="comment"># 🎯 Result: Clean JSON data, no LLM costs, instant results!</span></code></pre>
</div>
</div>
<!-- Extracted JSON Data -->
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">extracted_data.json</span>
<button class="copy-button" data-code="schema-json">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment">// 🎉 Instantly extracted from the page - no coding required!</span>
[
{
<span class="string">"title"</span>: <span class="string">"Wireless Bluetooth Headphones"</span>,
<span class="string">"price"</span>: <span class="string">"$79.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/headphones-bt-01.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/wireless-bluetooth-headphones"</span>
},
{
<span class="string">"title"</span>: <span class="string">"Smart Watch Pro 2024"</span>,
<span class="string">"price"</span>: <span class="string">"$299.00"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/smartwatch-pro.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/smart-watch-pro-2024"</span>
},
{
<span class="string">"title"</span>: <span class="string">"4K Webcam for Streaming"</span>,
<span class="string">"price"</span>: <span class="string">"$149.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/webcam-4k.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/4k-webcam-streaming"</span>
},
{
<span class="string">"title"</span>: <span class="string">"Mechanical Gaming Keyboard RGB"</span>,
<span class="string">"price"</span>: <span class="string">"$129.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/keyboard-gaming.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/mechanical-gaming-keyboard"</span>
},
{
<span class="string">"title"</span>: <span class="string">"USB-C Hub 7-in-1"</span>,
<span class="string">"price"</span>: <span class="string">"$45.99"</span>,
<span class="string">"image"</span>: <span class="string">"https://example.com/images/usbc-hub.jpg"</span>,
<span class="string">"link"</span>: <span class="string">"/products/usb-c-hub-7in1"</span>
}
]</code></pre>
</div>
</div>
</div>
</div>
<!-- Script Builder Code -->
<div class="code-example" id="code-script">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">automation_script.py</span>
<button class="copy-button" data-code="script">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="keyword">import</span> asyncio
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, CrawlerRunConfig
<span class="comment"># JavaScript generated from your recorded actions</span>
js_script = <span class="string">"""
// Search for products
document.querySelector('button.search-toggle').click();
await new Promise(r => setTimeout(r, 500));
// Type search query
const searchInput = document.querySelector('input#search');
searchInput.value = 'wireless headphones';
searchInput.dispatchEvent(new Event('input', {bubbles: true}));
// Submit search
searchInput.dispatchEvent(new KeyboardEvent('keydown', {
key: 'Enter', keyCode: 13, bubbles: true
}));
// Wait for results
await new Promise(r => setTimeout(r, 2000));
// Click first product
document.querySelector('.product-item:first-child').click();
// Wait for product page
await new Promise(r => setTimeout(r, 1000));
// Add to cart
document.querySelector('button.add-to-cart').click();
"""</span>
<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">automate_shopping</span>():
config = CrawlerRunConfig(
js_code=js_script,
wait_for=<span class="string">"css:.cart-confirmation"</span>,
screenshot=<span class="keyword">True</span>
)
<span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
result = <span class="keyword">await</span> crawler.arun(
url=<span class="string">"https://shop.example.com"</span>,
config=config
)
<span class="keyword">print</span>(<span class="string">f"✓ Automation complete: {result.url}"</span>)
<span class="keyword">return</span> result
asyncio.run(automate_shopping())</code></pre>
</div>
</div>
</div>
<!-- Markdown Extraction Output -->
<div class="code-example" id="code-markdown">
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">extracted_content.md</span>
<button class="copy-button" data-code="markdown">Copy</button>
</div>
<div class="terminal-content">
<pre><code><span class="comment"># Extracted from Hacker News with Visual Text Mode 👁️</span>
<span class="string">1. **Show HN: I built a tool to find and reach out to YouTubers** (hellosimply.io)
84 points by erickim 2 hours ago | hide | 31 comments
2. **The 24 Hour Restaurant** (logicmag.io)
124 points by helsinkiandrew 5 hours ago | hide | 52 comments
3. **Building a Better Bloom Filter in Rust** (carlmastrangelo.com)
89 points by carlmastrangelo 3 hours ago | hide | 27 comments
---
### Article: The 24 Hour Restaurant
In New York City, the 24-hour restaurant is becoming extinct. What we lose when we can no longer eat whenever we want.
When I first moved to New York, I loved that I could get a full meal at 3 AM. Not just pizza or fast food, but a proper sit-down dinner with table service and a menu that ran for pages. The city that never sleeps had restaurants that matched its rhythm.
Today, finding a 24-hour restaurant in Manhattan requires genuine effort. The pandemic accelerated a decline that was already underway, but the roots go deeper: rising rents, changing labor laws, and shifting cultural patterns have all contributed to the death of round-the-clock dining.
---
### Product Review: Framework Laptop 16
**Specifications:**
- Display: 16" 2560×1600 165Hz
- Processor: AMD Ryzen 7 7840HS
- Memory: 32GB DDR5-5600
- Storage: 2TB NVMe Gen4
- Price: Starting at $1,399
**Pros:**
- Fully modular and repairable
- Excellent Linux support
- Great keyboard and trackpad
- Expansion card system
**Cons:**
- Battery life could be better
- Slightly heavier than competitors
- Fan noise under load</span></code></pre>
</div>
</div>
</div>
</div>
</section>
<!-- Crawl4AI Cloud Section -->
<section class="cloud-section">
<div class="cloud-announcement">
<h2>Crawl4AI Cloud</h2>
<p class="cloud-tagline">Your browser cluster without the cluster.</p>
<div class="cloud-features-preview">
<div class="cloud-feature-item">
⚡ POST /crawl
</div>
<div class="cloud-feature-item">
🌐 JS-rendered pages
</div>
<div class="cloud-feature-item">
📊 Schema extraction built-in
</div>
<div class="cloud-feature-item">
💰 $0.001/page
</div>
</div>
<button class="cloud-cta-button" id="joinWaitlist">
Get Early Access →
</button>
<p class="cloud-hint">See it extract your own data. Right now.</p>
</div>
<!-- Hidden Signup Form -->
<div class="signup-overlay" id="signupOverlay">
<div class="signup-container" id="signupContainer">
<button class="close-signup" id="closeSignup">×</button>
<div class="signup-content" id="signupForm">
<h3>🚀 Join C4AI Cloud Waiting List</h3>
<p>Be among the first to experience the future of web scraping</p>
<form id="waitlistForm" class="waitlist-form">
<div class="form-field">
<label for="userName">Your Name</label>
<input type="text" id="userName" name="name" placeholder="John Doe" required>
</div>
<div class="form-field">
<label for="userEmail">Email Address</label>
<input type="email" id="userEmail" name="email" placeholder="john@example.com" required>
</div>
<div class="form-field">
<label for="userCompany">Company (Optional)</label>
<input type="text" id="userCompany" name="company" placeholder="Acme Inc.">
</div>
<div class="form-field">
<label for="useCase">What will you use Crawl4AI Cloud for?</label>
<select id="useCase" name="useCase">
<option value="">Select use case...</option>
<option value="price-monitoring">Price Monitoring</option>
<option value="news-aggregation">News Aggregation</option>
<option value="market-research">Market Research</option>
<option value="ai-training">AI Training Data</option>
<option value="other">Other</option>
</select>
</div>
<button type="submit" class="submit-button">
<span>🎯</span> Submit & Watch the Magic
</button>
</form>
</div>
<!-- Crawling Animation -->
<div class="crawl-animation" id="crawlAnimation" style="display: none;">
<div class="terminal-window crawl-terminal">
<div class="terminal-header">
<span class="terminal-title">Crawl4AI Cloud Demo</span>
</div>
<div class="terminal-content">
<pre id="crawlOutput" class="crawl-log"><code>$ crawl4ai cloud extract --url "signup-form" --auto-detect</code></pre>
</div>
</div>
<div class="extracted-preview" id="extractedPreview" style="display: none;">
<h4>📊 Extracted Data</h4>
<pre class="json-preview"><code id="jsonOutput"></code></pre>
</div>
<div class="success-message" id="successMessage" style="display: none;">
<div class="success-icon"></div>
<h3>Data Uploaded Successfully!</h3>
<p>You're on the Crawl4AI Cloud waiting list!</p>
<p>What you just witnessed:</p>
<ul>
<li>⚡ Real-time extraction of your form data</li>
<li>🔄 Automatic schema detection</li>
<li>📤 Instant cloud processing</li>
<li>✨ No code required - just like that!</li>
</ul>
<p class="success-note">We'll notify you at <strong id="userEmailDisplay"></strong> when Crawl4AI Cloud launches!</p>
<button class="continue-button" id="continueBtn">Continue Exploring</button>
</div>
</div>
</div>
</div>
</section>
<!-- Coming Soon Section -->
<section class="coming-soon-section">
<h2>More Features Coming Soon</h2>
<div class="terminal-window">
<div class="terminal-header">
<span class="terminal-title">Roadmap</span>
</div>
<div class="terminal-content">
<p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features:</p>
<div class="coming-features">
<div class="coming-feature">
<div class="feature-header">
<span class="feature-badge">Direct</span>
<h3>Direct Data Download</h3>
</div>
<p>Skip the code generation entirely! Download extracted data directly from Click2Crawl as JSON or CSV files.</p>
<div class="feature-preview">
<code>📊 One-click download • No Python needed • Multiple export formats</code>
</div>
</div>
<div class="coming-feature">
<div class="feature-header">
<span class="feature-badge">AI</span>
<h3>Smart Field Detection</h3>
</div>
<p>AI-powered field detection for Click2Crawl that automatically suggests the most likely data fields on any page.</p>
<div class="feature-preview">
<code>🤖 Auto-detect fields • Smart naming • Pattern recognition</code>
</div>
</div>
</div>
<div class="stay-tuned">
<p>🚀 Stay tuned for updates! Follow our <a href="https://github.com/unclecode/crawl4ai" target="_blank">GitHub</a> for the latest releases.</p>
</div>
</div>
</div>
</section>
<!-- Footer -->
<footer class="footer">
<div class="footer-content">
<div class="footer-section">
<h4>Resources</h4>
<ul>
<li><a href="https://github.com/unclecode/crawl4ai">GitHub Repository</a></li>
<li><a href="../../">Documentation</a></li>
<li><a href="https://discord.gg/jP8KfhDhyN">Discord Community</a></li>
</ul>
</div>
<div class="footer-section">
<h4>Connect</h4>
<ul>
<li><a href="https://twitter.com/unclecode">Twitter @unclecode</a></li>
<li><a href="https://github.com/unclecode">GitHub @unclecode</a></li>
</ul>
</div>
</div>
<div class="footer-bottom">
<p>Made with 🚀 by the Crawl4AI team</p>
</div>
</footer>
</div>
</div>
<script>
// Tool Selector Interaction
document.querySelectorAll('.tool-selector').forEach(selector => {
selector.addEventListener('click', function() {
// Remove active class from all selectors
document.querySelectorAll('.tool-selector').forEach(s => s.classList.remove('active'));
document.querySelectorAll('.tool-content').forEach(c => c.classList.remove('active'));
// Add active class to clicked selector
this.classList.add('active');
// Show corresponding content
const toolId = this.getAttribute('data-tool');
const contentElement = document.getElementById(toolId);
if (contentElement) {
contentElement.classList.add('active');
}
});
});
// Code Tab Interaction
document.querySelectorAll('.code-tab').forEach(tab => {
tab.addEventListener('click', function() {
// Remove active class from all tabs
document.querySelectorAll('.code-tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.code-example').forEach(e => e.classList.remove('active'));
// Add active class to clicked tab
this.classList.add('active');
// Show corresponding code
const exampleId = this.getAttribute('data-example');
document.getElementById('code-' + exampleId).classList.add('active');
});
});
// Copy Button Functionality
document.querySelectorAll('.copy-button').forEach(button => {
button.addEventListener('click', async function() {
const codeType = this.getAttribute('data-code');
let codeText = '';
// Handle different code types
if (codeType === 'schema-python') {
const codeElement = document.querySelector('#code-schema .terminal-window:first-child pre code');
codeText = codeElement.textContent;
} else if (codeType === 'schema-json') {
const codeElement = document.querySelector('#code-schema .terminal-window:last-child pre code');
codeText = codeElement.textContent;
} else {
const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
codeText = codeElement.textContent;
}
try {
await navigator.clipboard.writeText(codeText);
this.textContent = 'Copied!';
this.classList.add('copied');
setTimeout(() => {
this.textContent = 'Copy';
this.classList.remove('copied');
}, 2000);
} catch (err) {
console.error('Failed to copy code:', err);
}
});
});
// Crawl4AI Cloud Interactive Demo
const joinWaitlistBtn = document.getElementById('joinWaitlist');
const signupOverlay = document.getElementById('signupOverlay');
const closeSignupBtn = document.getElementById('closeSignup');
const waitlistForm = document.getElementById('waitlistForm');
const signupForm = document.getElementById('signupForm');
const crawlAnimation = document.getElementById('crawlAnimation');
const crawlOutput = document.getElementById('crawlOutput');
const extractedPreview = document.getElementById('extractedPreview');
const jsonOutput = document.getElementById('jsonOutput');
const successMessage = document.getElementById('successMessage');
const continueBtn = document.getElementById('continueBtn');
const userEmailDisplay = document.getElementById('userEmailDisplay');
// Open signup modal
joinWaitlistBtn.addEventListener('click', () => {
signupOverlay.classList.add('active');
});
// Banner button
const joinWaitlistBannerBtn = document.getElementById('joinWaitlistBanner');
if (joinWaitlistBannerBtn) {
joinWaitlistBannerBtn.addEventListener('click', () => {
signupOverlay.classList.add('active');
});
}
// Close signup modal
closeSignupBtn.addEventListener('click', () => {
signupOverlay.classList.remove('active');
});
// Close on overlay click
signupOverlay.addEventListener('click', (e) => {
if (e.target === signupOverlay) {
signupOverlay.classList.remove('active');
}
});
// Continue button
if (continueBtn) {
continueBtn.addEventListener('click', () => {
signupOverlay.classList.remove('active');
// Reset form for next time
waitlistForm.reset();
signupForm.style.display = 'block';
crawlAnimation.style.display = 'none';
extractedPreview.style.display = 'none';
successMessage.style.display = 'none';
});
}
// Form submission with crawling animation
waitlistForm.addEventListener('submit', async (e) => {
e.preventDefault();
// Get form data
const formData = {
name: document.getElementById('userName').value,
email: document.getElementById('userEmail').value,
company: document.getElementById('userCompany').value || 'Not specified',
useCase: document.getElementById('useCase').value || 'General web scraping',
timestamp: new Date().toISOString(),
source: 'Crawl4AI Assistant Landing Page'
};
// Update email display
userEmailDisplay.textContent = formData.email;
// Hide form and show crawling animation
signupForm.style.display = 'none';
crawlAnimation.style.display = 'block';
// Clear previous output
const codeElement = crawlOutput.querySelector('code');
codeElement.innerHTML = '$ crawl4ai cloud extract --url "signup-form" --auto-detect\n\n';
// Simulate crawling process with proper C4AI log format
const crawlSteps = [
{
log: '<span class="log-init">[INIT]....</span> → Crawl4AI Cloud 1.0.0',
time: '0.12s'
},
{
log: '<span class="log-fetch">[FETCH]...</span> ↓ https://crawl4ai.com/waitlist-form',
time: '0.45s'
},
{
log: '<span class="log-scrape">[SCRAPE]..</span> ◆ https://crawl4ai.com/waitlist-form',
time: '0.28s'
},
{
log: '<span class="log-extract">[EXTRACT].</span> ■ Extracting form data with auto-detect',
time: '0.55s'
},
{
log: '<span class="log-complete">[COMPLETE]</span> ● https://crawl4ai.com/waitlist-form',
time: '1.40s'
}
];
let stepIndex = 0;
const typeStep = async () => {
if (stepIndex < crawlSteps.length) {
const step = crawlSteps[stepIndex];
codeElement.innerHTML += step.log + ' | <span class="log-success">✓</span> | <span class="log-time">⏱: ' + step.time + '</span>\n';
stepIndex++;
// Scroll to bottom
const terminal = crawlOutput.parentElement;
terminal.scrollTop = terminal.scrollHeight;
setTimeout(typeStep, 600);
} else {
// Show extracted data
setTimeout(() => {
codeElement.innerHTML += '\n<span class="log-success">[UPLOAD]..</span> ↑ Uploading to Crawl4AI Cloud...';
setTimeout(() => {
extractedPreview.style.display = 'block';
jsonOutput.textContent = JSON.stringify(formData, null, 2);
// Add syntax highlighting
jsonOutput.innerHTML = jsonOutput.textContent
.replace(/"([^"]+)":/g, '<span class="string">"$1"</span>:')
.replace(/: "([^"]+)"/g, ': <span class="string">"$1"</span>');
codeElement.innerHTML += ' | <span class="log-success">✓</span> | <span class="log-time">⏱: 0.23s</span>\n';
codeElement.innerHTML += '\n<span class="log-success">[SUCCESS]</span> ✨ Data uploaded successfully!';
// Show success message after a delay
setTimeout(() => {
successMessage.style.display = 'block';
// Smooth scroll to bottom to show success message
setTimeout(() => {
const container = document.getElementById('signupContainer');
container.scrollTo({
top: container.scrollHeight,
behavior: 'smooth'
});
}, 100);
// Actually submit to waiting list (you can implement this)
console.log('Waitlist submission:', formData);
}, 1500);
}, 800);
}, 600);
}
};
// Start the animation
setTimeout(typeStep, 500);
});
</script>
</body>
</html>

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,54 @@
{
"manifest_version": 3,
"name": "Crawl4AI Assistant",
"version": "1.3.0",
"description": "Visual schema and script builder for Crawl4AI - Build extraction schemas and automation scripts by clicking and recording actions",
"permissions": [
"activeTab",
"storage",
"downloads"
],
"host_permissions": [
"<all_urls>"
],
"action": {
"default_popup": "popup/popup.html",
"default_icon": {
"16": "icons/icon-16.png",
"48": "icons/icon-48.png",
"128": "icons/icon-128.png"
}
},
"content_scripts": [
{
"matches": ["<all_urls>"],
"js": [
"libs/marked.min.js",
"content/shared/utils.js",
"content/markdownPreviewModal.js",
"content/click2crawl.js",
"content/scriptBuilder.js",
"content/contentAnalyzer.js",
"content/markdownConverter.js",
"content/markdownExtraction.js",
"content/content.js"
],
"css": ["content/overlay.css"],
"run_at": "document_idle"
}
],
"background": {
"service_worker": "background/service-worker.js"
},
"icons": {
"16": "icons/icon-16.png",
"48": "icons/icon-48.png",
"128": "icons/icon-128.png"
},
"web_accessible_resources": [
{
"resources": ["icons/*", "assets/*"],
"matches": ["<all_urls>"]
}
]
}

Binary file not shown.

After

Width: | Height: | Size: 3.4 KiB

Binary file not shown.

After

Width: | Height: | Size: 1.6 KiB

Binary file not shown.

After

Width: | Height: | Size: 1.6 KiB

Binary file not shown.

After

Width: | Height: | Size: 1.6 KiB

View File

@@ -0,0 +1,330 @@
/* Font Face Definitions */
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
font-display: swap;
}
:root {
--font-primary: 'Dank Mono', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, monospace;
}
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
width: 380px;
font-family: var(--font-primary);
background: #0a0a0a;
color: #e0e0e0;
border-radius: 16px;
overflow: hidden;
}
.popup-container {
padding: 20px;
}
header {
display: flex;
align-items: center;
gap: 16px;
margin-bottom: 20px;
padding-bottom: 16px;
border-bottom: 1px solid #2a2a2a;
}
.logo {
width: 48px;
height: 48px;
flex-shrink: 0;
}
.header-content {
flex: 1;
}
header h1 {
font-size: 20px;
font-weight: 600;
color: #fff;
margin: 0 0 4px 0;
}
.header-stats {
display: flex;
align-items: center;
gap: 12px;
}
.github-stars {
display: flex;
align-items: center;
gap: 6px;
color: #999;
text-decoration: none;
font-size: 13px;
transition: color 0.2s ease;
}
.github-stars:hover {
color: #4a9eff;
}
.github-icon {
flex-shrink: 0;
}
.mode-selector {
display: flex;
flex-direction: column;
gap: 12px;
margin-bottom: 20px;
}
.mode-button {
display: flex;
align-items: center;
gap: 16px;
padding: 16px;
background: #1a1a1a;
border: 2px solid #2a2a2a;
border-radius: 12px;
cursor: pointer;
transition: all 0.2s ease;
width: 100%;
text-align: left;
}
.mode-button:hover:not(:disabled) {
background: #252525;
border-color: #4a9eff;
transform: translateY(-2px);
}
.mode-button:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.mode-button .icon {
font-size: 32px;
width: 48px;
height: 48px;
display: flex;
align-items: center;
justify-content: center;
background: #252525;
border-radius: 8px;
}
.mode-button.schema .icon {
background: #1e3a5f;
}
.mode-button.script .icon {
background: #3a1e5f;
}
.mode-button.c2c .icon {
background: #1e5f3a;
}
.mode-info h3 {
font-size: 16px;
color: #fff;
margin-bottom: 4px;
}
.mode-info p {
font-size: 13px;
color: #999;
line-height: 1.4;
}
.active-session {
background: #1a1a1a;
border: 2px solid #4a9eff;
border-radius: 12px;
padding: 16px;
margin-bottom: 20px;
}
.active-session.hidden {
display: none;
}
.session-header {
display: flex;
align-items: center;
gap: 8px;
margin-bottom: 12px;
}
.status-dot {
width: 8px;
height: 8px;
background: #4a9eff;
border-radius: 50%;
animation: pulse 2s infinite;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
.session-title {
font-size: 14px;
font-weight: 600;
color: #4a9eff;
}
.session-stats {
display: flex;
gap: 20px;
margin-bottom: 16px;
padding: 12px;
background: #0a0a0a;
border-radius: 8px;
}
.stat {
flex: 1;
}
.stat-label {
display: block;
font-size: 11px;
color: #666;
text-transform: uppercase;
margin-bottom: 4px;
}
.stat-value {
font-size: 14px;
color: #fff;
font-weight: 600;
}
.session-actions {
display: flex;
gap: 8px;
}
.action-button {
flex: 1;
padding: 10px 16px;
border: none;
border-radius: 8px;
font-size: 14px;
font-weight: 600;
cursor: pointer;
transition: all 0.2s ease;
}
.action-button.primary {
background: #4a9eff;
color: #000;
}
.action-button.primary:hover:not(:disabled) {
background: #3a8eef;
transform: translateY(-1px);
}
.action-button.primary:disabled {
background: #2a4a7f;
color: #666;
cursor: not-allowed;
}
.action-button.secondary {
background: #2a2a2a;
color: #fff;
}
.action-button.secondary:hover {
background: #3a3a3a;
}
.instructions {
background: #1a1a1a;
border-radius: 12px;
padding: 16px;
margin-bottom: 16px;
}
.instructions h4 {
font-size: 14px;
margin-bottom: 12px;
color: #fff;
}
.instructions ol {
padding-left: 20px;
}
.instructions li {
font-size: 13px;
line-height: 1.6;
color: #ccc;
margin-bottom: 6px;
}
footer {
padding-top: 16px;
border-top: 1px solid #2a2a2a;
}
.social-links {
display: flex;
justify-content: center;
gap: 16px;
}
.social-link {
display: flex;
align-items: center;
gap: 6px;
color: #999;
text-decoration: none;
font-size: 12px;
transition: all 0.2s ease;
padding: 6px 12px;
border-radius: 6px;
background: #1a1a1a;
}
.social-link:hover {
color: #0fbbaa;
background: #2a2a2a;
transform: translateY(-1px);
}
.social-link svg {
width: 16px;
height: 16px;
flex-shrink: 0;
}

View File

@@ -0,0 +1,111 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<link rel="stylesheet" href="popup.css">
</head>
<body>
<div class="popup-container">
<header>
<img src="icons/icon-48.png" class="logo" alt="Crawl4AI">
<div class="header-content">
<h1>Crawl4AI Assistant</h1>
<div class="header-stats">
<a href="https://github.com/unclecode/crawl4ai" target="_blank" class="github-stars">
<svg class="github-icon" viewBox="0 0 16 16" width="16" height="16">
<path fill="currentColor" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"></path>
</svg>
<span id="stars-count">Loading...</span>
</a>
</div>
</div>
</header>
<div class="mode-selector">
<button id="schema-mode" class="mode-button schema">
<div class="icon">🎯</div>
<div class="mode-info">
<h3>Click2Crawl</h3>
<p>Click elements to build extraction schemas</p>
</div>
</button>
<button id="script-mode" class="mode-button script">
<div class="icon">🔴</div>
<div class="mode-info">
<h3>Script Builder <span style="color: #ff3c74; font-size: 10px;">(Alpha)</span></h3>
<p>Record actions to build automation scripts</p>
</div>
</button>
<button id="c2c-mode" class="mode-button c2c">
<div class="icon">📝</div>
<div class="mode-info">
<h3>Markdown Extraction</h3>
<p>Select elements and convert to clean markdown</p>
</div>
</button>
</div>
<div id="active-session" class="active-session hidden">
<div class="session-header">
<span class="status-dot"></span>
<span class="session-title">Schema Capture Active</span>
</div>
<div class="session-stats">
<div class="stat">
<span class="stat-label">Container:</span>
<span id="container-status" class="stat-value">Not selected</span>
</div>
<div class="stat">
<span class="stat-label">Fields:</span>
<span id="fields-count" class="stat-value">0</span>
</div>
</div>
<div class="session-actions">
<button id="generate-code" class="action-button primary" disabled>
<span>Generate Code</span>
</button>
<button id="stop-capture" class="action-button secondary">
<span>Stop Capture</span>
</button>
</div>
</div>
<div class="instructions" style="display: none;">
<h4>How to use:</h4>
<ol>
<li>Click "Click2Crawl" to start</li>
<li>Click on a container element (e.g., product card)</li>
<li>Click individual fields inside and name them</li>
<li>Generate Python code when done</li>
</ol>
</div>
<footer>
<div class="social-links">
<a href="https://docs.crawl4ai.com" target="_blank" class="social-link">
<svg viewBox="0 0 24 24" width="16" height="16">
<path fill="currentColor" d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 15l-5-5 1.41-1.41L10 14.17l7.59-7.59L19 8l-9 9z"/>
</svg>
<span>Docs</span>
</a>
<a href="https://twitter.com/unclecode" target="_blank" class="social-link">
<svg viewBox="0 0 24 24" width="16" height="16">
<path fill="currentColor" d="M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-5.214-6.817L4.99 21.75H1.68l7.73-8.835L1.254 2.25H8.08l4.713 6.231zm-1.161 17.52h1.833L7.084 4.126H5.117z"/>
</svg>
<span>@unclecode</span>
</a>
<a href="https://discord.gg/jP8KfhDhyN" target="_blank" class="social-link">
<svg viewBox="0 0 24 24" width="16" height="16">
<path fill="currentColor" d="M19.27 5.33C17.94 4.71 16.5 4.26 15 4a.09.09 0 00-.07.03c-.18.33-.39.76-.53 1.09a16.09 16.09 0 00-4.8 0c-.14-.34-.35-.76-.54-1.09-.01-.02-.04-.03-.07-.03-1.5.26-2.93.71-4.27 1.33-.01 0-.02.01-.03.02-2.72 4.07-3.47 8.03-3.1 11.95 0 .02.01.04.03.05 1.8 1.32 3.53 2.12 5.24 2.65.03.01.06 0 .07-.02.4-.55.76-1.13 1.07-1.74.02-.04 0-.08-.04-.09-.57-.22-1.11-.48-1.64-.78-.04-.02-.04-.08-.01-.11.11-.08.22-.17.33-.25.02-.02.05-.02.07-.01 3.44 1.57 7.15 1.57 10.55 0 .02-.01.05-.01.07.01.11.09.22.17.33.26.04.03.04.09-.01.11-.52.31-1.07.56-1.64.78-.04.01-.05.06-.04.09.32.61.68 1.19 1.07 1.74.03.01.06.02.09.01 1.72-.53 3.45-1.33 5.25-2.65.02-.01.03-.03.03-.05.44-4.53-.73-8.46-3.1-11.95-.01-.01-.02-.02-.04-.02zM8.52 14.91c-1.03 0-1.89-.95-1.89-2.12s.84-2.12 1.89-2.12c1.06 0 1.9.96 1.89 2.12 0 1.17-.84 2.12-1.89 2.12zm6.97 0c-1.03 0-1.89-.95-1.89-2.12s.84-2.12 1.89-2.12c1.06 0 1.9.96 1.89 2.12 0 1.17-.83 2.12-1.89 2.12z"/>
</svg>
<span>Discord</span>
</a>
</div>
</footer>
</div>
<script src="popup.js"></script>
</body>
</html>

View File

@@ -0,0 +1,146 @@
// Popup script for Crawl4AI Assistant
let activeMode = null;
document.addEventListener('DOMContentLoaded', () => {
// Fetch GitHub stars
fetchGitHubStars();
// Check current state
chrome.storage.local.get(['captureMode', 'captureStats'], (data) => {
if (data.captureMode) {
activeMode = data.captureMode;
showActiveSession(data.captureStats || {});
}
});
// Mode buttons
document.getElementById('schema-mode').addEventListener('click', () => {
startSchemaCapture();
});
document.getElementById('script-mode').addEventListener('click', () => {
startScriptCapture();
});
document.getElementById('c2c-mode').addEventListener('click', () => {
startClick2Crawl();
});
// Session actions
document.getElementById('generate-code').addEventListener('click', () => {
generateCode();
});
document.getElementById('stop-capture').addEventListener('click', () => {
stopCapture();
});
});
async function fetchGitHubStars() {
try {
const response = await fetch('https://api.github.com/repos/unclecode/crawl4ai');
const data = await response.json();
const stars = data.stargazers_count;
// Format the number (e.g., 1.2k)
let formattedStars;
if (stars >= 1000) {
formattedStars = (stars / 1000).toFixed(1) + 'k';
} else {
formattedStars = stars.toString();
}
document.getElementById('stars-count').textContent = formattedStars;
} catch (error) {
console.error('Failed to fetch GitHub stars:', error);
document.getElementById('stars-count').textContent = '⭐ 2k+';
}
}
function startSchemaCapture() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'startSchemaCapture'
}, (response) => {
if (response && response.success) {
// Close the popup to let user interact with the page
window.close();
}
});
});
}
function startScriptCapture() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'startScriptCapture'
}, (response) => {
if (response && response.success) {
// Close the popup to let user interact with the page
window.close();
}
});
});
}
function startClick2Crawl() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'startClick2Crawl'
}, (response) => {
if (response && response.success) {
// Close the popup to let user interact with the page
window.close();
}
});
});
}
function showActiveSession(stats) {
document.querySelector('.mode-selector').style.display = 'none';
document.getElementById('active-session').classList.remove('hidden');
updateSessionStats(stats);
}
function updateSessionStats(stats) {
document.getElementById('container-status').textContent =
stats.container ? 'Selected ✓' : 'Not selected';
document.getElementById('fields-count').textContent = stats.fields || 0;
// Enable generate button if we have container and fields
document.getElementById('generate-code').disabled =
!stats.container || stats.fields === 0;
}
function generateCode() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'generateCode'
});
});
}
function stopCapture() {
chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
chrome.tabs.sendMessage(tabs[0].id, {
action: 'stopCapture'
}, () => {
// Reset UI
document.querySelector('.mode-selector').style.display = 'flex';
document.getElementById('active-session').classList.add('hidden');
activeMode = null;
// Clear storage
chrome.storage.local.remove(['captureMode', 'captureStats']);
});
});
}
// Listen for stats updates from content script
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.action === 'updateStats') {
updateSessionStats(message.stats);
chrome.storage.local.set({ captureStats: message.stats });
}
});

302
docs/md_v2/apps/index.md Normal file
View File

@@ -0,0 +1,302 @@
# 🚀 Crawl4AI Interactive Apps
Welcome to the Crawl4AI Apps Hub - your gateway to interactive tools and demos that make web scraping more intuitive and powerful.
<style>
.apps-container {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(350px, 1fr));
gap: 2rem;
margin: 2rem 0;
}
.app-card {
background: #3f3f44;
border: 1px solid #3f3f44;
border-radius: 8px;
padding: 1.5rem;
transition: all 0.3s ease;
position: relative;
overflow: hidden;
}
.app-card:hover {
transform: translateY(-4px);
box-shadow: 0 8px 16px rgba(0, 0, 0, 0.3);
border-color: #50ffff;
}
.app-card h3 {
margin-top: 0;
display: flex;
align-items: center;
gap: 0.5rem;
color: #e8e9ed;
}
.app-status {
display: inline-block;
padding: 0.25rem 0.75rem;
border-radius: 20px;
font-size: 0.7rem;
font-weight: 600;
text-transform: uppercase;
margin-bottom: 1rem;
}
.status-available {
background: #50ffff;
color: #070708;
}
.status-beta {
background: #f59e0b;
color: #070708;
}
.status-coming-soon {
background: #2a2a2a;
color: #888;
}
.app-description {
margin: 1rem 0;
line-height: 1.6;
color: #a3abba;
}
.app-features {
list-style: none;
padding: 0;
margin: 1rem 0;
}
.app-features li {
padding-left: 1.5rem;
position: relative;
margin-bottom: 0.5rem;
color: #d5cec0;
font-size: 0.9rem;
}
.app-features li:before {
content: "▸";
position: absolute;
left: 0;
color: #50ffff;
font-weight: bold;
}
.app-action {
margin-top: 1.5rem;
}
.app-btn {
display: inline-block;
padding: 0.75rem 1.5rem;
background: #50ffff;
color: #070708;
text-decoration: none;
border-radius: 6px;
font-weight: 600;
transition: all 0.2s ease;
font-family: dm, Monaco, monospace;
}
.app-btn:hover {
background: #09b5a5;
transform: scale(1.05);
color: #070708;
}
.app-btn.disabled {
background: #2a2a2a;
color: #666;
cursor: not-allowed;
transform: none;
}
.app-btn.disabled:hover {
background: #2a2a2a;
transform: none;
}
.intro-section {
background: #3f3f44;
border-radius: 8px;
padding: 2rem;
margin-bottom: 3rem;
border: 1px solid #3f3f44;
}
.intro-section h2 {
margin-top: 0;
color: #50ffff;
}
.intro-section p {
color: #d5cec0;
}
</style>
<div class="intro-section">
<h2>🛠️ Interactive Tools for Modern Web Scraping</h2>
<p>
Our apps are designed to make Crawl4AI more accessible and powerful. Whether you're learning browser automation, designing extraction strategies, or building complex scrapers, these tools provide visual, interactive ways to work with Crawl4AI's features.
</p>
</div>
## 🎯 Available Apps
<div class="apps-container">
<div class="app-card">
<span class="app-status status-available">Available</span>
<h3>🎨 C4A-Script Interactive Editor</h3>
<p class="app-description">
A visual, block-based programming environment for creating browser automation scripts. Perfect for beginners and experts alike!
</p>
<ul class="app-features">
<li>Drag-and-drop visual programming</li>
<li>Real-time JavaScript generation</li>
<li>Interactive tutorials</li>
<li>Export to C4A-Script or JavaScript</li>
<li>Live preview capabilities</li>
</ul>
<div class="app-action">
<a href="c4a-script/" class="app-btn" target="_blank">Launch Editor →</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-available">Available</span>
<h3>🧠 LLM Context Builder</h3>
<p class="app-description">
Generate optimized context files for your favorite LLM when working with Crawl4AI. Get focused, relevant documentation based on your needs.
</p>
<ul class="app-features">
<li>Modular context generation</li>
<li>Memory, reasoning & examples perspectives</li>
<li>Component-based selection</li>
<li>Vibe coding preset</li>
<li>Download custom contexts</li>
</ul>
<div class="app-action">
<a href="llmtxt/" class="app-btn" target="_blank">Launch Builder →</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-coming-soon">Coming Soon</span>
<h3>🕸️ Web Scraping Playground</h3>
<p class="app-description">
Test your scraping strategies on real websites with instant feedback. See how different configurations affect your results.
</p>
<ul class="app-features">
<li>Live website testing</li>
<li>Side-by-side result comparison</li>
<li>Performance metrics</li>
<li>Export configurations</li>
</ul>
<div class="app-action">
<a href="#" class="app-btn disabled">Coming Soon</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-available">Available</span>
<h3>🔍 Crawl4AI Assistant (Chrome Extension)</h3>
<p class="app-description">
Visual schema builder Chrome extension - click on webpage elements to generate extraction schemas and Python code!
</p>
<ul class="app-features">
<li>Visual element selection</li>
<li>Container & field selection modes</li>
<li>Smart selector generation</li>
<li>Complete Python code generation</li>
<li>One-click installation</li>
</ul>
<div class="app-action">
<a href="crawl4ai-assistant/" class="app-btn">Install Extension →</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-coming-soon">Coming Soon</span>
<h3>🧪 Extraction Lab</h3>
<p class="app-description">
Experiment with different extraction strategies and see how they perform on your content. Compare LLM vs CSS vs XPath approaches.
</p>
<ul class="app-features">
<li>Strategy comparison tools</li>
<li>Performance benchmarks</li>
<li>Cost estimation for LLM strategies</li>
<li>Best practice recommendations</li>
</ul>
<div class="app-action">
<a href="#" class="app-btn disabled">Coming Soon</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-coming-soon">Coming Soon</span>
<h3>🤖 AI Prompt Designer</h3>
<p class="app-description">
Craft and test prompts for LLM-based extraction. See how different prompts affect extraction quality and costs.
</p>
<ul class="app-features">
<li>Prompt templates library</li>
<li>A/B testing interface</li>
<li>Token usage calculator</li>
<li>Quality metrics</li>
</ul>
<div class="app-action">
<a href="#" class="app-btn disabled">Coming Soon</a>
</div>
</div>
<div class="app-card">
<span class="app-status status-coming-soon">Coming Soon</span>
<h3>📊 Crawl Monitor</h3>
<p class="app-description">
Real-time monitoring dashboard for your crawling operations. Track performance, debug issues, and optimize your scrapers.
</p>
<ul class="app-features">
<li>Real-time crawl statistics</li>
<li>Error tracking and debugging</li>
<li>Resource usage monitoring</li>
<li>Historical analytics</li>
</ul>
<div class="app-action">
<a href="#" class="app-btn disabled">Coming Soon</a>
</div>
</div>
</div>
## 🚀 Why Use These Apps?
### 🎯 **Accelerate Learning**
Visual tools help you understand Crawl4AI's concepts faster than reading documentation alone.
### 💡 **Reduce Development Time**
Generate working code instantly instead of writing everything from scratch.
### 🔍 **Improve Quality**
Test and refine your approach before deploying to production.
### 🤝 **Community Driven**
These tools are built based on user feedback. Have an idea? [Let us know](https://github.com/unclecode/crawl4ai/issues)!
## 📢 Stay Updated
Want to know when new apps are released?
- ⭐ [Star us on GitHub](https://github.com/unclecode/crawl4ai) to get notifications
- 🐦 Follow [@unclecode](https://twitter.com/unclecode) for announcements
- 💬 Join our [Discord community](https://discord.gg/crawl4ai) for early access
---
!!! tip "Developer Resources"
Building your own tools with Crawl4AI? Check out our [API Reference](../api/async-webcrawler.md) and [Integration Guide](../advanced/advanced-features.md) for comprehensive documentation.

View File

@@ -0,0 +1,75 @@
**Prompt for AI Coding Assistant: Create an Interactive LLM Context Builder Page**
**Objective:**
Your task is to create an interactive HTML webpage with JavaScript functionality that allows users to select and combine different `crawl4ai` LLM context files into a single downloadable Markdown (`.md`) file. This tool will empower users to craft tailored context for their AI assistants based on their specific needs.
**Core Functionality:**
1. **Display `crawl4ai` Components:** The page will list all available `crawl4ai` documentation components.
2. **Select Context Types:** For each component, users can select which types of context they want to include:
* Memory (API facts)
* Reasoning (How-to/why)
* Examples (Code snippets)
(All should be selected by default for each initially selected component).
3. **Special "Aggregate" Contexts:** Include options for special, pre-combined contexts:
* "Vibe Coding" (a curated mix for general AI prompting)
* "All Library Context" (a comprehensive aggregation of all memory, reasoning, and examples for the entire library).
4. **Fetch and Concatenate:** When the user clicks a "Download Combined Context" button:
* The JavaScript will fetch the content of all selected Markdown files from the server (from a predefined folder, e.g., `/llmtxt/`).
* It will concatenate the content of these files into a single string.
5. **Client-Side Download:** The concatenated content will be offered to the user as a download (e.g., `custom_crawl4ai_context.md`).
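A minimal sketch of the fetch/concatenate/download flow described in steps 4-5 (function and variable names here are illustrative, not requirements):

```javascript
// Sketch only: fetch each selected context file, concatenate the contents,
// and trigger a purely client-side download. Names are illustrative.
async function downloadCombinedContext(selectedFiles) {
  const parts = [];
  for (const file of selectedFiles) {
    const response = await fetch(`/llmtxt/${file}`); // files served from the /llmtxt/ folder
    if (!response.ok) throw new Error(`Failed to fetch ${file}`);
    parts.push(`<!-- Source: ${file} -->\n` + await response.text());
  }
  // Build a Blob and offer it as a download, no server round-trip needed
  const blob = new Blob([parts.join('\n\n---\n\n')], { type: 'text/markdown' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = 'custom_crawl4ai_context.md';
  document.body.appendChild(a);
  a.click();
  a.remove();
  URL.revokeObjectURL(url);
}
```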
**Input/Assumptions:**
* **Context Files Location:** All individual context Markdown files are located on the server in a publicly accessible folder named `llmtxt/`.
* **File Naming Convention:** Files follow the pattern: `crawl4ai_{{component_name}}_[memory|reasoning|examples]_content.llm.md`; a sketch after this list shows one way to expand it.
* `{{component_name}}` can contain underscores (e.g., `deep_crawling`, `config_objects`).
* The special contexts will have names like `crawl4ai_vibe_content.llm.md` and `crawl4ai_all_content.llm.md`.
* **Component List:** You will be provided with a list of `crawl4ai` components. For this implementation, use the following list:
* `core`
* `config_objects`
* `deep_crawling`
* `deployment` (covers Installation & Docker Deployment)
* `extraction` (covers Structured Data Extraction)
* `markdown` (covers Markdown Generation Algorithm)
* `pdf_processing`
* *(No separate "Vibe Coding" or "All Library Context" in this list, as they are special top-level selections)*
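A small sketch of expanding the naming convention above into a download list (the `selections` shape is an assumption, not part of the spec):

```javascript
// Sketch only: expand selected components and context types into file names
// following the convention above. The `selections` shape is assumed.
function buildFileList(selections) {
  // e.g. { core: ['memory', 'reasoning', 'examples'], deep_crawling: ['examples'] }
  const files = [];
  for (const [component, types] of Object.entries(selections)) {
    for (const type of types) {
      files.push(`crawl4ai_${component}_${type}_content.llm.md`);
    }
  }
  return files;
}
// buildFileList({ core: ['memory'] }) -> ['crawl4ai_core_memory_content.llm.md']
```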
**Detailed UI/UX Requirements:**
1. **Main Page Structure:**
* **Header:** "Crawl4AI Interactive LLM Context Builder"
* **Introduction:** Briefly explain the purpose of the tool (from the `USING_LLM_CONTEXTS.md` content you helped draft: "Supercharging Your AI Assistant...").
* **Selection Area:**
* **Special Aggregate Contexts (Radio Buttons or Prominent Checkboxes):**
* [ ] "Vibe Coding Context" (`crawl4ai_vibe_content.llm.md`)
* [ ] "All Library Context (Comprehensive)" (`crawl4ai_all_content.llm.md`)
* *Behavior:* Selecting one of these might disable individual component selections (or vice-versa) to avoid redundancy, or simply add them to the list. Consider the user experience here. A simple approach is that if an aggregate is selected, it's the *only* thing downloaded (see the sketch after this section).
* **Individual Component Selection (Table or List of Checkboxes):**
* A section titled "Select Individual Components & Context Types:"
* For each component in the provided list:
* A master checkbox for the component itself (e.g., `[ ] Core Functionality`). Selected by default.
* Nested checkboxes (indented or grouped) for context types, enabled only if the parent component is checked:
* `[x] Memory (API Facts)`
* `[x] Reasoning (How-to/Why)`
* `[x] Examples (Code Snippets)`
(These three sub-checkboxes should be selected by default if the parent component is selected).
* **Action Button:**
* A button: "Generate & Download Combined Context"
* **Status/Feedback Area:** (Optional, but good UX)
* Display messages like "Fetching files...", "Combining context...", "Download starting..." or error messages.
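One possible sketch of the "aggregate wins" behavior mentioned above (the checkbox names and classes are assumptions; adapt them to the actual markup):

```javascript
// Sketch only: when any aggregate context is checked, disable the
// individual component/type checkboxes. Selectors are assumed.
const aggregates = document.querySelectorAll('input[name="aggregate"]');
aggregates.forEach(input => {
  input.addEventListener('change', () => {
    const anyChecked = [...aggregates].some(cb => cb.checked);
    document.querySelectorAll('.component-checkbox, .type-checkbox')
      .forEach(cb => { cb.disabled = anyChecked; });
  });
});
```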
**Final Output:**
* A single HTML file (e.g., `interactive_context_builder.html`).
* Associated JavaScript code (can be inline within `<script>` tags or in a separate `.js` file).
* Associated CSS code (can be inline within `<style>` tags or in a separate `.css` file).
This interactive tool will greatly enhance the user experience for `crawl4ai` developers looking to leverage your specialized LLM contexts. Please ensure the JavaScript is robust and provides good user feedback.
---
This prompt should give your AI coding assistant a very clear set of requirements and guidelines for building the interactive context builder. Remember to provide it with the list of components as mentioned in the "Input/Assumptions" section.

View File

@@ -0,0 +1,135 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Crawl4AI LLM Context Builder</title>
<link rel="stylesheet" href="llmtxt.css">
</head>
<body>
<div class="terminal-container">
<div class="header">
<div class="header-content">
<div class="logo-section">
<img src="../../img/favicon-32x32.png" alt="Crawl4AI Logo" class="logo">
<div>
<h1>Crawl4AI LLM Context Builder</h1>
<p class="tagline">Multi-Dimensional Context for AI Assistants</p>
</div>
</div>
<nav class="nav-links">
<a href="../../" class="nav-link">← Back to Docs</a>
<a href="../" class="nav-link">All Apps</a>
<a href="https://github.com/unclecode/crawl4ai" class="nav-link" target="_blank">GitHub</a>
</nav>
</div>
</div>
<div class="content">
<section class="intro">
<div class="intro-header">
<h2>🧠 A New Approach to LLM Context</h2>
<p>
Traditional <code>llm.txt</code> files often fail with complex libraries like Crawl4AI. They dump massive amounts of API documentation, causing <strong>information overload</strong> and <strong>lost focus</strong>. They provide the "what" but miss the crucial "how" and "why" that makes AI assistants truly helpful.
</p>
</div>
<div class="intro-solution">
<h3>💡 The Solution: Multi-Dimensional, Modular Contexts</h3>
<p>
Inspired by modular libraries like Lodash, I've redesigned how we provide context to AI assistants. Instead of one monolithic file, Crawl4AI's documentation is organized by <strong>components</strong> and <strong>perspectives</strong>.
</p>
<div class="dimensions">
<div class="dimension">
<span class="badge memory">Memory</span>
<h4>The "What"</h4>
<p>Precise API facts, parameters, signatures, and configuration objects. Your unambiguous reference.</p>
</div>
<div class="dimension">
<span class="badge reasoning">Reasoning</span>
<h4>The "How" & "Why"</h4>
<p>Design principles, best practices, trade-offs, and workflows. Teaches AI to think like an expert.</p>
</div>
<div class="dimension">
<span class="badge examples">Examples</span>
<h4>The "Show Me"</h4>
<p>Runnable code snippets demonstrating patterns in action. Pure practical implementation.</p>
</div>
</div>
</div>
<div class="intro-benefits">
<p>
<strong>Why this matters:</strong> You can now give your AI assistant exactly what it needs - whether that's quick API lookups, help designing solutions, or seeing practical implementations. No more information overload, just focused, relevant context.
</p>
<p class="learn-more">
<a href="/blog/articles/llm-context-revolution" class="learn-more-link" target="_parent">📖 Read the full story behind this approach →</a>
</p>
</div>
</section>
<section class="builder">
<div class="component-selector" id="component-selector">
<h2>Select Components & Context Types</h2>
<div class="select-all-controls">
<button class="btn-small" id="select-all">Select All</button>
<button class="btn-small" id="deselect-all">Deselect All</button>
</div>
<div class="component-table-wrapper">
<table class="component-selection-table">
<thead>
<tr>
<th width="50"></th>
<th>Component</th>
<th class="clickable-header" data-type="memory">Memory<br><span class="header-subtitle">Full Content</span></th>
<th class="clickable-header" data-type="reasoning">Reasoning<br><span class="header-subtitle">Diagrams</span></th>
<th class="clickable-header" data-type="examples">Examples<br><span class="header-subtitle">Code</span></th>
</tr>
</thead>
<tbody id="components-tbody">
<!-- Components will be dynamically inserted here -->
</tbody>
</table>
</div>
</div>
<div class="action-area">
<div class="token-summary" id="token-summary">
<span class="token-label">Estimated Tokens:</span>
<span class="token-count" id="total-tokens">0</span>
</div>
<button class="download-btn" id="download-btn">
<span class="icon"></span> Generate & Download Context
</button>
<div class="status" id="status"></div>
</div>
</section>
<section class="reference-table">
<h2>Available Context Files</h2>
<div class="table-wrapper">
<table class="context-table">
<thead>
<tr>
<th>Component</th>
<th>Memory</th>
<th>Reasoning</th>
<th>Examples</th>
<th>Full</th>
</tr>
</thead>
<tbody id="reference-table-body">
<!-- Table rows will be dynamically inserted here -->
</tbody>
</table>
</div>
</section>
</div>
</div>
<script src="llmtxt.js"></script>
</body>
</html>

View File

@@ -0,0 +1,603 @@
/* Terminal Theme CSS for LLM Context Builder */
/* Font Face Definitions */
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: 'Dank Mono';
src: url('../assets/DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
font-display: swap;
}
:root {
--background-color: #070708;
--font-color: #e8e9ed;
--primary-color: #50ffff;
--primary-dimmed: #09b5a5;
--secondary-color: #d5cec0;
--tertiary-color: #a3abba;
--accent-color: rgb(243, 128, 245);
--error-color: #ff3c74;
--code-bg-color: #3f3f44;
--border-color: #3f3f44;
--hover-bg: #1a1a1c;
--success-color: #50ff50;
--bg-secondary: #1a1a1a;
--primary-green: #0fbbaa;
--font-primary: 'Dank Mono', dm, Monaco, Courier New, monospace;
}
* {
box-sizing: border-box;
}
body {
margin: 0;
padding: 0;
font-family: var(--font-primary);
font-size: 14px;
line-height: 1.5;
background-color: var(--background-color);
color: var(--font-color);
}
/* Terminal Container */
.terminal-container {
min-height: 100vh;
display: flex;
flex-direction: column;
}
/* Header - matching assistant layout */
.header {
background: var(--bg-secondary);
border-bottom: 1px solid var(--border-color);
padding: 1.5rem 0;
position: sticky;
top: 0;
z-index: 100;
backdrop-filter: blur(10px);
background: rgba(26, 26, 26, 0.95);
}
.header-content {
max-width: 1200px;
margin: 0 auto;
padding: 0 2rem;
display: flex;
justify-content: space-between;
align-items: center;
}
.logo-section {
display: flex;
align-items: center;
gap: 1rem;
}
.logo {
width: 48px;
height: 48px;
}
.logo-section h1 {
font-size: 1.75rem;
font-weight: 700;
color: var(--font-color);
margin: 0;
}
.tagline {
font-size: 0.875rem;
color: var(--tertiary-color);
margin-top: 0.25rem;
}
.nav-links {
display: flex;
gap: 2rem;
}
.nav-link {
color: var(--tertiary-color);
text-decoration: none;
font-size: 0.875rem;
transition: color 0.2s ease;
}
.nav-link:hover {
color: var(--primary-green);
}
/* Content */
.content {
flex: 1;
max-width: 1200px;
margin: 0 auto;
padding: 2rem;
width: 100%;
}
/* Intro Section */
.intro {
background-color: var(--code-bg-color);
border: 1px solid var(--border-color);
padding: 30px;
margin-bottom: 30px;
}
.intro-header h2 {
color: var(--primary-color);
margin: 0 0 15px 0;
font-size: 20px;
}
.intro-header p {
line-height: 1.6;
margin-bottom: 0;
}
.intro-header code {
background-color: var(--hover-bg);
padding: 2px 6px;
color: var(--primary-dimmed);
}
.intro-solution {
margin-top: 5px;
padding-top: 25px;
border-top: 1px dashed var(--border-color);
}
.intro-solution h3 {
color: var(--secondary-color);
margin: 0 0 15px 0;
font-size: 18px;
}
.dimensions {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 20px;
margin: 20px 0;
}
.dimension {
background-color: var(--hover-bg);
padding: 20px;
border: 1px solid var(--border-color);
transition: all 0.2s ease;
}
.dimension:hover {
border-color: var(--primary-dimmed);
}
.dimension h4 {
color: var(--font-color);
margin: 10px 0 8px 0;
font-size: 16px;
}
.dimension p {
font-size: 13px;
line-height: 1.5;
color: var(--tertiary-color);
margin: 0;
}
.intro-benefits {
margin-top: 0px;
padding-top: 0;
border-top: 1px dashed var(--border-color);
}
.intro-benefits strong {
color: var(--primary-color);
}
.learn-more {
margin-top: 15px;
}
.learn-more-link {
color: var(--primary-dimmed);
text-decoration: none;
font-weight: bold;
transition: color 0.2s ease;
}
.learn-more-link:hover {
color: var(--primary-color);
text-decoration: underline;
}
.badge {
display: inline-block;
padding: 2px 8px;
font-size: 12px;
text-transform: uppercase;
margin-right: 8px;
}
.badge.memory {
background-color: var(--primary-dimmed);
color: var(--background-color);
}
.badge.reasoning {
background-color: var(--accent-color);
color: var(--background-color);
}
.badge.examples {
background-color: var(--secondary-color);
color: var(--background-color);
}
/* Builder Section */
.builder {
margin-bottom: 40px;
}
.builder h2 {
color: var(--primary-color);
font-size: 18px;
margin-bottom: 20px;
text-transform: uppercase;
}
/* Preset Options */
.preset-options {
display: flex;
gap: 20px;
margin-bottom: 30px;
}
.preset-option {
flex: 1;
cursor: pointer;
}
.preset-option input[type="radio"] {
display: none;
}
.preset-card {
border: 2px solid var(--border-color);
padding: 20px;
transition: all 0.2s ease;
background-color: var(--code-bg-color);
}
.preset-card h3 {
margin: 0 0 10px 0;
color: var(--secondary-color);
font-size: 16px;
}
.preset-card p {
margin: 0;
font-size: 12px;
color: var(--tertiary-color);
}
.preset-option input:checked + .preset-card {
border-color: var(--primary-color);
background-color: var(--hover-bg);
}
.preset-card:hover {
border-color: var(--primary-dimmed);
}
/* Component Selector */
.component-selector {
margin-bottom: 30px;
}
.select-all-controls {
margin-bottom: 20px;
}
.btn-small {
background-color: var(--code-bg-color);
color: var(--font-color);
border: 1px solid var(--border-color);
padding: 5px 15px;
margin-right: 10px;
cursor: pointer;
font-family: inherit;
font-size: 12px;
text-transform: uppercase;
transition: all 0.2s ease;
}
.btn-small:hover {
background-color: var(--primary-dimmed);
color: var(--background-color);
}
/* Component Selection Table */
.component-table-wrapper {
overflow-x: auto;
border: 1px solid var(--border-color);
margin-top: 20px;
}
.component-selection-table {
width: 100%;
border-collapse: collapse;
background-color: var(--code-bg-color);
}
.component-selection-table th,
.component-selection-table td {
padding: 12px;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
.component-selection-table th {
background-color: var(--hover-bg);
color: var(--primary-color);
text-transform: uppercase;
font-size: 12px;
letter-spacing: 1px;
font-weight: bold;
}
.header-subtitle {
font-size: 10px;
color: var(--tertiary-color);
text-transform: none;
font-weight: normal;
display: block;
margin-top: 2px;
}
.component-selection-table th.clickable-header {
cursor: pointer;
user-select: none;
transition: background-color 0.2s ease;
}
.component-selection-table th.clickable-header:hover {
background-color: var(--primary-dimmed);
color: var(--background-color);
}
.component-selection-table th.clickable-header[data-type="examples"] {
cursor: default;
opacity: 0.5;
}
.component-selection-table th.clickable-header[data-type="examples"]:hover {
background-color: var(--hover-bg);
color: var(--primary-color);
}
.component-selection-table th:nth-child(3),
.component-selection-table th:nth-child(4),
.component-selection-table th:nth-child(5) {
text-align: center;
width: 120px;
}
.component-selection-table td {
font-size: 13px;
}
.component-selection-table td:nth-child(3),
.component-selection-table td:nth-child(4),
.component-selection-table td:nth-child(5) {
text-align: center;
}
.component-selection-table tr:hover td {
background-color: var(--hover-bg);
}
.component-name {
color: var(--primary-color);
font-weight: bold;
}
/* Token display in table cells */
.token-info {
display: block;
font-size: 11px;
color: var(--tertiary-color);
margin-top: 2px;
}
.component-selection-table input[type="checkbox"] {
cursor: pointer;
width: 16px;
height: 16px;
}
.component-selection-table input[type="checkbox"]:disabled {
cursor: not-allowed;
opacity: 0.3;
}
/* Disabled row state */
.component-selection-table tr.disabled td:not(:first-child) {
opacity: 0.5;
pointer-events: none;
}
/* Action Area */
.action-area {
text-align: center;
margin: 40px 0;
}
/* Token Summary */
.token-summary {
margin-bottom: 20px;
font-size: 16px;
}
.token-label {
color: var(--tertiary-color);
margin-right: 10px;
}
.token-count {
color: var(--primary-color);
font-weight: bold;
font-size: 20px;
}
.token-count::after {
content: " est.";
font-size: 12px;
color: var(--tertiary-color);
margin-left: 4px;
}
.download-btn {
background-color: var(--primary-dimmed);
color: var(--background-color);
border: none;
padding: 15px 40px;
font-size: 16px;
font-family: inherit;
cursor: pointer;
text-transform: uppercase;
letter-spacing: 1px;
transition: all 0.2s ease;
display: inline-flex;
align-items: center;
gap: 10px;
}
.download-btn:hover {
background-color: var(--primary-color);
transform: translateY(-2px);
}
.download-btn .icon {
font-size: 20px;
}
.status {
margin-top: 20px;
font-size: 14px;
min-height: 30px;
}
.status.loading {
color: var(--primary-color);
}
.status.success {
color: var(--success-color);
}
.status.error {
color: var(--error-color);
}
/* Reference Table */
.reference-table {
margin-top: 60px;
}
.reference-table h2 {
color: var(--primary-color);
font-size: 18px;
margin-bottom: 20px;
text-transform: uppercase;
}
.table-wrapper {
overflow-x: auto;
border: 1px solid var(--border-color);
}
.context-table {
width: 100%;
border-collapse: collapse;
background-color: var(--code-bg-color);
}
.context-table th,
.context-table td {
padding: 12px;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
.context-table th {
background-color: var(--hover-bg);
color: var(--primary-color);
text-transform: uppercase;
font-size: 12px;
letter-spacing: 1px;
}
.context-table td {
font-size: 13px;
}
.context-table tr:hover td {
background-color: var(--hover-bg);
}
.file-link {
color: var(--primary-dimmed);
text-decoration: none;
}
.file-link:hover {
color: var(--primary-color);
text-decoration: underline;
}
.file-size {
color: var(--tertiary-color);
font-size: 11px;
}
/* Responsive Design */
@media (max-width: 768px) {
.header-content {
flex-direction: column;
gap: 1.5rem;
}
.nav-links {
gap: 1rem;
}
.preset-options {
flex-direction: column;
}
.components-grid {
grid-template-columns: 1fr;
}
.content {
padding: 1rem;
}
}

View File

@@ -0,0 +1,580 @@
// Crawl4AI LLM Context Builder JavaScript
// Component definitions - order matters
const components = [
{
id: 'installation',
name: 'Installation',
description: 'Setup and installation options'
},
{
id: 'simple_crawling',
name: 'Simple Crawling',
description: 'Basic web crawling operations'
},
{
id: 'config_objects',
name: 'Configuration Objects',
description: 'Browser and crawler configuration'
},
{
id: 'extraction-llm',
name: 'Data Extraction Using LLM',
description: 'Structured data extraction strategies using LLMs'
},
{
id: 'extraction-no-llm',
name: 'Data Extraction Without LLM',
description: 'Structured data extraction strategies without LLMs'
},
{
id: 'multi_urls_crawling',
name: 'Multi URLs Crawling',
description: 'Crawling multiple URLs efficiently'
},
{
id: 'deep_crawling',
name: 'Deep Crawling',
description: 'Multi-page crawling strategies'
},
{
id: 'docker',
name: 'Docker',
description: 'Docker deployment and configuration'
},
{
id: 'cli',
name: 'CLI',
description: 'Command-line interface usage'
},
{
id: 'http_based_crawler_strategy',
name: 'HTTP-based Crawler',
description: 'HTTP crawler strategy implementation'
},
{
id: 'url_seeder',
name: 'URL Seeder',
description: 'URL seeding and discovery'
},
{
id: 'deep_crawl_advanced_filters_scorers',
name: 'Advanced Filters & Scorers',
description: 'Deep crawl filtering and scoring'
}
];
// Context types
const contextTypes = ['memory', 'reasoning', 'examples'];
// State management
const state = {
selectedComponents: new Set(),
selectedContextTypes: new Map(),
tokenCounts: new Map() // Store token counts for each file
};
// Initialize the application
document.addEventListener('DOMContentLoaded', () => {
renderComponents();
renderReferenceTable();
setupActionHandlers();
setupColumnHeaderHandlers();
// Initialize first component as selected with available context types
const firstComponent = components[0];
state.selectedComponents.add(firstComponent.id);
state.selectedContextTypes.set(firstComponent.id, new Set(['memory', 'reasoning']));
updateComponentUI();
});
// Estimate token count from text (rough heuristic: words × 2.5)
function estimateTokens(text) {
if (!text) return 0;
const words = text.trim().split(/\s+/).length;
return Math.round(words * 2.5);
}
// Update total token count display
function updateTotalTokenCount() {
let totalTokens = 0;
state.selectedComponents.forEach(compId => {
const types = state.selectedContextTypes.get(compId);
if (types) {
types.forEach(type => {
const key = `${compId}-${type}`;
totalTokens += state.tokenCounts.get(key) || 0;
});
}
});
document.getElementById('total-tokens').textContent = totalTokens.toLocaleString();
}
// Render component selection table
function renderComponents() {
const tbody = document.getElementById('components-tbody');
tbody.innerHTML = '';
components.forEach(component => {
const row = createComponentRow(component);
tbody.appendChild(row);
});
// Fetch token counts for all files
fetchAllTokenCounts();
}
// Create a component table row
function createComponentRow(component) {
const tr = document.createElement('tr');
tr.id = `component-${component.id}`;
// Component checkbox cell
const checkboxCell = document.createElement('td');
checkboxCell.innerHTML = `
<input type="checkbox" id="check-${component.id}"
data-component="${component.id}">
`;
tr.appendChild(checkboxCell);
// Component name cell
const nameCell = document.createElement('td');
nameCell.innerHTML = `<span class="component-name">${component.name}</span>`;
tr.appendChild(nameCell);
// Context type cells
contextTypes.forEach(type => {
const td = document.createElement('td');
const key = `${component.id}-${type}`;
const tokenCount = state.tokenCounts.get(key) || 0;
const isDisabled = type === 'examples' ? 'disabled' : '';
td.innerHTML = `
<input type="checkbox" id="check-${component.id}-${type}"
data-component="${component.id}" data-type="${type}"
${isDisabled}>
<span class="token-info" id="tokens-${component.id}-${type}">
${tokenCount > 0 ? `${tokenCount.toLocaleString()} tokens` : ''}
</span>
`;
tr.appendChild(td);
});
// Add event listeners
const mainCheckbox = tr.querySelector(`#check-${component.id}`);
mainCheckbox.addEventListener('change', (e) => {
handleComponentToggle(component.id, e.target.checked);
});
// Add event listeners for context type checkboxes
contextTypes.forEach(type => {
const typeCheckbox = tr.querySelector(`#check-${component.id}-${type}`);
if (!typeCheckbox.disabled) {
typeCheckbox.addEventListener('change', (e) => {
handleContextTypeToggle(component.id, type, e.target.checked);
});
}
});
return tr;
}
// Handle component checkbox toggle
function handleComponentToggle(componentId, checked) {
if (checked) {
state.selectedComponents.add(componentId);
// Select all available context types, whether the component was
// newly selected or only partially selected before
state.selectedContextTypes.set(componentId, new Set(['memory', 'reasoning']));
} else {
state.selectedComponents.delete(componentId);
state.selectedContextTypes.delete(componentId);
}
updateComponentUI();
}
// Handle component selection based on context types
function updateComponentSelection(componentId) {
const types = state.selectedContextTypes.get(componentId) || new Set();
if (types.size > 0) {
state.selectedComponents.add(componentId);
} else {
state.selectedComponents.delete(componentId);
}
}
// Handle context type checkbox toggle
function handleContextTypeToggle(componentId, type, checked) {
if (!state.selectedContextTypes.has(componentId)) {
state.selectedContextTypes.set(componentId, new Set());
}
const types = state.selectedContextTypes.get(componentId);
if (checked) {
types.add(type);
} else {
types.delete(type);
}
updateComponentSelection(componentId);
updateComponentUI();
}
// Update UI to reflect current state
function updateComponentUI() {
components.forEach(component => {
const row = document.getElementById(`component-${component.id}`);
if (!row) return;
const mainCheckbox = row.querySelector(`#check-${component.id}`);
const hasSelection = state.selectedComponents.has(component.id);
const selectedTypes = state.selectedContextTypes.get(component.id) || new Set();
// Update main checkbox
mainCheckbox.checked = hasSelection;
// Update row disabled state
row.classList.toggle('disabled', !hasSelection);
// Update context type checkboxes
contextTypes.forEach(type => {
const typeCheckbox = row.querySelector(`#check-${component.id}-${type}`);
typeCheckbox.checked = selectedTypes.has(type);
});
});
updateTotalTokenCount();
}
// Fetch token counts for all files
async function fetchAllTokenCounts() {
const promises = [];
components.forEach(component => {
contextTypes.forEach(type => {
promises.push(fetchTokenCount(component.id, type));
});
});
await Promise.all(promises);
updateComponentUI();
renderReferenceTable(); // Update reference table with token counts
}
// Fetch token count for a specific file
async function fetchTokenCount(componentId, type) {
const key = `${componentId}-${type}`;
try {
const fileName = getFileName(componentId, type);
const baseUrl = getBaseUrl(type);
const response = await fetch(baseUrl + fileName);
if (response.ok) {
const content = await response.text();
const tokens = estimateTokens(content);
state.tokenCounts.set(key, tokens);
// Update UI
const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
if (tokenSpan) {
tokenSpan.textContent = `${tokens.toLocaleString()} tokens`;
}
} else if (type === 'examples') {
// Examples might not exist yet
state.tokenCounts.set(key, 0);
const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
if (tokenSpan) {
tokenSpan.textContent = '';
}
}
} catch (error) {
console.warn(`Failed to fetch token count for ${componentId}-${type}`);
if (type === 'examples') {
const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
if (tokenSpan) {
tokenSpan.textContent = '';
}
}
}
}
// Get file name based on component and type
function getFileName(componentId, type) {
// For new structure, all files are just [componentId].txt
return `${componentId}.txt`;
}
// Get base URL based on context type
function getBaseUrl(type) {
// For MkDocs, we need to go up to the root level
const basePrefix = window.location.pathname.includes('/apps/') ? '../../' : '/';
switch(type) {
case 'memory':
return basePrefix + 'assets/llm.txt/txt/';
case 'reasoning':
return basePrefix + 'assets/llm.txt/diagrams/';
case 'examples':
return basePrefix + 'assets/llm.txt/examples/'; // Will return 404 for now
default:
return basePrefix + 'assets/llm.txt/txt/';
}
}
// Setup action button handlers
function setupActionHandlers() {
// Select/Deselect all buttons
document.getElementById('select-all').addEventListener('click', () => {
components.forEach(comp => {
state.selectedComponents.add(comp.id);
state.selectedContextTypes.set(comp.id, new Set(['memory', 'reasoning']));
});
updateComponentUI();
});
document.getElementById('deselect-all').addEventListener('click', () => {
state.selectedComponents.clear();
state.selectedContextTypes.clear();
updateComponentUI();
});
// Download button
document.getElementById('download-btn').addEventListener('click', handleDownload);
}
// Setup column header click handlers
function setupColumnHeaderHandlers() {
const headers = document.querySelectorAll('.clickable-header');
headers.forEach(header => {
header.addEventListener('click', () => {
const type = header.getAttribute('data-type');
toggleColumnSelection(type);
});
});
}
// Toggle all checkboxes in a column
function toggleColumnSelection(type) {
// Don't toggle examples column
if (type === 'examples') return;
// Check if all are currently selected
let allSelected = true;
components.forEach(comp => {
const types = state.selectedContextTypes.get(comp.id);
if (!types || !types.has(type)) {
allSelected = false;
}
});
// Toggle all
components.forEach(comp => {
if (!state.selectedContextTypes.has(comp.id)) {
state.selectedContextTypes.set(comp.id, new Set());
}
const types = state.selectedContextTypes.get(comp.id);
if (allSelected) {
types.delete(type);
} else {
types.add(type);
}
updateComponentSelection(comp.id);
});
updateComponentUI();
}
// Handle download action
async function handleDownload() {
const statusEl = document.getElementById('status');
statusEl.textContent = 'Preparing context files...';
statusEl.className = 'status loading';
try {
const files = getSelectedFiles();
if (files.length === 0) {
throw new Error('No files selected. Please select at least one component or preset.');
}
statusEl.textContent = `Fetching ${files.length} files...`;
const contents = await fetchFiles(files);
const combined = combineContents(contents);
downloadFile(combined, 'crawl4ai_custom_context.md');
statusEl.textContent = 'Download complete!';
statusEl.className = 'status success';
setTimeout(() => {
statusEl.textContent = '';
statusEl.className = 'status';
}, 3000);
} catch (error) {
statusEl.textContent = `Error: ${error.message}`;
statusEl.className = 'status error';
}
}
// Get list of selected files based on current state
function getSelectedFiles() {
const files = [];
// Build list of selected files with their context info
state.selectedComponents.forEach(compId => {
const types = state.selectedContextTypes.get(compId);
if (types) {
types.forEach(type => {
files.push({
componentId: compId,
type: type,
fileName: getFileName(compId, type),
baseUrl: getBaseUrl(type)
});
});
}
});
return files;
}
// Fetch multiple files
async function fetchFiles(fileInfos) {
const promises = fileInfos.map(async (fileInfo) => {
try {
const response = await fetch(fileInfo.baseUrl + fileInfo.fileName);
if (!response.ok) {
if (fileInfo.type === 'examples') {
return {
fileInfo,
content: `<!-- Examples for ${fileInfo.componentId} coming soon -->\n\nExamples are currently being developed for this component.`
};
}
console.warn(`Failed to fetch ${fileInfo.fileName} from ${fileInfo.baseUrl + fileInfo.fileName}`);
return { fileInfo, content: `<!-- Failed to load ${fileInfo.fileName} -->` };
}
const content = await response.text();
return { fileInfo, content };
} catch (error) {
if (fileInfo.type === 'examples') {
return {
fileInfo,
content: `<!-- Examples for ${fileInfo.componentId} coming soon -->\n\nExamples are currently being developed for this component.`
};
}
console.warn(`Error fetching ${fileInfo.fileName}:`, error);
return { fileInfo, content: `<!-- Error loading ${fileInfo.fileName} -->` };
}
});
return Promise.all(promises);
}
// Combine file contents with headers
function combineContents(fileContents) {
// Calculate total tokens
let totalTokens = 0;
fileContents.forEach(({ content }) => {
totalTokens += estimateTokens(content);
});
const header = `# Crawl4AI Custom LLM Context
Generated on: ${new Date().toISOString()}
Total files: ${fileContents.length}
Estimated tokens: ${totalTokens.toLocaleString()}
---
`;
const sections = fileContents.map(({ fileInfo, content }) => {
const component = components.find(c => c.id === fileInfo.componentId);
const componentName = component ? component.name : fileInfo.componentId;
const contextType = getContextTypeName(fileInfo.type);
const tokens = estimateTokens(content);
return `## ${componentName} - ${contextType}
Component ID: ${fileInfo.componentId}
Context Type: ${fileInfo.type}
Estimated tokens: ${tokens.toLocaleString()}
${content}
---
`;
});
return header + sections.join('\n');
}
// Get display name for context type
function getContextTypeName(type) {
switch(type) {
case 'memory': return 'Full Content';
case 'reasoning': return 'Diagrams & Workflows';
case 'examples': return 'Code Examples';
default: return type;
}
}
// Download file to user's computer
function downloadFile(content, fileName) {
const blob = new Blob([content], { type: 'text/markdown' });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = fileName;
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
URL.revokeObjectURL(url);
}
// Render reference table
function renderReferenceTable() {
const tbody = document.getElementById('reference-table-body');
tbody.innerHTML = '';
// Get base path for links
const basePrefix = window.location.pathname.includes('/apps/') ? '../../' : '/';
components.forEach(component => {
const row = document.createElement('tr');
const memoryTokens = state.tokenCounts.get(`${component.id}-memory`) || 0;
const reasoningTokens = state.tokenCounts.get(`${component.id}-reasoning`) || 0;
const examplesTokens = state.tokenCounts.get(`${component.id}-examples`) || 0;
row.innerHTML = `
<td><strong>${component.name}</strong></td>
<td>
<a href="${basePrefix}assets/llm.txt/txt/${component.id}.txt" class="file-link" target="_blank">Memory</a>
${memoryTokens > 0 ? `<span class="file-size">${memoryTokens.toLocaleString()} tokens</span>` : ''}
</td>
<td>
<a href="${basePrefix}assets/llm.txt/diagrams/${component.id}.txt" class="file-link" target="_blank">Reasoning</a>
${reasoningTokens > 0 ? `<span class="file-size">${reasoningTokens.toLocaleString()} tokens</span>` : ''}
</td>
<td>
${examplesTokens > 0
? `<a href="${basePrefix}assets/llm.txt/examples/${component.id}.txt" class="file-link" target="_blank">Examples</a>
<span class="file-size">${examplesTokens.toLocaleString()} tokens</span>`
: '-'
}
</td>
<td>-</td>
`;
tbody.appendChild(row);
});
}

View File

@@ -0,0 +1,37 @@
# Supercharging Your AI Assistant: My Journey to Better LLM Contexts for `crawl4ai`
When I started diving deep into using AI coding assistants with my own libraries, particularly `crawl4ai`, I quickly realized that the common approach to providing context via a simple `llm.txt` or even a beefed-up `README.md` just wasn't cutting it. This document explains the problems I encountered and how I've tried to create a more effective system for `crawl4ai`, allowing you (and your AI assistant) to get precisely the right information.
## My Frustration with Standard `llm.txt` Files
My experience with generic `llm.txt` files for complex libraries like `crawl4ai` revealed several pain points:
1. **Information Overload & Lost Focus:** I found that when I threw a massive, monolithic context file at an LLM, it often struggled. The sheer volume of information seemed to dilute its focus. If I asked a specific question about a niche feature, the LLM might get sidetracked by more prominent but currently irrelevant parts of the library. It felt like trying to find a single sentence in a thousand-page novel: the information was *there*, but not always accessible or prioritized correctly by the AI.
2. **The "What" Without the "How" or "Why":** Most `llm.txt` files I encountered were essentially API dumps a list of functions, classes, and parameters. This is the "what" of a library. But to truly use a library effectively, especially one as flexible as `crawl4ai`, you need the "how" (idiomatic usage patterns, best practices for common tasks) and the "why" (the design rationale behind certain features). Without this, I noticed my AI assistant would often generate syntactically correct but practically inefficient or non-idiomatic code. It was guessing the *intent* and the *best way* to use the library, and those guesses weren't always right.
3. **No Guidance on "Thinking" Like an Expert:** A static list of facts doesn't teach an LLM the *art* of using the library. It doesn't convey the trade-offs an experienced developer considers, the common pitfalls they've learned to avoid, or the clever ways to combine features to solve complex problems. I wanted my AI assistant to not just recall an API, but to help me *reason* about the best way to build a solution with `crawl4ai`.
## Inspiration: Selective Inclusion & Multi-Dimensional Understanding
I've always admired how libraries like Lodash or jQuery (in its modular days) allowed developers to pick and choose only the parts they needed, resulting in smaller, more focused bundles. This idea of modularity and selective inclusion resonated deeply with me as I thought about LLM context. Why force-feed an LLM the entire library's details when I'm only working on a specific component or task?
This led me to develop a new approach for `crawl4ai`: **multi-dimensional, modular contexts**.
Instead of one giant `llm.txt`, I've broken down the `crawl4ai` documentation into:
1. **Logical Components:** Context is organized around the major functional areas of the library (e.g., Core, Data Extraction, Deep Crawling, Markdown Generation, etc.). This allows you to select context relevant only to the task at hand.
2. **Three Dimensions of Context for Each Component:**
* **`_memory.md` (Foundational Memory):** This is the "what." It contains the precise, factual information about the component's public API, data structures, configuration objects, parameters, and method signatures. It's the detailed, unambiguous reference.
* **`_reasoning.md` (Reasoning & Problem-Solving Framework):** This is the "how" and "why." It includes design principles, common task workflows with decision guides, best practices, anti-patterns, illustrative code examples solving real problems, and explanations of trade-offs. It aims to guide the LLM in "thinking" like an expert `crawl4ai` user.
* **`_examples.md` (Practical Code Examples):** This is pure "show-me-the-code." It's a collection of runnable snippets demonstrating various ways to use the component's features and configurations, with minimal explanatory text. It's for quickly seeing different patterns in action.
**The Goal:**
My aim is to provide you with a flexible system. You can give your AI assistant:
* Just the **memory** files for quick API lookups.
* The **reasoning** files (perhaps with memory) for help designing solutions.
* The **examples** files for seeing practical implementations.
* A **combination** of these across one or more components tailored to your specific task.
* Or, for broader understanding, special aggregate contexts like the "Vibe Coding" context or the "All Library Context."
By providing these structured, multi-faceted contexts, I hope to significantly improve the quality and relevance of the assistance you get when using AI to code with `crawl4ai`. The following sections will guide you on how to select and use these different context files.

View File

@@ -0,0 +1,37 @@
/* docs/assets/feedback-overrides.css */
:root {
/* brand */
--feedback-primary-color: #09b5a5;
--feedback-highlight-color: #fed500; /* stars etc */
/* modal shell / text */
--feedback-modal-content-bg-color: var(--background-color);
--feedback-modal-content-text-color: var(--font-color);
--feedback-modal-content-border-color: var(--primary-dimmed-color);
--feedback-modal-content-border-radius: 4px;
/* overlay */
--feedback-overlay-bg-color: rgba(0,0,0,.75);
/* rating buttons */
--feedback-modal-rating-button-color: var(--secondary-color);
--feedback-modal-rating-button-selected-color: var(--primary-color);
/* inputs */
--feedback-modal-input-bg-color: var(--code-bg-color);
--feedback-modal-input-text-color: var(--font-color);
--feedback-modal-input-border-color: var(--primary-dimmed-color);
--feedback-modal-input-border-color-focused: var(--primary-color);
/* submit / secondary buttons */
--feedback-modal-button-submit-bg-color: var(--primary-color);
--feedback-modal-button-submit-bg-color-hover: var(--primary-dimmed-color);
--feedback-modal-button-submit-text-color: var(--invert-font-color);
--feedback-modal-button-bg-color: transparent; /* screenshot btn */
--feedback-modal-button-border-color: var(--primary-color);
--feedback-modal-button-icon-color: var(--primary-color);
}
/* optional: keep the “Powered by” link subtle */
.feedback-logo a{color:var(--secondary-color);}

View File

@@ -0,0 +1,5 @@
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-58W0K2ZQ25');

View File

@@ -0,0 +1,425 @@
## CLI Workflows and Profile Management
Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows.
### CLI Command Flow Architecture
```mermaid
flowchart TD
A[crwl command] --> B{Command Type?}
B -->|URL Crawling| C[Parse URL & Options]
B -->|Profile Management| D[profiles subcommand]
B -->|CDP Browser| E[cdp subcommand]
B -->|Browser Control| F[browser subcommand]
B -->|Configuration| G[config subcommand]
C --> C1{Output Format?}
C1 -->|Default| C2[HTML/Markdown]
C1 -->|JSON| C3[Structured Data]
C1 -->|markdown| C4[Clean Markdown]
C1 -->|markdown-fit| C5[Filtered Content]
C --> C6{Authentication?}
C6 -->|Profile Specified| C7[Load Browser Profile]
C6 -->|No Profile| C8[Anonymous Session]
C7 --> C9[Launch with User Data]
C8 --> C10[Launch Clean Browser]
C9 --> C11[Execute Crawl]
C10 --> C11
C11 --> C12{Success?}
C12 -->|Yes| C13[Return Results]
C12 -->|No| C14[Error Handling]
D --> D1[Interactive Profile Menu]
D1 --> D2{Menu Choice?}
D2 -->|Create| D3[Open Browser for Setup]
D2 -->|List| D4[Show Existing Profiles]
D2 -->|Delete| D5[Remove Profile]
D2 -->|Use| D6[Crawl with Profile]
E --> E1[Launch CDP Browser]
E1 --> E2[Remote Debugging Active]
F --> F1{Browser Action?}
F1 -->|start| F2[Start Builtin Browser]
F1 -->|stop| F3[Stop Builtin Browser]
F1 -->|status| F4[Check Browser Status]
F1 -->|view| F5[Open Browser Window]
G --> G1{Config Action?}
G1 -->|list| G2[Show All Settings]
G1 -->|set| G3[Update Setting]
G1 -->|get| G4[Read Setting]
style A fill:#e1f5fe
style C13 fill:#c8e6c9
style C14 fill:#ffcdd2
style D3 fill:#fff3e0
style E2 fill:#f3e5f5
```
### Profile Management Workflow
```mermaid
sequenceDiagram
participant User
participant CLI
participant ProfileManager
participant Browser
participant FileSystem
User->>CLI: crwl profiles
CLI->>ProfileManager: Initialize profile manager
ProfileManager->>FileSystem: Scan for existing profiles
FileSystem-->>ProfileManager: Profile list
ProfileManager-->>CLI: Show interactive menu
CLI-->>User: Display options
Note over User: User selects "Create new profile"
User->>CLI: Create profile "linkedin-auth"
CLI->>ProfileManager: create_profile("linkedin-auth")
ProfileManager->>FileSystem: Create profile directory
ProfileManager->>Browser: Launch with new user data dir
Browser-->>User: Opens browser window
Note over User: User manually logs in to LinkedIn
User->>Browser: Navigate and authenticate
Browser->>FileSystem: Save cookies, session data
User->>CLI: Press 'q' to save profile
CLI->>ProfileManager: finalize_profile()
ProfileManager->>FileSystem: Lock profile settings
ProfileManager-->>CLI: Profile saved
CLI-->>User: Profile "linkedin-auth" created
Note over User: Later usage
User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth
CLI->>ProfileManager: load_profile("linkedin-auth")
ProfileManager->>FileSystem: Read profile data
FileSystem-->>ProfileManager: User data directory
ProfileManager-->>CLI: Profile configuration
CLI->>Browser: Launch with existing profile
Browser-->>CLI: Authenticated session ready
CLI->>Browser: Navigate to target URL
Browser-->>CLI: Crawl results with auth context
CLI-->>User: Authenticated content
```
### Browser Management State Machine
```mermaid
stateDiagram-v2
[*] --> Stopped: Initial state
Stopped --> Starting: crwl browser start
Starting --> Running: Browser launched
Running --> Viewing: crwl browser view
Viewing --> Running: Close window
Running --> Stopping: crwl browser stop
Stopping --> Stopped: Cleanup complete
Running --> Restarting: crwl browser restart
Restarting --> Running: New browser instance
Stopped --> CDP_Mode: crwl cdp
CDP_Mode --> CDP_Running: Remote debugging active
CDP_Running --> CDP_Mode: Manual close
CDP_Mode --> Stopped: Exit CDP
Running --> StatusCheck: crwl browser status
StatusCheck --> Running: Return status
note right of Running : Port 9222 active\nBuiltin browser available
note right of CDP_Running : Remote debugging\nManual control enabled
note right of Viewing : Visual browser window\nDirect interaction
```
### Authentication Workflow for Protected Sites
```mermaid
flowchart TD
A[Protected Site Access Needed] --> B[Create Profile Strategy]
B --> C{Existing Profile?}
C -->|Yes| D[Test Profile Validity]
C -->|No| E[Create New Profile]
D --> D1{Profile Valid?}
D1 -->|Yes| F[Use Existing Profile]
D1 -->|No| E
E --> E1[crwl profiles]
E1 --> E2[Select Create New Profile]
E2 --> E3[Enter Profile Name]
E3 --> E4[Browser Opens for Auth]
E4 --> E5{Authentication Method?}
E5 -->|Login Form| E6[Fill Username/Password]
E5 -->|OAuth| E7[OAuth Flow]
E5 -->|2FA| E8[Handle 2FA]
E5 -->|Session Cookie| E9[Import Cookies]
E6 --> E10[Manual Login Process]
E7 --> E10
E8 --> E10
E9 --> E10
E10 --> E11[Verify Authentication]
E11 --> E12{Auth Successful?}
E12 -->|Yes| E13[Save Profile - Press q]
E12 -->|No| E10
E13 --> F
F --> G[Execute Authenticated Crawl]
G --> H[crwl URL -p profile-name]
H --> I[Load Profile Data]
I --> J[Launch Browser with Auth]
J --> K[Navigate to Protected Content]
K --> L[Extract Authenticated Data]
L --> M[Return Results]
style E4 fill:#fff3e0
style E10 fill:#e3f2fd
style F fill:#e8f5e8
style M fill:#c8e6c9
```
### CDP Browser Architecture
```mermaid
graph TB
subgraph "CLI Layer"
A[crwl cdp command] --> B[CDP Manager]
B --> C[Port Configuration]
B --> D[Profile Selection]
end
subgraph "Browser Process"
E[Chromium/Firefox] --> F[Remote Debugging]
F --> G[WebSocket Endpoint]
G --> H[ws://localhost:9222]
end
subgraph "Client Connections"
I[Manual Browser Control] --> H
J[DevTools Interface] --> H
K[External Automation] --> H
L[Crawl4AI Crawler] --> H
end
subgraph "Profile Data"
M[User Data Directory] --> E
N[Cookies & Sessions] --> M
O[Extensions] --> M
P[Browser State] --> M
end
A --> E
C --> H
D --> M
style H fill:#e3f2fd
style E fill:#f3e5f5
style M fill:#e8f5e8
```
### Configuration Management Hierarchy
```mermaid
graph TD
subgraph "Global Configuration"
A[~/.crawl4ai/config.yml] --> B[Default Settings]
B --> C[LLM Providers]
B --> D[Browser Defaults]
B --> E[Output Preferences]
end
subgraph "Profile Configuration"
F[Profile Directory] --> G[Browser State]
F --> H[Authentication Data]
F --> I[Site-Specific Settings]
end
subgraph "Command-Line Overrides"
J[-b browser_config] --> K[Runtime Browser Settings]
L[-c crawler_config] --> M[Runtime Crawler Settings]
N[-o output_format] --> O[Runtime Output Format]
end
subgraph "Configuration Files"
P[browser.yml] --> Q[Browser Config Template]
R[crawler.yml] --> S[Crawler Config Template]
T[extract.yml] --> U[Extraction Config]
end
subgraph "Resolution Order"
V[Command Line Args] --> W[Config Files]
W --> X[Profile Settings]
X --> Y[Global Defaults]
end
J --> V
L --> V
N --> V
P --> W
R --> W
T --> W
F --> X
A --> Y
style V fill:#ffcdd2
style W fill:#fff3e0
style X fill:#e3f2fd
style Y fill:#e8f5e8
```
### Identity-Based Crawling Decision Tree
```mermaid
flowchart TD
A[Target Website Assessment] --> B{Authentication Required?}
B -->|No| C[Standard Anonymous Crawl]
B -->|Yes| D{Authentication Type?}
D -->|Login Form| E[Create Login Profile]
D -->|OAuth/SSO| F[Create OAuth Profile]
D -->|API Key/Token| G[Use Headers/Config]
D -->|Session Cookies| H[Import Cookie Profile]
E --> E1[crwl profiles → Manual login]
F --> F1[crwl profiles → OAuth flow]
G --> G1[Configure headers in crawler config]
H --> H1[Import cookies to profile]
E1 --> I[Test Authentication]
F1 --> I
G1 --> I
H1 --> I
I --> J{Auth Test Success?}
J -->|Yes| K[Production Crawl Setup]
J -->|No| L[Debug Authentication]
L --> L1{Common Issues?}
L1 -->|Rate Limiting| L2[Add delays/user simulation]
L1 -->|Bot Detection| L3[Enable stealth mode]
L1 -->|Session Expired| L4[Refresh authentication]
L1 -->|CAPTCHA| L5[Manual intervention needed]
L2 --> M[Retry with Adjustments]
L3 --> M
L4 --> E1
L5 --> N[Semi-automated approach]
M --> I
N --> O[Manual auth + automated crawl]
K --> P[Automated Authenticated Crawling]
O --> P
C --> P
P --> Q[Monitor & Maintain Profiles]
style I fill:#fff3e0
style K fill:#e8f5e8
style P fill:#c8e6c9
style L fill:#ffcdd2
style N fill:#f3e5f5
```
### CLI Usage Patterns and Best Practices
```mermaid
timeline
title CLI Workflow Evolution
section Setup Phase
Installation : pip install crawl4ai
: crawl4ai-setup
Basic Test : crwl https://example.com
Config Setup : crwl config set defaults
section Profile Creation
Site Analysis : Identify auth requirements
Profile Creation : crwl profiles
Manual Login : Authenticate in browser
Profile Save : Press 'q' to save
section Development Phase
Test Crawls : crwl URL -p profile -v
Config Tuning : Adjust browser/crawler settings
Output Testing : Try different output formats
Error Handling : Debug authentication issues
section Production Phase
Automated Crawls : crwl URL -p profile -o json
Batch Processing : Multiple URLs with same profile
Monitoring : Check profile validity
Maintenance : Update profiles as needed
```
### Multi-Profile Management Strategy
```mermaid
graph LR
subgraph "Profile Categories"
A[Social Media Profiles]
B[Work/Enterprise Profiles]
C[E-commerce Profiles]
D[Research Profiles]
end
subgraph "Social Media"
A --> A1[linkedin-personal]
A --> A2[twitter-monitor]
A --> A3[facebook-research]
A --> A4[instagram-brand]
end
subgraph "Enterprise"
B --> B1[company-intranet]
B --> B2[github-enterprise]
B --> B3[confluence-docs]
B --> B4[jira-tickets]
end
subgraph "E-commerce"
C --> C1[amazon-seller]
C --> C2[shopify-admin]
C --> C3[ebay-monitor]
C --> C4[marketplace-competitor]
end
subgraph "Research"
D --> D1[academic-journals]
D --> D2[data-platforms]
D --> D3[survey-tools]
D --> D4[government-portals]
end
subgraph "Usage Patterns"
E[Daily Monitoring] --> A2
E --> B1
F[Weekly Reports] --> C3
F --> D2
G[On-Demand Research] --> D1
G --> D4
H[Competitive Analysis] --> C4
H --> A4
end
style A1 fill:#e3f2fd
style B1 fill:#f3e5f5
style C1 fill:#e8f5e8
style D1 fill:#fff3e0
```
**📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/)

File diff suppressed because it is too large

View File

@@ -0,0 +1,401 @@
## Deep Crawling Filters & Scorers Architecture
Visual representations of advanced URL filtering, scoring strategies, and performance optimization workflows for intelligent deep crawling.
### Filter Chain Processing Pipeline
```mermaid
flowchart TD
A[URL Input] --> B{Domain Filter}
B -->|✓ Pass| C{Pattern Filter}
B -->|✗ Fail| X1[Reject: Invalid Domain]
C -->|✓ Pass| D{Content Type Filter}
C -->|✗ Fail| X2[Reject: Pattern Mismatch]
D -->|✓ Pass| E{SEO Filter}
D -->|✗ Fail| X3[Reject: Wrong Content Type]
E -->|✓ Pass| F{Content Relevance Filter}
E -->|✗ Fail| X4[Reject: Low SEO Score]
F -->|✓ Pass| G[URL Accepted]
F -->|✗ Fail| X5[Reject: Low Relevance]
G --> H[Add to Crawl Queue]
subgraph "Fast Filters"
B
C
D
end
subgraph "Slow Filters"
E
F
end
style A fill:#e3f2fd
style G fill:#c8e6c9
style H fill:#e8f5e8
style X1 fill:#ffcdd2
style X2 fill:#ffcdd2
style X3 fill:#ffcdd2
style X4 fill:#ffcdd2
style X5 fill:#ffcdd2
```
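The pipeline above translates directly into a filter chain, with the cheap URL-level checks placed before the expensive content-based ones. A minimal sketch, assuming the five filter classes named in the diagram are importable from `crawl4ai.deep_crawling.filters`; the domain, patterns, and query values are illustrative, so verify names and parameters against your installed version:

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
    SEOFilter,
    ContentRelevanceFilter,
)

# Order matters: fast string checks first, slow content analysis last
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["docs.example.com"]),          # fast
    URLPatternFilter(patterns=["*/tutorial/*", "*/guide/*"]),    # fast
    ContentTypeFilter(allowed_types=["text/html"]),              # fast
    SEOFilter(threshold=0.5),                                    # slow
    ContentRelevanceFilter(query="web crawling", threshold=0.7), # slow
])
```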
### URL Scoring System Architecture
```mermaid
graph TB
subgraph "Input URL"
A[https://python.org/tutorial/2024/ml-guide.html]
end
subgraph "Individual Scorers"
B[Keyword Relevance Scorer]
C[Path Depth Scorer]
D[Content Type Scorer]
E[Freshness Scorer]
F[Domain Authority Scorer]
end
subgraph "Scoring Process"
B --> B1[Keywords: python, tutorial, ml<br/>Score: 0.85]
C --> C1[Depth: 4 levels<br/>Optimal: 3<br/>Score: 0.75]
D --> D1[Content: HTML<br/>Score: 1.0]
E --> E1[Year: 2024<br/>Score: 1.0]
F --> F1[Domain: python.org<br/>Score: 1.0]
end
subgraph "Composite Scoring"
G[Weighted Combination]
B1 --> G
C1 --> G
D1 --> G
E1 --> G
F1 --> G
end
subgraph "Final Result"
H[Composite Score: 0.92]
I{Score > Threshold?}
J[Accept URL]
K[Reject URL]
end
A --> B
A --> C
A --> D
A --> E
A --> F
G --> H
H --> I
I -->|✓ 0.92 > 0.6| J
I -->|✗ Score too low| K
style A fill:#e3f2fd
style G fill:#fff3e0
style H fill:#e8f5e8
style J fill:#c8e6c9
style K fill:#ffcdd2
```
### Filter vs Scorer Decision Matrix
```mermaid
flowchart TD
A[URL Processing Decision] --> B{Binary Decision Needed?}
B -->|Yes - Include/Exclude| C[Use Filters]
B -->|No - Quality Rating| D[Use Scorers]
C --> C1{Filter Type Needed?}
C1 -->|Domain Control| C2[DomainFilter]
C1 -->|Pattern Matching| C3[URLPatternFilter]
C1 -->|Content Type| C4[ContentTypeFilter]
C1 -->|SEO Quality| C5[SEOFilter]
C1 -->|Content Relevance| C6[ContentRelevanceFilter]
D --> D1{Scoring Criteria?}
D1 -->|Keyword Relevance| D2[KeywordRelevanceScorer]
D1 -->|URL Structure| D3[PathDepthScorer]
D1 -->|Content Quality| D4[ContentTypeScorer]
D1 -->|Time Sensitivity| D5[FreshnessScorer]
D1 -->|Source Authority| D6[DomainAuthorityScorer]
C2 --> E[Chain Filters]
C3 --> E
C4 --> E
C5 --> E
C6 --> E
D2 --> F[Composite Scorer]
D3 --> F
D4 --> F
D5 --> F
D6 --> F
E --> G[Binary Output: Pass/Fail]
F --> H[Numeric Score: 0.0-1.0]
G --> I[Apply to URL Queue]
H --> J[Priority Ranking]
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#f3e5f5
style F fill:#e3f2fd
style G fill:#c8e6c9
style H fill:#ffecb3
```
### Performance Optimization Strategy
```mermaid
sequenceDiagram
participant Queue as URL Queue
participant Fast as Fast Filters
participant Slow as Slow Filters
participant Score as Scorers
participant Output as Filtered URLs
Note over Queue, Output: Batch Processing (1000 URLs)
Queue->>Fast: Apply Domain Filter
Fast-->>Queue: 60% passed (600 URLs)
Queue->>Fast: Apply Pattern Filter
Fast-->>Queue: 70% passed (420 URLs)
Queue->>Fast: Apply Content Type Filter
Fast-->>Queue: 90% passed (378 URLs)
Note over Fast: Fast filters eliminate 62% of URLs
Queue->>Slow: Apply SEO Filter (378 URLs)
Slow-->>Queue: 80% passed (302 URLs)
Queue->>Slow: Apply Relevance Filter
Slow-->>Queue: 75% passed (227 URLs)
Note over Slow: Content analysis on remaining URLs
Queue->>Score: Calculate Composite Scores
Score-->>Queue: Scored and ranked
Queue->>Output: Top 100 URLs by score
Output-->>Queue: Processing complete
Note over Queue, Output: Total: 90% filtered out, 10% high-quality URLs retained
```
### Custom Filter Implementation Flow
```mermaid
stateDiagram-v2
[*] --> Planning
Planning --> IdentifyNeeds: Define filtering criteria
IdentifyNeeds --> ChooseType: Binary vs Scoring decision
ChooseType --> FilterImpl: Binary decision needed
ChooseType --> ScorerImpl: Quality rating needed
FilterImpl --> InheritURLFilter: Extend URLFilter base class
ScorerImpl --> InheritURLScorer: Extend URLScorer base class
InheritURLFilter --> ImplementApply: def apply(url) -> bool
InheritURLScorer --> ImplementScore: def _calculate_score(url) -> float
ImplementApply --> AddLogic: Add custom filtering logic
ImplementScore --> AddLogic
AddLogic --> TestFilter: Unit testing
TestFilter --> OptimizePerf: Performance optimization
OptimizePerf --> Integration: Integrate with FilterChain
Integration --> Production: Deploy to production
Production --> Monitor: Monitor performance
Monitor --> Tune: Tune parameters
Tune --> Production
note right of Planning : Consider performance impact
note right of AddLogic : Handle edge cases
note right of OptimizePerf : Cache frequently accessed data
```
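Following the flow above, a custom binary filter extends the `URLFilter` base class and implements `apply(url) -> bool`, as labeled in the diagram. A sketch under those assumptions; the class name and blocked path segments are hypothetical:

```python
from urllib.parse import urlparse

from crawl4ai.deep_crawling.filters import URLFilter

class ArchiveExclusionFilter(URLFilter):
    """Hypothetical custom filter: reject archive/legacy sections."""

    def __init__(self, blocked_segments=("archive", "legacy")):
        super().__init__()
        self.blocked_segments = set(blocked_segments)

    def apply(self, url: str) -> bool:
        # Binary decision: True = keep the URL, False = reject it
        path_parts = urlparse(url).path.strip("/").split("/")
        return not self.blocked_segments.intersection(path_parts)
```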
### Filter Chain Optimization Patterns
```mermaid
graph TB
subgraph "Naive Approach - Poor Performance"
A1[All URLs] --> B1[Slow Filter 1]
B1 --> C1[Slow Filter 2]
C1 --> D1[Fast Filter 1]
D1 --> E1[Fast Filter 2]
E1 --> F1[Final Results]
B1 -.->|High CPU| G1[Performance Issues]
C1 -.->|Network Calls| G1
end
subgraph "Optimized Approach - High Performance"
A2[All URLs] --> B2[Fast Filter 1]
B2 --> C2[Fast Filter 2]
C2 --> D2[Batch Process]
D2 --> E2[Slow Filter 1]
E2 --> F2[Slow Filter 2]
F2 --> G2[Final Results]
D2 --> H2[Concurrent Processing]
H2 --> I2[Semaphore Control]
end
subgraph "Performance Metrics"
J[Processing Time]
K[Memory Usage]
L[CPU Utilization]
M[Network Requests]
end
G1 -.-> J
G1 -.-> K
G1 -.-> L
G1 -.-> M
G2 -.-> J
G2 -.-> K
G2 -.-> L
G2 -.-> M
style A1 fill:#ffcdd2
style G1 fill:#ffcdd2
style A2 fill:#c8e6c9
style G2 fill:#c8e6c9
style H2 fill:#e8f5e8
style I2 fill:#e8f5e8
```
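The "batch + semaphore" branch of the optimized pipeline can be expressed with plain `asyncio`. A sketch that assumes the slow filters expose an awaitable `apply()`; adapt the call if your filters are synchronous:

```python
import asyncio

async def apply_slow_filters(urls, slow_filters, max_concurrency=10):
    """Run expensive filters concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check(url):
        async with semaphore:
            for f in slow_filters:
                if not await f.apply(url):  # assumed async apply()
                    return None
            return url

    results = await asyncio.gather(*(check(u) for u in urls))
    return [u for u in results if u is not None]
```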
### Composite Scoring Weight Distribution
```mermaid
pie title Composite Scorer Weight Distribution
"Keyword Relevance (30%)" : 30
"Domain Authority (25%)" : 25
"Content Type (20%)" : 20
"Freshness (15%)" : 15
"Path Depth (10%)" : 10
```
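In plain Python, the weighted combination behind this chart reduces to a dot product of weights and per-dimension scores. A sketch using the pie-chart weights and the example scores from the scoring-architecture diagram above; the function and dictionary names are illustrative:

```python
# Weights taken from the pie chart above
WEIGHTS = {
    "keyword_relevance": 0.30,
    "domain_authority": 0.25,
    "content_type": 0.20,
    "freshness": 0.15,
    "path_depth": 0.10,
}

def composite_score(scores):
    """Weighted sum of per-dimension scores, each in [0.0, 1.0]."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

# Per-dimension scores from the python.org example in the diagram above
example = {
    "keyword_relevance": 0.85,
    "path_depth": 0.75,
    "content_type": 1.0,
    "freshness": 1.0,
    "domain_authority": 1.0,
}
# 0.93 under these weights; the architecture diagram shows 0.92,
# so its internal weighting differs slightly
print(round(composite_score(example), 2))
```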
### Deep Crawl Integration Architecture
```mermaid
graph TD
subgraph "Deep Crawl Strategy"
A[Start URL] --> B[Extract Links]
B --> C[Apply Filter Chain]
C --> D[Calculate Scores]
D --> E[Priority Queue]
E --> F[Crawl Next URL]
F --> B
end
subgraph "Filter Chain Components"
C --> C1[Domain Filter]
C --> C2[Pattern Filter]
C --> C3[Content Filter]
C --> C4[SEO Filter]
C --> C5[Relevance Filter]
end
subgraph "Scoring Components"
D --> D1[Keyword Scorer]
D --> D2[Depth Scorer]
D --> D3[Freshness Scorer]
D --> D4[Authority Scorer]
D --> D5[Composite Score]
end
subgraph "Queue Management"
E --> E1{Score > Threshold?}
E1 -->|Yes| E2[High Priority Queue]
E1 -->|No| E3[Low Priority Queue]
E2 --> F
E3 --> G[Delayed Processing]
end
subgraph "Control Flow"
H{Max Depth Reached?}
I{Max Pages Reached?}
J[Stop Crawling]
end
F --> H
H -->|No| I
H -->|Yes| J
I -->|No| B
I -->|Yes| J
style A fill:#e3f2fd
style E2 fill:#c8e6c9
style E3 fill:#fff3e0
style J fill:#ffcdd2
```
### Filter Performance Comparison
```mermaid
xychart-beta
title "Filter Performance Comparison (1000 URLs)"
x-axis [Domain, Pattern, ContentType, SEO, Relevance]
y-axis "Processing Time (ms)" 0 --> 1000
bar [50, 80, 45, 300, 800]
```
### Scoring Algorithm Workflow
```mermaid
flowchart TD
A[Input URL] --> B[Parse URL Components]
B --> C[Extract Features]
C --> D[Domain Analysis]
C --> E[Path Analysis]
C --> F[Content Type Detection]
C --> G[Keyword Extraction]
C --> H[Freshness Detection]
D --> I[Domain Authority Score]
E --> J[Path Depth Score]
F --> K[Content Type Score]
G --> L[Keyword Relevance Score]
H --> M[Freshness Score]
I --> N[Apply Weights]
J --> N
K --> N
L --> N
M --> N
N --> O[Normalize Scores]
O --> P[Calculate Final Score]
P --> Q{Score >= Threshold?}
Q -->|Yes| R[Accept for Crawling]
Q -->|No| S[Reject URL]
R --> T[Add to Priority Queue]
S --> U[Log Rejection Reason]
style A fill:#e3f2fd
style P fill:#fff3e0
style R fill:#c8e6c9
style S fill:#ffcdd2
style T fill:#e8f5e8
```
**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/), [Custom Implementations](https://docs.crawl4ai.com/advanced/custom-filters/)

View File

@@ -0,0 +1,428 @@
## Deep Crawling Workflows and Architecture
Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.
### Deep Crawl Strategy Overview
```mermaid
flowchart TD
A[Start Deep Crawl] --> B{Strategy Selection}
B -->|Explore All Levels| C[BFS Strategy]
B -->|Dive Deep Fast| D[DFS Strategy]
B -->|Smart Prioritization| E[Best-First Strategy]
C --> C1[Breadth-First Search]
C1 --> C2[Process all depth 0 links]
C2 --> C3[Process all depth 1 links]
C3 --> C4[Continue by depth level]
D --> D1[Depth-First Search]
D1 --> D2[Follow first link deeply]
D2 --> D3[Backtrack when max depth reached]
D3 --> D4[Continue with next branch]
E --> E1[Best-First Search]
E1 --> E2[Score all discovered URLs]
E2 --> E3[Process highest scoring URLs first]
E3 --> E4[Continuously re-prioritize queue]
C4 --> F[Apply Filters]
D4 --> F
E4 --> F
F --> G{Filter Chain Processing}
G -->|Domain Filter| G1[Check allowed/blocked domains]
G -->|URL Pattern Filter| G2[Match URL patterns]
G -->|Content Type Filter| G3[Verify content types]
G -->|SEO Filter| G4[Evaluate SEO quality]
G -->|Content Relevance| G5[Score content relevance]
G1 --> H{Passed All Filters?}
G2 --> H
G3 --> H
G4 --> H
G5 --> H
H -->|Yes| I[Add to Crawl Queue]
H -->|No| J[Discard URL]
I --> K{Processing Mode}
K -->|Streaming| L[Process Immediately]
K -->|Batch| M[Collect All Results]
L --> N[Stream Result to User]
M --> O[Return Complete Result Set]
J --> P{More URLs in Queue?}
N --> P
O --> P
P -->|Yes| Q{Within Limits?}
P -->|No| R[Deep Crawl Complete]
Q -->|Max Depth OK| S{Max Pages OK}
Q -->|Max Depth Exceeded| T[Skip Deeper URLs]
S -->|Under Limit| U[Continue Crawling]
S -->|Limit Reached| R
T --> P
U --> F
style A fill:#e1f5fe
style R fill:#c8e6c9
style C fill:#fff3e0
style D fill:#f3e5f5
style E fill:#e8f5e8
```
### Deep Crawl Strategy Comparison
```mermaid
graph TB
subgraph "BFS - Breadth-First Search"
BFS1[Level 0: Start URL]
BFS2[Level 1: All direct links]
BFS3[Level 2: All second-level links]
BFS4[Level 3: All third-level links]
BFS1 --> BFS2
BFS2 --> BFS3
BFS3 --> BFS4
BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
end
subgraph "DFS - Depth-First Search"
DFS1[Start URL]
DFS2[First Link → Deep]
DFS3[Follow until max depth]
DFS4[Backtrack and try next]
DFS1 --> DFS2
DFS2 --> DFS3
DFS3 --> DFS4
DFS4 --> DFS2
DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
end
subgraph "Best-First - Priority Queue"
BF1[Start URL]
BF2[Score all discovered links]
BF3[Process highest scoring first]
BF4[Continuously re-prioritize]
BF1 --> BF2
BF2 --> BF3
BF3 --> BF4
BF4 --> BF2
BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
end
style BFS1 fill:#e3f2fd
style DFS1 fill:#f3e5f5
style BF1 fill:#e8f5e8
style BFS_NOTE fill:#fff3e0
style DFS_NOTE fill:#fff3e0
style BF_NOTE fill:#fff3e0
```
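To make the trade-offs concrete, the three strategies are typically instantiated with depth and page limits, and Best-First additionally takes a scorer. A sketch assuming the strategy and scorer classes are importable as named below; check your crawl4ai version for exact signatures, and note the keyword list is illustrative:

```python
from crawl4ai.deep_crawling import (
    BFSDeepCrawlStrategy,
    DFSDeepCrawlStrategy,
    BestFirstCrawlingStrategy,
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Site mapping: finish each depth level before going deeper
bfs = BFSDeepCrawlStrategy(max_depth=2, max_pages=50)

# Memory-light exploration that follows each branch to max depth
dfs = DFSDeepCrawlStrategy(max_depth=3, max_pages=50)

# Recommended default: crawl the most promising links first
best_first = BestFirstCrawlingStrategy(
    max_depth=2,
    max_pages=50,
    url_scorer=KeywordRelevanceScorer(keywords=["tutorial", "guide"]),
)
```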
### Filter Chain Processing Sequence
```mermaid
sequenceDiagram
participant URL as Discovered URL
participant Chain as Filter Chain
participant Domain as Domain Filter
participant Pattern as URL Pattern Filter
participant Content as Content Type Filter
participant SEO as SEO Filter
participant Relevance as Content Relevance Filter
participant Queue as Crawl Queue
URL->>Chain: Process URL
Chain->>Domain: Check domain rules
alt Domain Allowed
Domain-->>Chain: ✓ Pass
Chain->>Pattern: Check URL patterns
alt Pattern Matches
Pattern-->>Chain: ✓ Pass
Chain->>Content: Check content type
alt Content Type Valid
Content-->>Chain: ✓ Pass
Chain->>SEO: Evaluate SEO quality
alt SEO Score Above Threshold
SEO-->>Chain: ✓ Pass
Chain->>Relevance: Score content relevance
alt Relevance Score High
Relevance-->>Chain: ✓ Pass
Chain->>Queue: Add to crawl queue
Queue-->>URL: Queued for crawling
else Relevance Score Low
Relevance-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Low relevance
end
else SEO Score Low
SEO-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Poor SEO
end
else Invalid Content Type
Content-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Wrong content type
end
else Pattern Mismatch
Pattern-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Pattern mismatch
end
else Domain Blocked
Domain-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Blocked domain
end
```
### URL Lifecycle State Machine
```mermaid
stateDiagram-v2
[*] --> Discovered: Found on page
Discovered --> FilterPending: Enter filter chain
FilterPending --> DomainCheck: Apply domain filter
DomainCheck --> PatternCheck: Domain allowed
DomainCheck --> Rejected: Domain blocked
PatternCheck --> ContentCheck: Pattern matches
PatternCheck --> Rejected: Pattern mismatch
ContentCheck --> SEOCheck: Content type valid
ContentCheck --> Rejected: Invalid content
SEOCheck --> RelevanceCheck: SEO score sufficient
SEOCheck --> Rejected: Poor SEO score
RelevanceCheck --> Scored: Relevance score calculated
RelevanceCheck --> Rejected: Low relevance
Scored --> Queued: Added to priority queue
Queued --> Crawling: Selected for processing
Crawling --> Success: Page crawled successfully
Crawling --> Failed: Crawl failed
Success --> LinkExtraction: Extract new links
LinkExtraction --> [*]: Process complete
Failed --> [*]: Record failure
Rejected --> [*]: Log rejection reason
note right of Scored : Score determines priority<br/>in Best-First strategy
note right of Failed : Errors logged with<br/>depth and reason
```
### Streaming vs Batch Processing Architecture
```mermaid
graph TB
subgraph "Input"
A[Start URL] --> B[Deep Crawl Strategy]
end
subgraph "Crawl Engine"
B --> C[URL Discovery]
C --> D[Filter Chain]
D --> E[Priority Queue]
E --> F[Page Processor]
end
subgraph "Streaming Mode stream=True"
F --> G1[Process Page]
G1 --> H1[Extract Content]
H1 --> I1[Yield Result Immediately]
I1 --> J1[async for result]
J1 --> K1[Real-time Processing]
G1 --> L1[Extract Links]
L1 --> M1[Add to Queue]
M1 --> F
end
subgraph "Batch Mode stream=False"
F --> G2[Process Page]
G2 --> H2[Extract Content]
H2 --> I2[Store Result]
I2 --> N2[Result Collection]
G2 --> L2[Extract Links]
L2 --> M2[Add to Queue]
M2 --> O2{More URLs?}
O2 -->|Yes| F
O2 -->|No| P2[Return All Results]
P2 --> Q2[Batch Processing]
end
style I1 fill:#e8f5e8
style K1 fill:#e8f5e8
style P2 fill:#e3f2fd
style Q2 fill:#e3f2fd
```
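The two modes differ only in the `stream` flag and in how results are consumed. A sketch assuming crawl4ai's async API as labeled in the diagram, where `stream=True` yields results through an async iterator and `stream=False` returns the collected result set:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=50)

    async with AsyncWebCrawler() as crawler:
        # Streaming: process each page as soon as it is crawled
        stream_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
        async for result in await crawler.arun("https://example.com", config=stream_cfg):
            print(result.url, result.success)

        # Batch: wait for the complete result set
        batch_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)
        results = await crawler.arun("https://example.com", config=batch_cfg)
        print(f"{len(results)} pages crawled")

asyncio.run(main())
```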
### Advanced Scoring and Prioritization System
```mermaid
flowchart LR
subgraph "URL Discovery"
A[Page Links] --> B[Extract URLs]
B --> C[Normalize URLs]
end
subgraph "Scoring System"
C --> D[Keyword Relevance Scorer]
D --> D1[URL Text Analysis]
D --> D2[Keyword Matching]
D --> D3[Calculate Base Score]
D3 --> E[Additional Scoring Factors]
E --> E1[URL Structure weight: 0.2]
E --> E2[Link Context weight: 0.3]
E --> E3[Page Depth Penalty weight: 0.1]
E --> E4[Domain Authority weight: 0.4]
D1 --> F[Combined Score]
D2 --> F
D3 --> F
E1 --> F
E2 --> F
E3 --> F
E4 --> F
end
subgraph "Prioritization"
F --> G{Score Threshold}
G -->|Above Threshold| H[Priority Queue]
G -->|Below Threshold| I[Discard URL]
H --> J[Best-First Selection]
J --> K[Highest Score First]
K --> L[Process Page]
L --> M[Update Scores]
M --> N[Re-prioritize Queue]
N --> J
end
style F fill:#fff3e0
style H fill:#e8f5e8
style L fill:#e3f2fd
```
### Deep Crawl Performance and Limits
```mermaid
graph TD
subgraph "Crawl Constraints"
A[Max Depth: 2] --> A1[Prevents infinite crawling]
B[Max Pages: 50] --> B1[Controls resource usage]
C[Score Threshold: 0.3] --> C1[Quality filtering]
D[Domain Limits] --> D1[Scope control]
end
subgraph "Performance Monitoring"
E[Pages Crawled] --> F[Depth Distribution]
E --> G[Success Rate]
E --> H[Average Score]
E --> I[Processing Time]
F --> J[Performance Report]
G --> J
H --> J
I --> J
end
subgraph "Resource Management"
K[Memory Usage] --> L{Memory Threshold}
L -->|Under Limit| M[Continue Crawling]
L -->|Over Limit| N[Reduce Concurrency]
O[CPU Usage] --> P{CPU Threshold}
P -->|Normal| M
P -->|High| Q[Add Delays]
R[Network Load] --> S{Rate Limits}
S -->|OK| M
S -->|Exceeded| T[Throttle Requests]
end
M --> U[Optimal Performance]
N --> V[Reduced Performance]
Q --> V
T --> V
style U fill:#c8e6c9
style V fill:#fff3e0
style J fill:#e3f2fd
```
### Error Handling and Recovery Flow
```mermaid
sequenceDiagram
participant Strategy as Deep Crawl Strategy
participant Queue as Priority Queue
participant Crawler as Page Crawler
participant Error as Error Handler
participant Result as Result Collector
Strategy->>Queue: Get next URL
Queue-->>Strategy: Return highest priority URL
Strategy->>Crawler: Crawl page
alt Successful Crawl
Crawler-->>Strategy: Return page content
Strategy->>Result: Store successful result
Strategy->>Strategy: Extract new links
Strategy->>Queue: Add new URLs to queue
else Network Error
Crawler-->>Error: Network timeout/failure
Error->>Error: Log error with details
Error->>Queue: Mark URL as failed
Error-->>Strategy: Skip to next URL
else Parse Error
Crawler-->>Error: HTML parsing failed
Error->>Error: Log parse error
Error->>Result: Store failed result
Error-->>Strategy: Continue with next URL
else Rate Limit Hit
Crawler-->>Error: Rate limit exceeded
Error->>Error: Apply backoff strategy
Error->>Queue: Re-queue URL with delay
Error-->>Strategy: Wait before retry
else Depth Limit
Strategy->>Strategy: Check depth constraint
Strategy-->>Queue: Skip URL - too deep
else Page Limit
Strategy->>Strategy: Check page count
Strategy-->>Result: Stop crawling - limit reached
end
Strategy->>Queue: Request next URL
Queue-->>Strategy: More URLs available?
alt Queue Empty
Queue-->>Result: Crawl complete
else Queue Has URLs
Queue-->>Strategy: Continue crawling
end
```
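On the caller's side, failures surface as unsuccessful results rather than exceptions, so a processing loop can log depth and reason as in the diagram. A sketch (the `depth` metadata key follows the deep-crawl docs):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1), stream=False
    )
    async with AsyncWebCrawler() as crawler:
        for result in await crawler.arun("https://docs.example.com", config=config):
            depth = result.metadata.get("depth", 0)
            if result.success:
                print(f"ok    depth={depth} {result.url}")
            else:
                # error_message carries the network/parse/limit reason
                print(f"fail  depth={depth} {result.url}: {result.error_message}")

asyncio.run(main())
```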
**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)

View File

@@ -0,0 +1,603 @@
## Docker Deployment Architecture and Workflows
Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions.
### Docker Deployment Decision Flow
```mermaid
flowchart TD
A[Start Docker Deployment] --> B{Deployment Type?}
B -->|Quick Start| C[Pre-built Image]
B -->|Development| D[Docker Compose]
B -->|Custom Build| E[Manual Build]
B -->|Production| F[Production Setup]
C --> C1[docker pull unclecode/crawl4ai]
C1 --> C2{Need LLM Support?}
C2 -->|Yes| C3[Setup .llm.env]
C2 -->|No| C4[Basic run]
C3 --> C5[docker run with --env-file]
C4 --> C6[docker run basic]
D --> D1[git clone repository]
D1 --> D2[cp .llm.env.example .llm.env]
D2 --> D3{Build Type?}
D3 -->|Pre-built| D4[IMAGE=latest docker compose up]
D3 -->|Local Build| D5[docker compose up --build]
D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up]
E --> E1[docker buildx build]
E1 --> E2{Architecture?}
E2 -->|Single| E3[--platform linux/amd64]
E2 -->|Multi| E4[--platform linux/amd64,linux/arm64]
E3 --> E5[Build complete]
E4 --> E5
F --> F1[Production configuration]
F1 --> F2[Custom config.yml]
F2 --> F3[Resource limits]
F3 --> F4[Health monitoring]
F4 --> F5[Production ready]
C5 --> G[Service running on :11235]
C6 --> G
D4 --> G
D5 --> G
D6 --> G
E5 --> H[docker run custom image]
H --> G
F5 --> I[Production deployment]
G --> J[Access playground at /playground]
G --> K[Health check at /health]
I --> L[Production monitoring]
style A fill:#e1f5fe
style G fill:#c8e6c9
style I fill:#c8e6c9
style J fill:#fff3e0
style K fill:#fff3e0
style L fill:#e8f5e8
```
### Docker Container Architecture
```mermaid
graph TB
subgraph "Host Environment"
A[Docker Engine] --> B[Crawl4AI Container]
C[.llm.env] --> B
D[Custom config.yml] --> B
E[Port 11235] --> B
F[Shared Memory 1GB+] --> B
end
subgraph "Container Services"
B --> G[FastAPI Server :8020]
B --> H[Gunicorn WSGI]
B --> I[Supervisord Process Manager]
B --> J[Redis Cache :6379]
G --> K[REST API Endpoints]
G --> L[WebSocket Connections]
G --> M[MCP Protocol]
H --> N[Worker Processes]
I --> O[Service Monitoring]
J --> P[Request Caching]
end
subgraph "Browser Management"
B --> Q[Playwright Framework]
Q --> R[Chromium Browser]
Q --> S[Firefox Browser]
Q --> T[WebKit Browser]
R --> U[Browser Pool]
S --> U
T --> U
U --> V[Page Sessions]
U --> W[Context Management]
end
subgraph "External Services"
X[OpenAI API] -.-> K
Y[Anthropic Claude] -.-> K
Z[Local Ollama] -.-> K
AA[Groq API] -.-> K
BB[Google Gemini] -.-> K
end
subgraph "Client Interactions"
CC[Python SDK] --> K
DD[REST API Calls] --> K
EE[MCP Clients] --> M
FF[Web Browser] --> G
GG[Monitoring Tools] --> K
end
style B fill:#e3f2fd
style G fill:#f3e5f5
style Q fill:#e8f5e8
style K fill:#fff3e0
```
### API Endpoints Architecture
```mermaid
graph LR
subgraph "Core Endpoints"
A["/crawl"] --> A1[Single URL crawl]
A2["/crawl/stream"] --> A3[Streaming multi-URL]
A4["/crawl/job"] --> A5[Async job submission]
A6["/crawl/job/{id}"] --> A7[Job status check]
end
subgraph "Specialized Endpoints"
B["/html"] --> B1[Preprocessed HTML]
B2["/screenshot"] --> B3[PNG capture]
B4["/pdf"] --> B5[PDF generation]
B6["/execute_js"] --> B7[JavaScript execution]
B8["/md"] --> B9[Markdown extraction]
end
subgraph "Utility Endpoints"
C["/health"] --> C1[Service status]
C2["/metrics"] --> C3[Prometheus metrics]
C4["/schema"] --> C5[API documentation]
C6["/playground"] --> C7[Interactive testing]
end
subgraph "LLM Integration"
D["/llm/{url}"] --> D1[Q&A over URL]
D2["/ask"] --> D3[Library context search]
D4["/config/dump"] --> D5[Config validation]
end
subgraph "MCP Protocol"
E["/mcp/sse"] --> E1[Server-Sent Events]
E2["/mcp/ws"] --> E3[WebSocket connection]
E4["/mcp/schema"] --> E5[MCP tool definitions]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#fce4ec
```
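Calling the core endpoints from Python is a plain HTTP exchange. A sketch using `requests`; the `type`/`params` payload shape and the response structure follow the Docker API docs and should be treated as assumptions:
```python
import requests

BASE = "http://localhost:11235"

payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}},
}

# Single-shot crawl
resp = requests.post(f"{BASE}/crawl", json=payload, timeout=120)
resp.raise_for_status()
data = resp.json()  # shape assumed: {"success": bool, "results": [...]}
print(data.get("success"), len(data.get("results", [])))

# Utility endpoint
print(requests.get(f"{BASE}/health").json())
```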
### Request Processing Flow
```mermaid
sequenceDiagram
participant Client
participant FastAPI
participant RequestValidator
participant BrowserPool
participant Playwright
participant ExtractionEngine
participant LLMProvider
Client->>FastAPI: POST /crawl with config
FastAPI->>RequestValidator: Validate JSON structure
alt Valid Request
RequestValidator-->>FastAPI: ✓ Validated
FastAPI->>BrowserPool: Request browser instance
BrowserPool->>Playwright: Launch browser/reuse session
Playwright-->>BrowserPool: Browser ready
BrowserPool-->>FastAPI: Browser allocated
FastAPI->>Playwright: Navigate to URL
Playwright->>Playwright: Execute JS, wait conditions
Playwright-->>FastAPI: Page content ready
FastAPI->>ExtractionEngine: Process content
alt LLM Extraction
ExtractionEngine->>LLMProvider: Send content + schema
LLMProvider-->>ExtractionEngine: Structured data
else CSS Extraction
ExtractionEngine->>ExtractionEngine: Apply CSS selectors
end
ExtractionEngine-->>FastAPI: Extraction complete
FastAPI->>BrowserPool: Release browser
FastAPI-->>Client: CrawlResult response
else Invalid Request
RequestValidator-->>FastAPI: ✗ Validation error
FastAPI-->>Client: 400 Bad Request
end
```
### Configuration Management Flow
```mermaid
stateDiagram-v2
[*] --> ConfigLoading
ConfigLoading --> DefaultConfig: Load default config.yml
ConfigLoading --> CustomConfig: Custom config mounted
ConfigLoading --> EnvOverrides: Environment variables
DefaultConfig --> ConfigMerging
CustomConfig --> ConfigMerging
EnvOverrides --> ConfigMerging
ConfigMerging --> ConfigValidation
ConfigValidation --> Valid: Schema validation passes
ConfigValidation --> Invalid: Validation errors
Invalid --> ConfigError: Log errors and exit
ConfigError --> [*]
Valid --> ServiceInitialization
ServiceInitialization --> FastAPISetup
ServiceInitialization --> BrowserPoolInit
ServiceInitialization --> CacheSetup
FastAPISetup --> Running
BrowserPoolInit --> Running
CacheSetup --> Running
Running --> ConfigReload: Config change detected
ConfigReload --> ConfigValidation
Running --> [*]: Service shutdown
note right of ConfigMerging : Priority: ENV > Custom > Default
note right of ServiceInitialization : All services must initialize successfully
```
### Multi-Architecture Build Process
```mermaid
flowchart TD
A[Developer Push] --> B[GitHub Repository]
B --> C[Docker Buildx]
C --> D{Build Strategy}
D -->|Multi-arch| E[Parallel Builds]
D -->|Single-arch| F[Platform-specific Build]
E --> G[AMD64 Build]
E --> H[ARM64 Build]
F --> I[Target Platform Build]
subgraph "AMD64 Build Process"
G --> G1[Ubuntu base image]
G1 --> G2[Python 3.11 install]
G2 --> G3[System dependencies]
G3 --> G4[Crawl4AI installation]
G4 --> G5[Playwright setup]
G5 --> G6[FastAPI configuration]
G6 --> G7[AMD64 image ready]
end
subgraph "ARM64 Build Process"
H --> H1[Ubuntu ARM64 base]
H1 --> H2[Python 3.11 install]
H2 --> H3[ARM-specific deps]
H3 --> H4[Crawl4AI installation]
H4 --> H5[Playwright setup]
H5 --> H6[FastAPI configuration]
H6 --> H7[ARM64 image ready]
end
subgraph "Single Architecture"
I --> I1[Base image selection]
I1 --> I2[Platform dependencies]
I2 --> I3[Application setup]
I3 --> I4[Platform image ready]
end
G7 --> J[Multi-arch Manifest]
H7 --> J
I4 --> K[Platform Image]
J --> L[Docker Hub Registry]
K --> L
L --> M[Pull Request Auto-selects Architecture]
style A fill:#e1f5fe
style J fill:#c8e6c9
style K fill:#c8e6c9
style L fill:#f3e5f5
style M fill:#e8f5e8
```
### MCP Integration Architecture
```mermaid
graph TB
subgraph "MCP Client Applications"
A[Claude Code] --> B[MCP Protocol]
C[Cursor IDE] --> B
D[Windsurf] --> B
E[Custom MCP Client] --> B
end
subgraph "Crawl4AI MCP Server"
B --> F[MCP Endpoint Router]
F --> G[SSE Transport /mcp/sse]
F --> H[WebSocket Transport /mcp/ws]
F --> I[Schema Endpoint /mcp/schema]
G --> J[MCP Tool Handler]
H --> J
J --> K[Tool: md]
J --> L[Tool: html]
J --> M[Tool: screenshot]
J --> N[Tool: pdf]
J --> O[Tool: execute_js]
J --> P[Tool: crawl]
J --> Q[Tool: ask]
end
subgraph "Crawl4AI Core Services"
K --> R[Markdown Generator]
L --> S[HTML Preprocessor]
M --> T[Screenshot Service]
N --> U[PDF Generator]
O --> V[JavaScript Executor]
P --> W[Batch Crawler]
Q --> X[Context Search]
R --> Y[Browser Pool]
S --> Y
T --> Y
U --> Y
V --> Y
W --> Y
X --> Z[Knowledge Base]
end
subgraph "External Resources"
Y --> AA[Playwright Browsers]
Z --> BB[Library Documentation]
Z --> CC[Code Examples]
AA --> DD[Web Pages]
end
style B fill:#e3f2fd
style J fill:#f3e5f5
style Y fill:#e8f5e8
style Z fill:#fff3e0
```
### API Request/Response Flow Patterns
```mermaid
sequenceDiagram
participant Client
participant LoadBalancer
participant FastAPI
participant ConfigValidator
participant BrowserManager
participant CrawlEngine
participant ResponseBuilder
Note over Client,ResponseBuilder: Basic Crawl Request
Client->>LoadBalancer: POST /crawl
LoadBalancer->>FastAPI: Route request
FastAPI->>ConfigValidator: Validate browser_config
ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig
FastAPI->>ConfigValidator: Validate crawler_config
ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig
FastAPI->>BrowserManager: Allocate browser
BrowserManager-->>FastAPI: Browser instance
FastAPI->>CrawlEngine: Execute crawl
Note over CrawlEngine: Page processing
CrawlEngine->>CrawlEngine: Navigate & wait
CrawlEngine->>CrawlEngine: Extract content
CrawlEngine->>CrawlEngine: Apply strategies
CrawlEngine-->>FastAPI: CrawlResult
FastAPI->>ResponseBuilder: Format response
ResponseBuilder-->>FastAPI: JSON response
FastAPI->>BrowserManager: Release browser
FastAPI-->>LoadBalancer: Response ready
LoadBalancer-->>Client: 200 OK + CrawlResult
Note over Client,ResponseBuilder: Streaming Request
Client->>FastAPI: POST /crawl/stream
FastAPI-->>Client: 200 OK (stream start)
loop For each URL
FastAPI->>CrawlEngine: Process URL
CrawlEngine-->>FastAPI: Result ready
FastAPI-->>Client: NDJSON line
end
FastAPI-->>Client: Stream completed
```
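The streaming endpoint emits NDJSON, one result object per line. A sketch with `httpx`; the line-per-result framing is assumed from the docs:
```python
import json
import httpx

payload = {
    "urls": ["https://example.com", "https://example.org"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": True}},
}

with httpx.stream("POST", "http://localhost:11235/crawl/stream",
                  json=payload, timeout=None) as response:
    for line in response.iter_lines():
        if not line:
            continue
        result = json.loads(line)  # one CrawlResult per NDJSON line
        print(result.get("url"), result.get("success"))
```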
### Configuration Validation Workflow
```mermaid
flowchart TD
A[Client Request] --> B[JSON Payload]
B --> C{Pre-validation}
C -->|✓ Valid JSON| D[Extract Configurations]
C -->|✗ Invalid JSON| E[Return 400 Bad Request]
D --> F[BrowserConfig Validation]
D --> G[CrawlerRunConfig Validation]
F --> H{BrowserConfig Valid?}
G --> I{CrawlerRunConfig Valid?}
H -->|✓ Valid| J[Browser Setup]
H -->|✗ Invalid| K[Log Browser Config Errors]
I -->|✓ Valid| L[Crawler Setup]
I -->|✗ Invalid| M[Log Crawler Config Errors]
K --> N[Collect All Errors]
M --> N
N --> O[Return 422 Validation Error]
J --> P{Both Configs Valid?}
L --> P
P -->|✓ Yes| Q[Proceed to Crawling]
P -->|✗ No| O
Q --> R[Execute Crawl Pipeline]
R --> S[Return CrawlResult]
E --> T[Client Error Response]
O --> T
S --> U[Client Success Response]
style A fill:#e1f5fe
style Q fill:#c8e6c9
style S fill:#c8e6c9
style U fill:#c8e6c9
style E fill:#ffcdd2
style O fill:#ffcdd2
style T fill:#ffcdd2
```
### Production Deployment Architecture
```mermaid
graph TB
subgraph "Load Balancer Layer"
A[NGINX/HAProxy] --> B[Health Check]
A --> C[Request Routing]
A --> D[SSL Termination]
end
subgraph "Application Layer"
C --> E[Crawl4AI Instance 1]
C --> F[Crawl4AI Instance 2]
C --> G[Crawl4AI Instance N]
E --> H[FastAPI Server]
F --> I[FastAPI Server]
G --> J[FastAPI Server]
H --> K[Browser Pool 1]
I --> L[Browser Pool 2]
J --> M[Browser Pool N]
end
subgraph "Shared Services"
N[Redis Cluster] --> E
N --> F
N --> G
O[Monitoring Stack] --> P[Prometheus]
O --> Q[Grafana]
O --> R[AlertManager]
P --> E
P --> F
P --> G
end
subgraph "External Dependencies"
S[OpenAI API] -.-> H
T[Anthropic API] -.-> I
U[Local LLM Cluster] -.-> J
end
subgraph "Persistent Storage"
V[Configuration Volume] --> E
V --> F
V --> G
W[Cache Volume] --> N
X[Logs Volume] --> O
end
style A fill:#e3f2fd
style E fill:#f3e5f5
style F fill:#f3e5f5
style G fill:#f3e5f5
style N fill:#e8f5e8
style O fill:#fff3e0
```
### Docker Resource Management
```mermaid
graph TD
subgraph "Resource Allocation"
A[Host Resources] --> B[CPU Cores]
A --> C[Memory GB]
A --> D[Disk Space]
A --> E[Network Bandwidth]
B --> F[Container Limits]
C --> F
D --> F
E --> F
end
subgraph "Container Configuration"
F --> G[--cpus=4]
F --> H[--memory=8g]
F --> I[--shm-size=2g]
F --> J[Volume Mounts]
G --> K[Browser Processes]
H --> L[Browser Memory]
I --> M[Shared Memory for Browsers]
J --> N[Config & Cache Storage]
end
subgraph "Monitoring & Scaling"
O[Resource Monitor] --> P[CPU Usage %]
O --> Q[Memory Usage %]
O --> R[Request Queue Length]
P --> S{CPU > 80%?}
Q --> T{Memory > 90%?}
R --> U{Queue > 100?}
S -->|Yes| V[Scale Up]
T -->|Yes| V
U -->|Yes| V
V --> W[Add Container Instance]
W --> X[Update Load Balancer]
end
subgraph "Performance Optimization"
Y[Browser Pool Tuning] --> Z[Max Pages: 40]
Y --> AA[Idle TTL: 30min]
Y --> BB[Concurrency Limits]
Z --> CC[Memory Efficiency]
AA --> DD[Resource Cleanup]
BB --> EE[Throughput Control]
end
style A fill:#e1f5fe
style F fill:#f3e5f5
style O fill:#e8f5e8
style Y fill:#fff3e0
```
**📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment)

View File

@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture
Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.
### Extraction Strategy Decision Tree
```mermaid
flowchart TD
A[Content to Extract] --> B{Content Type?}
B -->|Simple Patterns| C[Common Data Types]
B -->|Structured HTML| D[Predictable Structure]
B -->|Complex Content| E[Requires Reasoning]
B -->|Mixed Content| F[Multiple Data Types]
C --> C1{Pattern Type?}
C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
C1 -->|Custom Patterns| C3[Custom Regex Strategy]
C1 -->|LLM-Generated| C4[One-time Pattern Generation]
D --> D1{Selector Type?}
D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]
E --> E1{LLM Provider?}
E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
E1 -->|Local Ollama| E3[Local LLM Strategy]
E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]
F --> F1[Multi-Strategy Approach]
F1 --> F2[1. Regex for Patterns]
F1 --> F3[2. CSS for Structure]
F1 --> F4[3. LLM for Complex Analysis]
C2 --> G[Fast Extraction ⚡]
C3 --> G
C4 --> H[Cached Pattern Reuse]
D2 --> I[Schema-based Extraction 🏗️]
D3 --> I
D4 --> J[Generated Schema Cache]
E2 --> K[Intelligent Parsing 🧠]
E3 --> K
E4 --> L[Hybrid Cost-Effective]
F2 --> M[Comprehensive Results 📊]
F3 --> M
F4 --> M
style G fill:#c8e6c9
style I fill:#e3f2fd
style K fill:#fff3e0
style M fill:#f3e5f5
style H fill:#e8f5e8
style J fill:#e8f5e8
style L fill:#ffecb3
```
### LLM Extraction Strategy Workflow
```mermaid
sequenceDiagram
participant User
participant Crawler
participant LLMStrategy
participant Chunker
participant LLMProvider
participant Parser
User->>Crawler: Configure LLMExtractionStrategy
User->>Crawler: arun(url, config)
Crawler->>Crawler: Navigate to URL
Crawler->>Crawler: Extract content (HTML/Markdown)
Crawler->>LLMStrategy: Process content
LLMStrategy->>LLMStrategy: Check content size
alt Content > chunk_threshold
LLMStrategy->>Chunker: Split into chunks with overlap
Chunker-->>LLMStrategy: Return chunks[]
loop For each chunk
LLMStrategy->>LLMProvider: Send chunk + schema + instruction
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>LLMStrategy: Merge chunk results
else Content <= threshold
LLMStrategy->>LLMProvider: Send full content + schema
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>Parser: Validate JSON schema
Parser-->>LLMStrategy: Validated data
LLMStrategy->>LLMStrategy: Track token usage
LLMStrategy-->>Crawler: Return extracted_content
Crawler-->>User: CrawlResult with JSON data
User->>LLMStrategy: show_usage()
LLMStrategy-->>User: Token count & estimated cost
```
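In code, the chunking and token-tracking steps above sit as knobs on the strategy itself. A sketch, assuming an OpenAI key in the environment:
```python
import asyncio, json, os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                         api_token=os.getenv("OPENAI_API_KEY")),
    schema={"type": "object", "properties": {"title": {"type": "string"},
                                             "summary": {"type": "string"}}},
    extraction_type="schema",
    instruction="Extract the article title and a one-line summary.",
    chunk_token_threshold=1200,  # split larger pages into overlapping chunks
    apply_chunking=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com/article",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(json.loads(result.extracted_content))
        strategy.show_usage()  # prompt/completion token counts and estimated cost

asyncio.run(main())
```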
### Schema-Based Extraction Architecture
```mermaid
graph TB
subgraph "Schema Definition"
A[JSON Schema] --> A1[baseSelector]
A --> A2["fields[]"]
A --> A3[nested structures]
A2 --> A4[CSS/XPath selectors]
A2 --> A5[Data types: text, html, attribute]
A2 --> A6[Default values]
A3 --> A7[nested objects]
A3 --> A8[nested_list arrays]
A3 --> A9[simple lists]
end
subgraph "Extraction Engine"
B[HTML Content] --> C[Selector Engine]
C --> C1[CSS Selector Parser]
C --> C2[XPath Evaluator]
C1 --> D[Element Matcher]
C2 --> D
D --> E[Type Converter]
E --> E1[Text Extraction]
E --> E2[HTML Preservation]
E --> E3[Attribute Extraction]
E --> E4[Nested Processing]
end
subgraph "Result Processing"
F[Raw Extracted Data] --> G[Structure Builder]
G --> G1[Object Construction]
G --> G2[Array Assembly]
G --> G3[Type Validation]
G1 --> H[JSON Output]
G2 --> H
G3 --> H
end
A --> C
E --> F
H --> I[extracted_content]
style A fill:#e3f2fd
style C fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#c8e6c9
```
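A schema of this shape drives the engine directly, with no LLM in the loop:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": "div.product",  # one extracted object per match
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text", "default": "N/A"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

strategy = JsonCssExtractionStrategy(schema, verbose=True)
# Pass via CrawlerRunConfig(extraction_strategy=strategy) as usual
```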
### Automatic Schema Generation Process
```mermaid
stateDiagram-v2
[*] --> CheckCache
CheckCache --> CacheHit: Schema exists
CheckCache --> SamplePage: Schema missing
CacheHit --> LoadSchema
LoadSchema --> FastExtraction
SamplePage --> ExtractHTML: Crawl sample URL
ExtractHTML --> LLMAnalysis: Send HTML to LLM
LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
GenerateSchema --> ValidateSchema: Test generated schema
ValidateSchema --> SchemaWorks: Valid selectors
ValidateSchema --> RefineSchema: Invalid selectors
RefineSchema --> LLMAnalysis: Iterate with feedback
SchemaWorks --> CacheSchema: Save for reuse
CacheSchema --> FastExtraction: Use cached schema
FastExtraction --> [*]: No more LLM calls needed
note right of CheckCache : One-time LLM cost
note right of FastExtraction : Unlimited fast reuse
note right of CacheSchema : JSON file storage
```
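The one-time generation step corresponds to `generate_schema()`, and caching it is plain JSON serialization. A sketch; the helper's exact signature is taken from the docs and may vary by version:
```python
import json, os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

CACHE = "product_schema.json"

if os.path.exists(CACHE):
    schema = json.load(open(CACHE))            # fast path: no LLM call
else:
    sample_html = open("sample_page.html").read()  # a captured sample page
    schema = JsonCssExtractionStrategy.generate_schema(
        html=sample_html,
        query="Extract product name, price and link",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
    )
    json.dump(schema, open(CACHE, "w"))        # one-time LLM cost, reused forever

strategy = JsonCssExtractionStrategy(schema)
```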
### Multi-Strategy Extraction Pipeline
```mermaid
flowchart LR
A[Web Page Content] --> B[Strategy Pipeline]
subgraph B["Extraction Pipeline"]
B1[Stage 1: Regex Patterns]
B2[Stage 2: Schema-based CSS]
B3[Stage 3: LLM Analysis]
B1 --> B1a[Email addresses]
B1 --> B1b[Phone numbers]
B1 --> B1c[URLs and links]
B1 --> B1d[Currency amounts]
B2 --> B2a[Structured products]
B2 --> B2b[Article metadata]
B2 --> B2c[User reviews]
B2 --> B2d[Navigation links]
B3 --> B3a[Sentiment analysis]
B3 --> B3b[Key topics]
B3 --> B3c[Entity recognition]
B3 --> B3d[Content summary]
end
B1a --> C[Result Merger]
B1b --> C
B1c --> C
B1d --> C
B2a --> C
B2b --> C
B2c --> C
B2d --> C
B3a --> C
B3b --> C
B3c --> C
B3d --> C
C --> D[Combined JSON Output]
D --> E[Final CrawlResult]
style B1 fill:#c8e6c9
style B2 fill:#e3f2fd
style B3 fill:#fff3e0
style C fill:#f3e5f5
```
### Performance Comparison Matrix
```mermaid
graph TD
subgraph "Strategy Performance"
A[Extraction Strategy Comparison]
subgraph "Speed ⚡"
S1[Regex: ~10ms]
S2[CSS Schema: ~50ms]
S3[XPath: ~100ms]
S4[LLM: ~2-10s]
end
subgraph "Accuracy 🎯"
A1[Regex: Pattern-dependent]
A2[CSS: High for structured]
A3[XPath: Very high]
A4[LLM: Excellent for complex]
end
subgraph "Cost 💰"
C1[Regex: Free]
C2[CSS: Free]
C3[XPath: Free]
C4[LLM: $0.001-0.01 per page]
end
subgraph "Complexity 🔧"
X1[Regex: Simple patterns only]
X2[CSS: Structured HTML]
X3[XPath: Complex selectors]
X4[LLM: Any content type]
end
end
style S1 fill:#c8e6c9
style S2 fill:#e8f5e8
style S3 fill:#fff3e0
style S4 fill:#ffcdd2
style A2 fill:#e8f5e8
style A3 fill:#c8e6c9
style A4 fill:#c8e6c9
style C1 fill:#c8e6c9
style C2 fill:#c8e6c9
style C3 fill:#c8e6c9
style C4 fill:#fff3e0
style X1 fill:#ffcdd2
style X2 fill:#e8f5e8
style X3 fill:#c8e6c9
style X4 fill:#c8e6c9
```
### Regex Pattern Strategy Flow
```mermaid
flowchart TD
A[Regex Extraction] --> B{Pattern Source?}
B -->|Built-in| C[Use Predefined Patterns]
B -->|Custom| D[Define Custom Regex]
B -->|LLM-Generated| E[Generate with AI]
C --> C1[Email Pattern]
C --> C2[Phone Pattern]
C --> C3[URL Pattern]
C --> C4[Currency Pattern]
C --> C5[Date Pattern]
D --> D1[Write Custom Regex]
D --> D2[Test Pattern]
D --> D3{Pattern Works?}
D3 -->|No| D1
D3 -->|Yes| D4[Use Pattern]
E --> E1[Provide Sample Content]
E --> E2[LLM Analyzes Content]
E --> E3[Generate Optimized Regex]
E --> E4[Cache Pattern for Reuse]
C1 --> F[Pattern Matching]
C2 --> F
C3 --> F
C4 --> F
C5 --> F
D4 --> F
E4 --> F
F --> G[Extract Matches]
G --> H[Group by Pattern Type]
H --> I[JSON Output with Labels]
style C fill:#e8f5e8
style D fill:#e3f2fd
style E fill:#fff3e0
style F fill:#f3e5f5
```
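A sketch of the built-in and custom pattern sources; the combinable flag names follow the `RegexExtractionStrategy` docs:
```python
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Built-in patterns, combinable as bit flags
builtin = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
)

# Custom patterns, labeled by name in the JSON output
custom = RegexExtractionStrategy(
    custom={"price_usd": r"\$\d+(?:\.\d{2})?"}
)
```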
### Complex Schema Structure Visualization
```mermaid
graph TB
subgraph "E-commerce Schema Example"
A[Category baseSelector] --> B[Category Fields]
A --> C[Products nested_list]
B --> B1[category_name]
B --> B2[category_id attribute]
B --> B3[category_url attribute]
C --> C1[Product baseSelector]
C1 --> C2[name text]
C1 --> C3[price text]
C1 --> C4[Details nested object]
C1 --> C5[Features list]
C1 --> C6[Reviews nested_list]
C4 --> C4a[brand text]
C4 --> C4b[model text]
C4 --> C4c[specs html]
C5 --> C5a[feature text array]
C6 --> C6a[reviewer text]
C6 --> C6b[rating attribute]
C6 --> C6c[comment text]
C6 --> C6d[date attribute]
end
subgraph "JSON Output Structure"
D[categories array] --> D1[category object]
D1 --> D2[category_name]
D1 --> D3[category_id]
D1 --> D4[products array]
D4 --> D5[product object]
D5 --> D6[name, price]
D5 --> D7[details object]
D5 --> D8[features array]
D5 --> D9[reviews array]
D7 --> D7a[brand, model, specs]
D8 --> D8a[feature strings]
D9 --> D9a[review objects]
end
A -.-> D
B1 -.-> D2
C2 -.-> D6
C4 -.-> D7
C5 -.-> D8
C6 -.-> D9
style A fill:#e3f2fd
style C fill:#f3e5f5
style C4 fill:#e8f5e8
style D fill:#fff3e0
```
### Error Handling and Fallback Strategy
```mermaid
stateDiagram-v2
[*] --> PrimaryStrategy
PrimaryStrategy --> Success: Extraction successful
PrimaryStrategy --> ValidationFailed: Invalid data
PrimaryStrategy --> ExtractionFailed: No matches found
PrimaryStrategy --> TimeoutError: LLM timeout
ValidationFailed --> FallbackStrategy: Try alternative
ExtractionFailed --> FallbackStrategy: Try alternative
TimeoutError --> FallbackStrategy: Try alternative
FallbackStrategy --> FallbackSuccess: Fallback works
FallbackStrategy --> FallbackFailed: All strategies failed
FallbackSuccess --> Success: Return results
FallbackFailed --> ErrorReport: Log failure details
Success --> [*]: Complete
ErrorReport --> [*]: Return empty results
note right of PrimaryStrategy : Try fastest/most accurate first
note right of FallbackStrategy : Use simpler but reliable method
note left of ErrorReport : Provide debugging information
```
### Token Usage and Cost Optimization
```mermaid
flowchart TD
A[LLM Extraction Request] --> B{Content Size Check}
B -->|Small < 1200 tokens| C[Single LLM Call]
B -->|Large > 1200 tokens| D[Chunking Strategy]
C --> C1[Send full content]
C1 --> C2[Parse JSON response]
C2 --> C3[Track token usage]
D --> D1[Split into chunks]
D1 --> D2[Add overlap between chunks]
D2 --> D3[Process chunks in parallel]
D3 --> D4[Chunk 1 → LLM]
D3 --> D5[Chunk 2 → LLM]
D3 --> D6[Chunk N → LLM]
D4 --> D7[Merge results]
D5 --> D7
D6 --> D7
D7 --> D8[Deduplicate data]
D8 --> D9[Aggregate token usage]
C3 --> E[Cost Calculation]
D9 --> E
E --> F[Usage Report]
F --> F1[Prompt tokens: X]
F --> F2[Completion tokens: Y]
F --> F3[Total cost: $Z]
style C fill:#c8e6c9
style D fill:#fff3e0
style E fill:#e3f2fd
style F fill:#f3e5f5
```
**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

View File

@@ -0,0 +1,472 @@
## HTTP Crawler Strategy Workflows
Visual representations of HTTP-based crawling architecture, request flows, and performance characteristics compared to browser-based strategies.
### HTTP vs Browser Strategy Decision Tree
```mermaid
flowchart TD
A[Content Crawling Need] --> B{Content Type Analysis}
B -->|Static HTML| C{JavaScript Required?}
B -->|Dynamic SPA| D[Browser Strategy Required]
B -->|API Endpoints| E[HTTP Strategy Optimal]
B -->|Mixed Content| F{Primary Content Source?}
C -->|No JS Needed| G[HTTP Strategy Recommended]
C -->|JS Required| H[Browser Strategy Required]
C -->|Unknown| I{Performance Priority?}
I -->|Speed Critical| J[Try HTTP First]
I -->|Accuracy Critical| K[Use Browser Strategy]
F -->|Mostly Static| G
F -->|Mostly Dynamic| D
G --> L{Resource Constraints?}
L -->|Memory Limited| M[HTTP Strategy - Lightweight]
L -->|CPU Limited| N[HTTP Strategy - No Browser]
L -->|Network Limited| O[HTTP Strategy - Efficient]
L -->|No Constraints| P[Either Strategy Works]
J --> Q[Test HTTP Results]
Q --> R{Content Complete?}
R -->|Yes| S[Continue with HTTP]
R -->|No| T[Switch to Browser Strategy]
D --> U[Browser Strategy Features]
H --> U
K --> U
T --> U
U --> V[JavaScript Execution]
U --> W[Screenshots/PDFs]
U --> X[Complex Interactions]
U --> Y[Session Management]
M --> Z[HTTP Strategy Benefits]
N --> Z
O --> Z
S --> Z
Z --> AA[10x Faster Processing]
Z --> BB[Lower Memory Usage]
Z --> CC[Higher Concurrency]
Z --> DD[Simpler Deployment]
style G fill:#c8e6c9
style M fill:#c8e6c9
style N fill:#c8e6c9
style O fill:#c8e6c9
style S fill:#c8e6c9
style D fill:#e3f2fd
style H fill:#e3f2fd
style K fill:#e3f2fd
style T fill:#e3f2fd
style U fill:#e3f2fd
```
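Switching to the lightweight strategy is one constructor argument; a sketch assuming the documented `AsyncHTTPCrawlerStrategy`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    # No browser is launched: a plain HTTP fetch feeds the same processing pipeline
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy()) as crawler:
        result = await crawler.arun("https://example.com/docs")
        print(result.status_code, len(result.markdown or ""))

asyncio.run(main())
```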
### HTTP Request Lifecycle Sequence
```mermaid
sequenceDiagram
participant Client
participant HTTPStrategy as HTTP Strategy
participant Session as HTTP Session
participant Server as Target Server
participant Processor as Content Processor
Client->>HTTPStrategy: crawl(url, config)
HTTPStrategy->>HTTPStrategy: validate_url()
alt file:// URL
HTTPStrategy->>HTTPStrategy: handle_file_url()
Note over HTTPStrategy: Local file read, no network
else raw:// content
HTTPStrategy->>HTTPStrategy: handle_raw_content()
Note over HTTPStrategy: Inline HTML, no network
else http(s):// URL
HTTPStrategy->>Session: prepare_request()
Session->>Session: apply_config()
Session->>Session: set_headers()
Session->>Session: setup_auth()
Session->>Server: HTTP Request
Note over Session,Server: GET/POST/PUT with headers
alt Success Response
Server-->>Session: HTTP 200 + Content
Session-->>HTTPStrategy: response_data
else Redirect Response
Server-->>Session: HTTP 3xx + Location
Session->>Server: Follow redirect
Server-->>Session: HTTP 200 + Content
Session-->>HTTPStrategy: final_response
else Error Response
Server-->>Session: HTTP 4xx/5xx
Session-->>HTTPStrategy: error_response
end
end
HTTPStrategy->>Processor: process_content()
Processor->>Processor: clean_html()
Processor->>Processor: extract_metadata()
Processor->>Processor: generate_markdown()
Processor-->>HTTPStrategy: processed_result
HTTPStrategy-->>Client: CrawlResult
Note over Client,Processor: Fast, lightweight processing
Note over HTTPStrategy: No browser overhead
```
### HTTP Strategy Architecture
```mermaid
graph TB
subgraph "HTTP Crawler Strategy"
A[AsyncHTTPCrawlerStrategy] --> B[Session Manager]
A --> C[Request Builder]
A --> D[Response Handler]
A --> E[Error Manager]
B --> B1[Connection Pool]
B --> B2[DNS Cache]
B --> B3[SSL Context]
C --> C1[Headers Builder]
C --> C2[Auth Handler]
C --> C3[Payload Encoder]
D --> D1[Content Decoder]
D --> D2[Redirect Handler]
D --> D3[Status Validator]
E --> E1[Retry Logic]
E --> E2[Timeout Handler]
E --> E3[Exception Mapper]
end
subgraph "Content Processing"
F[Raw HTML] --> G[HTML Cleaner]
G --> H[Markdown Generator]
H --> I[Link Extractor]
I --> J[Media Extractor]
J --> K[Metadata Parser]
end
subgraph "External Resources"
L[Target Websites]
M[Local Files]
N[Raw Content]
end
subgraph "Output"
O[CrawlResult]
O --> O1[HTML Content]
O --> O2[Markdown Text]
O --> O3[Extracted Links]
O --> O4[Media References]
O --> O5[Status Information]
end
A --> F
L --> A
M --> A
N --> A
K --> O
style A fill:#e3f2fd
style B fill:#f3e5f5
style F fill:#e8f5e8
style O fill:#fff3e0
```
### Performance Comparison Flow
```mermaid
graph LR
subgraph "HTTP Strategy Performance"
A1[Request Start] --> A2[DNS Lookup: 50ms]
A2 --> A3[TCP Connect: 100ms]
A3 --> A4[HTTP Request: 200ms]
A4 --> A5[Content Download: 300ms]
A5 --> A6[Processing: 50ms]
A6 --> A7[Total: ~700ms]
end
subgraph "Browser Strategy Performance"
B1[Request Start] --> B2[Browser Launch: 2000ms]
B2 --> B3[Page Navigation: 1000ms]
B3 --> B4[JS Execution: 500ms]
B4 --> B5[Content Rendering: 300ms]
B5 --> B6[Processing: 100ms]
B6 --> B7[Total: ~3900ms]
end
subgraph "Resource Usage"
C1[HTTP Memory: ~50MB]
C2[Browser Memory: ~500MB]
C3[HTTP CPU: Low]
C4[Browser CPU: High]
C5[HTTP Concurrency: 100+]
C6[Browser Concurrency: 10-20]
end
A7 --> D[5.5x Faster]
B7 --> D
C1 --> E[10x Less Memory]
C2 --> E
C5 --> F[5x More Concurrent]
C6 --> F
style A7 fill:#c8e6c9
style B7 fill:#ffcdd2
style C1 fill:#c8e6c9
style C2 fill:#ffcdd2
style C5 fill:#c8e6c9
style C6 fill:#ffcdd2
```
### HTTP Request Types and Configuration
```mermaid
stateDiagram-v2
[*] --> HTTPConfigSetup
HTTPConfigSetup --> MethodSelection
MethodSelection --> GET: Simple data retrieval
MethodSelection --> POST: Form submission
MethodSelection --> PUT: Data upload
MethodSelection --> DELETE: Resource removal
GET --> HeaderSetup: Set Accept headers
POST --> PayloadSetup: JSON or form data
PUT --> PayloadSetup: File or data upload
DELETE --> AuthSetup: Authentication required
PayloadSetup --> JSONPayload: application/json
PayloadSetup --> FormPayload: form-data
PayloadSetup --> RawPayload: custom content
JSONPayload --> HeaderSetup
FormPayload --> HeaderSetup
RawPayload --> HeaderSetup
HeaderSetup --> AuthSetup
AuthSetup --> SSLSetup
SSLSetup --> RedirectSetup
RedirectSetup --> RequestExecution
RequestExecution --> [*]: Request complete
note right of GET : Default method for most crawling
note right of POST : API interactions, form submissions
note right of JSONPayload : Structured data transmission
note right of HeaderSetup : User-Agent, Accept, Custom headers
```
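Method, payload, and header setup map onto `HTTPCrawlerConfig`; the parameter names below follow the docs' sketch and should be treated as assumptions:
```python
from crawl4ai.async_configs import HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

http_config = HTTPCrawlerConfig(
    method="POST",                            # GET is the default for crawling
    json={"query": "crawl4ai"},               # JSON payload for API interactions
    headers={"User-Agent": "MyCrawler/1.0"},  # custom request headers
    follow_redirects=True,
    verify_ssl=True,
)
strategy = AsyncHTTPCrawlerStrategy(browser_config=http_config)
```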
### Error Handling and Retry Workflow
```mermaid
flowchart TD
A[HTTP Request] --> B{Response Received?}
B -->|No| C[Connection Error]
B -->|Yes| D{Status Code Check}
C --> C1{Timeout Error?}
C1 -->|Yes| C2[ConnectionTimeoutError]
C1 -->|No| C3[Network Error]
D -->|2xx| E[Success Response]
D -->|3xx| F[Redirect Response]
D -->|4xx| G[Client Error]
D -->|5xx| H[Server Error]
F --> F1{Follow Redirects?}
F1 -->|Yes| F2[Follow Redirect]
F1 -->|No| F3[Return Redirect Response]
F2 --> A
G --> G1{Retry on 4xx?}
G1 -->|No| G2[HTTPStatusError]
G1 -->|Yes| I[Check Retry Count]
H --> H1{Retry on 5xx?}
H1 -->|Yes| I
H1 -->|No| H2[HTTPStatusError]
C2 --> I
C3 --> I
I --> J{Retries < Max?}
J -->|No| K[Final Error]
J -->|Yes| L[Calculate Backoff]
L --> M[Wait Backoff Time]
M --> N[Increment Retry Count]
N --> A
E --> O[Process Content]
F3 --> O
O --> P[Return CrawlResult]
G2 --> Q[Error CrawlResult]
H2 --> Q
K --> Q
style E fill:#c8e6c9
style P fill:#c8e6c9
style G2 fill:#ffcdd2
style H2 fill:#ffcdd2
style K fill:#ffcdd2
style Q fill:#ffcdd2
```
### Batch Processing Architecture
```mermaid
sequenceDiagram
participant Client
participant BatchManager as Batch Manager
participant HTTPPool as Connection Pool
participant Workers as HTTP Workers
participant Targets as Target Servers
Client->>BatchManager: batch_crawl(urls)
BatchManager->>BatchManager: create_semaphore(max_concurrent)
loop For each URL batch
BatchManager->>HTTPPool: acquire_connection()
HTTPPool->>Workers: assign_worker()
par Concurrent Processing
Workers->>Targets: HTTP Request 1
Workers->>Targets: HTTP Request 2
Workers->>Targets: HTTP Request N
end
par Response Handling
Targets-->>Workers: Response 1
Targets-->>Workers: Response 2
Targets-->>Workers: Response N
end
Workers->>HTTPPool: return_connection()
HTTPPool->>BatchManager: batch_results()
end
BatchManager->>BatchManager: aggregate_results()
BatchManager-->>Client: final_results()
Note over Workers,Targets: 20-100 concurrent connections
Note over BatchManager: Memory-efficient processing
Note over HTTPPool: Connection reuse optimization
```
### Content Type Processing Pipeline
```mermaid
graph TD
A[HTTP Response] --> B{Content-Type Detection}
B -->|text/html| C[HTML Processing]
B -->|application/json| D[JSON Processing]
B -->|text/plain| E[Text Processing]
B -->|application/xml| F[XML Processing]
B -->|Other| G[Binary Processing]
C --> C1[Parse HTML Structure]
C1 --> C2[Extract Text Content]
C2 --> C3[Generate Markdown]
C3 --> C4[Extract Links/Media]
D --> D1[Parse JSON Structure]
D1 --> D2[Extract Data Fields]
D2 --> D3[Format as Readable Text]
E --> E1[Clean Text Content]
E1 --> E2[Basic Formatting]
F --> F1[Parse XML Structure]
F1 --> F2[Extract Text Nodes]
F2 --> F3[Convert to Markdown]
G --> G1[Save Binary Content]
G1 --> G2[Generate Metadata]
C4 --> H[Content Analysis]
D3 --> H
E2 --> H
F3 --> H
G2 --> H
H --> I[Link Extraction]
H --> J[Media Detection]
H --> K[Metadata Parsing]
I --> L[CrawlResult Assembly]
J --> L
K --> L
L --> M[Final Output]
style C fill:#e8f5e8
style H fill:#fff3e0
style L fill:#e3f2fd
style M fill:#c8e6c9
```
### Integration with Processing Strategies
```mermaid
graph LR
subgraph "HTTP Strategy Core"
A[HTTP Request] --> B[Raw Content]
B --> C[Content Decoder]
end
subgraph "Processing Pipeline"
C --> D[HTML Cleaner]
D --> E[Markdown Generator]
E --> F{Content Filter?}
F -->|Yes| G[Pruning Filter]
F -->|Yes| H[BM25 Filter]
F -->|No| I[Raw Markdown]
G --> J[Fit Markdown]
H --> J
end
subgraph "Extraction Strategies"
I --> K[CSS Extraction]
J --> K
I --> L[XPath Extraction]
J --> L
I --> M[LLM Extraction]
J --> M
end
subgraph "Output Generation"
K --> N[Structured JSON]
L --> N
M --> N
I --> O[Clean Markdown]
J --> P[Filtered Content]
N --> Q[Final CrawlResult]
O --> Q
P --> Q
end
style A fill:#e3f2fd
style C fill:#f3e5f5
style E fill:#e8f5e8
style Q fill:#c8e6c9
```
**📖 Learn more:** [HTTP vs Browser Strategies](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Error Handling](https://docs.crawl4ai.com/api/async-webcrawler/)

View File

@@ -0,0 +1,368 @@
## Installation Workflows and Architecture
Visual representations of Crawl4AI installation processes, deployment options, and system interactions.
### Installation Decision Flow
```mermaid
flowchart TD
A[Start Installation] --> B{Environment Type?}
B -->|Local Development| C[Basic Python Install]
B -->|Production| D[Docker Deployment]
B -->|Research/Testing| E[Google Colab]
B -->|CI/CD Pipeline| F[Automated Setup]
C --> C1[pip install crawl4ai]
C1 --> C2[crawl4ai-setup]
C2 --> C3{Need Advanced Features?}
C3 -->|No| C4[Basic Installation Complete]
C3 -->|Text Clustering| C5["pip install crawl4ai[torch]"]
C3 -->|Transformers| C6["pip install crawl4ai[transformer]"]
C3 -->|All Features| C7["pip install crawl4ai[all]"]
C5 --> C8[crawl4ai-download-models]
C6 --> C8
C7 --> C8
C8 --> C9[Advanced Installation Complete]
D --> D1{Deployment Method?}
D1 -->|Pre-built Image| D2[docker pull unclecode/crawl4ai]
D1 -->|Docker Compose| D3[Clone repo + docker compose]
D1 -->|Custom Build| D4[docker buildx build]
D2 --> D5[Configure .llm.env]
D3 --> D5
D4 --> D5
D5 --> D6[docker run with ports]
D6 --> D7[Docker Deployment Complete]
E --> E1[Colab pip install]
E1 --> E2[playwright install chromium]
E2 --> E3[Test basic crawl]
E3 --> E4[Colab Setup Complete]
F --> F1[Automated pip install]
F1 --> F2[Automated setup scripts]
F2 --> F3[CI/CD Integration Complete]
C4 --> G[Verify with crawl4ai-doctor]
C9 --> G
D7 --> H[Health check via API]
E4 --> I[Run test crawl]
F3 --> G
G --> J[Installation Verified]
H --> J
I --> J
style A fill:#e1f5fe
style J fill:#c8e6c9
style C4 fill:#fff3e0
style C9 fill:#fff3e0
style D7 fill:#f3e5f5
style E4 fill:#fce4ec
style F3 fill:#e8f5e8
```
### Basic Installation Sequence
```mermaid
sequenceDiagram
participant User
participant PyPI
participant System
participant Playwright
participant Crawler
User->>PyPI: pip install crawl4ai
PyPI-->>User: Package downloaded
User->>System: crawl4ai-setup
System->>Playwright: Install browser binaries
Playwright-->>System: Chromium, Firefox installed
System-->>User: Setup complete
User->>System: crawl4ai-doctor
System->>System: Check Python version
System->>System: Verify Playwright installation
System->>System: Test browser launch
System-->>User: Diagnostics report
User->>Crawler: Basic crawl test
Crawler->>Playwright: Launch browser
Playwright-->>Crawler: Browser ready
Crawler->>Crawler: Navigate to test URL
Crawler-->>User: Success confirmation
```
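The final "basic crawl test" in the sequence is the standard quickstart snippet:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # a healthy install prints page markdown

asyncio.run(main())
```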
### Docker Deployment Architecture
```mermaid
graph TB
subgraph "Host System"
A[Docker Engine] --> B[Crawl4AI Container]
C[.llm.env File] --> B
D[Port 11235] --> B
end
subgraph "Container Environment"
B --> E[FastAPI Server]
B --> F[Playwright Browsers]
B --> G[Python Runtime]
E --> H["/crawl Endpoint"]
E --> I["/playground Interface"]
E --> J["/health Monitoring"]
E --> K["/metrics Prometheus"]
F --> L[Chromium Browser]
F --> M[Firefox Browser]
F --> N[WebKit Browser]
end
subgraph "External Services"
O[OpenAI API] --> B
P[Anthropic API] --> B
Q[Local LLM Ollama] --> B
end
subgraph "Client Applications"
R[Python SDK] --> H
S[REST API Calls] --> H
T[Web Browser] --> I
U[Monitoring Tools] --> J
V[Prometheus] --> K
end
style B fill:#e3f2fd
style E fill:#f3e5f5
style F fill:#e8f5e8
style G fill:#fff3e0
```
### Advanced Features Installation Flow
```mermaid
stateDiagram-v2
[*] --> BasicInstall
BasicInstall --> FeatureChoice: crawl4ai installed
FeatureChoice --> TorchInstall: Need text clustering
FeatureChoice --> TransformerInstall: Need HuggingFace models
FeatureChoice --> AllInstall: Need everything
FeatureChoice --> Complete: Basic features sufficient
TorchInstall --> TorchSetup: pip install crawl4ai[torch]
TransformerInstall --> TransformerSetup: pip install crawl4ai[transformer]
AllInstall --> AllSetup: pip install crawl4ai[all]
TorchSetup --> ModelDownload: crawl4ai-setup
TransformerSetup --> ModelDownload: crawl4ai-setup
AllSetup --> ModelDownload: crawl4ai-setup
ModelDownload --> PreDownload: crawl4ai-download-models
PreDownload --> Complete: All models cached
Complete --> Verification: crawl4ai-doctor
Verification --> [*]: Installation verified
note right of TorchInstall : PyTorch for semantic operations
note right of TransformerInstall : HuggingFace for LLM features
note right of AllInstall : Complete feature set
```
### Platform-Specific Installation Matrix
```mermaid
graph LR
subgraph "Installation Methods"
A[Python Package] --> A1[pip install]
B[Docker Image] --> B1[docker pull]
C[Source Build] --> C1[git clone + build]
D[Cloud Platform] --> D1[Colab/Kaggle]
end
subgraph "Operating Systems"
E[Linux x86_64]
F[Linux ARM64]
G[macOS Intel]
H[macOS Apple Silicon]
I[Windows x86_64]
end
subgraph "Feature Sets"
J[Basic crawling]
K["Text clustering (torch)"]
L["LLM features (transformers)"]
M[All features]
end
A1 --> E
A1 --> F
A1 --> G
A1 --> H
A1 --> I
B1 --> E
B1 --> F
B1 --> G
B1 --> H
C1 --> E
C1 --> F
C1 --> G
C1 --> H
C1 --> I
D1 --> E
D1 --> I
E --> J
E --> K
E --> L
E --> M
F --> J
F --> K
F --> L
F --> M
G --> J
G --> K
G --> L
G --> M
H --> J
H --> K
H --> L
H --> M
I --> J
I --> K
I --> L
I --> M
style A1 fill:#e3f2fd
style B1 fill:#f3e5f5
style C1 fill:#e8f5e8
style D1 fill:#fff3e0
```
### Docker Multi-Stage Build Process
```mermaid
sequenceDiagram
participant Dev as Developer
participant Git as GitHub Repo
participant Docker as Docker Engine
participant Registry as Docker Hub
participant User as End User
Dev->>Git: Push code changes
Docker->>Git: Clone repository
Docker->>Docker: Stage 1 - Base Python image
Docker->>Docker: Stage 2 - Install dependencies
Docker->>Docker: Stage 3 - Install Playwright
Docker->>Docker: Stage 4 - Copy application code
Docker->>Docker: Stage 5 - Setup FastAPI server
Note over Docker: Multi-architecture build
Docker->>Docker: Build for linux/amd64
Docker->>Docker: Build for linux/arm64
Docker->>Registry: Push multi-arch manifest
Registry-->>Docker: Build complete
User->>Registry: docker pull unclecode/crawl4ai
Registry-->>User: Download appropriate architecture
User->>Docker: docker run with configuration
Docker->>Docker: Start container
Docker->>Docker: Initialize FastAPI server
Docker->>Docker: Setup Playwright browsers
Docker-->>User: Service ready on port 11235
```
### Installation Verification Workflow
```mermaid
flowchart TD
A[Installation Complete] --> B[Run crawl4ai-doctor]
B --> C{Python Version Check}
C -->|✓ 3.10+| D{Playwright Check}
C -->|✗ < 3.10| C1[Upgrade Python]
C1 --> D
D -->|✓ Installed| E{Browser Binaries}
D -->|✗ Missing| D1[Run crawl4ai-setup]
D1 --> E
E -->|✓ Available| F{Test Browser Launch}
E -->|✗ Missing| E1[playwright install]
E1 --> F
F -->|✓ Success| G[Test Basic Crawl]
F -->|✗ Failed| F1[Check system dependencies]
F1 --> F
G --> H{Crawl Test Result}
H -->|✓ Success| I[Installation Verified ✓]
H -->|✗ Failed| H1[Check network/permissions]
H1 --> G
I --> J[Ready for Production Use]
style I fill:#c8e6c9
style J fill:#e8f5e8
style C1 fill:#ffcdd2
style D1 fill:#fff3e0
style E1 fill:#fff3e0
style F1 fill:#ffcdd2
style H1 fill:#ffcdd2
```
### Resource Requirements by Installation Type
```mermaid
graph TD
subgraph "Basic Installation"
A1[Memory: 512MB]
A2[Disk: 2GB]
A3[CPU: 1 core]
A4[Network: Required for setup]
end
subgraph "Advanced Features torch"
B1[Memory: 2GB+]
B2[Disk: 5GB+]
B3[CPU: 2+ cores]
B4[GPU: Optional CUDA]
end
subgraph "All Features"
C1[Memory: 4GB+]
C2[Disk: 10GB+]
C3[CPU: 4+ cores]
C4[GPU: Recommended]
end
subgraph "Docker Deployment"
D1[Memory: 1GB+]
D2[Disk: 3GB+]
D3[CPU: 2+ cores]
D4[Ports: 11235]
D5[Shared Memory: 1GB]
end
style A1 fill:#e8f5e8
style B1 fill:#fff3e0
style C1 fill:#ffecb3
style D1 fill:#e3f2fd
```
**📖 Learn more:** [Installation Guide](https://docs.crawl4ai.com/core/installation/), [Docker Deployment](https://docs.crawl4ai.com/core/docker-deployment/), [System Requirements](https://docs.crawl4ai.com/core/installation/#prerequisites)

File diff suppressed because it is too large

View File

@@ -0,0 +1,392 @@
## Multi-URL Crawling Workflows and Architecture
Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.
### Multi-URL Processing Modes
```mermaid
flowchart TD
A[Multi-URL Crawling Request] --> B{Processing Mode?}
B -->|Batch Mode| C[Collect All URLs]
B -->|Streaming Mode| D[Process URLs Individually]
C --> C1[Queue All URLs]
C1 --> C2[Execute Concurrently]
C2 --> C3[Wait for All Completion]
C3 --> C4[Return Complete Results Array]
D --> D1[Queue URLs]
D1 --> D2[Start First Batch]
D2 --> D3[Yield Results as Available]
D3 --> D4{More URLs?}
D4 -->|Yes| D5[Start Next URLs]
D4 -->|No| D6[Stream Complete]
D5 --> D3
C4 --> E[Process Results]
D6 --> E
E --> F[Success/Failure Analysis]
F --> G[End]
style C fill:#e3f2fd
style D fill:#f3e5f5
style C4 fill:#c8e6c9
style D6 fill:#c8e6c9
```
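Both modes go through `arun_many()`; only the `stream` flag changes. A sketch:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = ["https://example.com", "https://example.org", "https://example.net"]

async def main():
    async with AsyncWebCrawler() as crawler:
        # Batch: one list once everything has finished
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(stream=False))
        print(sum(r.success for r in results), "of", len(results), "succeeded")

        # Streaming: handle each result the moment it completes
        async for result in await crawler.arun_many(
            urls, config=CrawlerRunConfig(stream=True)
        ):
            print("done:", result.url)

asyncio.run(main())
```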
### Memory-Adaptive Dispatcher Flow
```mermaid
stateDiagram-v2
[*] --> Initializing
Initializing --> MonitoringMemory: Start dispatcher
MonitoringMemory --> CheckingMemory: Every check_interval
CheckingMemory --> MemoryOK: Memory < threshold
CheckingMemory --> MemoryHigh: Memory >= threshold
MemoryOK --> DispatchingTasks: Start new crawls
MemoryHigh --> WaitingForMemory: Pause dispatching
DispatchingTasks --> TaskRunning: Launch crawler
TaskRunning --> TaskCompleted: Crawl finished
TaskRunning --> TaskFailed: Crawl error
TaskCompleted --> MonitoringMemory: Update stats
TaskFailed --> MonitoringMemory: Update stats
WaitingForMemory --> CheckingMemory: Wait timeout
WaitingForMemory --> MonitoringMemory: Memory freed
note right of MemoryHigh: Prevents OOM crashes
note right of DispatchingTasks: Respects max_session_permit
note right of WaitingForMemory: Configurable timeout
```
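The thresholds in the state machine are constructor parameters on the documented `MemoryAdaptiveDispatcher`:
```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,  # pause dispatching above this usage
    check_interval=1.0,             # seconds between memory checks
    max_session_permit=10,          # hard cap on concurrent crawls
)

# Then: results = await crawler.arun_many(urls, config=run_config, dispatcher=dispatcher)
```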
### Concurrent Crawling Architecture
```mermaid
graph TB
subgraph "URL Queue Management"
A[URL Input List] --> B[URL Queue]
B --> C[Priority Scheduler]
C --> D[Batch Assignment]
end
subgraph "Dispatcher Layer"
E[Memory Adaptive Dispatcher]
F[Semaphore Dispatcher]
G[Rate Limiter]
H[Resource Monitor]
E --> I[Memory Checker]
F --> J[Concurrency Controller]
G --> K[Delay Calculator]
H --> L[System Stats]
end
subgraph "Crawler Pool"
M[Crawler Instance 1]
N[Crawler Instance 2]
O[Crawler Instance 3]
P[Crawler Instance N]
M --> Q[Browser Session 1]
N --> R[Browser Session 2]
O --> S[Browser Session 3]
P --> T[Browser Session N]
end
subgraph "Result Processing"
U[Result Collector]
V[Success Handler]
W[Error Handler]
X[Retry Queue]
Y[Final Results]
end
D --> E
D --> F
E --> M
F --> N
G --> O
H --> P
Q --> U
R --> U
S --> U
T --> U
U --> V
U --> W
W --> X
X --> B
V --> Y
style E fill:#e3f2fd
style F fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#fff3e0
```
### Rate Limiting and Backoff Strategy
```mermaid
sequenceDiagram
participant C as Crawler
participant RL as Rate Limiter
participant S as Server
participant D as Dispatcher
C->>RL: Request to crawl URL
RL->>RL: Calculate delay
RL->>RL: Apply base delay (1-3s)
RL->>C: Delay applied
C->>S: HTTP Request
alt Success Response
S-->>C: 200 OK + Content
C->>RL: Report success
RL->>RL: Reset failure count
C->>D: Return successful result
else Rate Limited
S-->>C: 429 Too Many Requests
C->>RL: Report rate limit
RL->>RL: Exponential backoff
RL->>RL: Increase delay (up to max_delay)
RL->>C: Apply longer delay
C->>S: Retry request after delay
else Server Error
S-->>C: 503 Service Unavailable
C->>RL: Report server error
RL->>RL: Moderate backoff
RL->>C: Retry with backoff
else Max Retries Exceeded
RL->>C: Stop retrying
C->>D: Return failed result
end
```
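The backoff behaviour above is driven by a `RateLimiter` attached to the dispatcher; a sketch with illustrative values (import paths may vary by version):
```python
from crawl4ai import RateLimiter
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

rate_limiter = RateLimiter(
    base_delay=(1.0, 3.0),        # random 1-3s delay, as in the diagram
    max_delay=60.0,               # cap for exponential backoff
    max_retries=3,                # give up after repeated failures
    rate_limit_codes=[429, 503]   # status codes that trigger backoff
)

dispatcher = MemoryAdaptiveDispatcher(rate_limiter=rate_limiter)
```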
### Large-Scale Crawling Workflow
```mermaid
flowchart TD
A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]
B --> C{Select Dispatcher Type}
C -->|Memory Constrained| D[Memory Adaptive]
C -->|Fixed Resources| E[Semaphore Based]
D --> F[Set Memory Threshold 70%]
E --> G[Set Concurrency Limit]
F --> H[Configure Monitoring]
G --> H
H --> I[Start Crawling Process]
I --> J[Monitor System Resources]
J --> K{Memory Usage?}
K -->|< Threshold| L[Continue Dispatching]
K -->|>= Threshold| M[Pause New Tasks]
L --> N[Process Results Stream]
M --> O[Wait for Memory]
O --> K
N --> P{Result Type?}
P -->|Success| Q[Save to Database]
P -->|Failure| R[Log Error]
Q --> S[Update Progress Counter]
R --> S
S --> T{More URLs?}
T -->|Yes| U[Get Next Batch]
T -->|No| V[Generate Final Report]
U --> L
V --> W[Analysis Complete]
style A fill:#e1f5fe
style D fill:#e8f5e8
style E fill:#f3e5f5
style V fill:#c8e6c9
style W fill:#a5d6a7
```
### Real-Time Monitoring Dashboard Flow
```mermaid
graph LR
subgraph "Data Collection"
A[Crawler Tasks] --> B[Performance Metrics]
A --> C[Memory Usage]
A --> D[Success/Failure Rates]
A --> E[Response Times]
end
subgraph "Monitor Processing"
F[CrawlerMonitor] --> G[Aggregate Statistics]
F --> H[Display Formatter]
F --> I[Update Scheduler]
end
subgraph "Display Modes"
J[DETAILED Mode]
K[AGGREGATED Mode]
J --> L[Individual Task Status]
J --> M[Task-Level Metrics]
K --> N[Summary Statistics]
K --> O[Overall Progress]
end
subgraph "Output Interface"
P[Console Display]
Q[Progress Bars]
R[Status Tables]
S[Real-time Updates]
end
B --> F
C --> F
D --> F
E --> F
G --> J
G --> K
H --> J
H --> K
I --> J
I --> K
L --> P
M --> Q
N --> R
O --> S
style F fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
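Attaching the monitor to a dispatcher is a one-liner; a sketch assuming the `CrawlerMonitor`/`DisplayMode` API:
```python
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

# DETAILED shows one row per task; AGGREGATED shows summary statistics
monitor = CrawlerMonitor(
    max_visible_rows=15,
    display_mode=DisplayMode.DETAILED
)

dispatcher = MemoryAdaptiveDispatcher(monitor=monitor)
```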
### Error Handling and Recovery Pattern
```mermaid
stateDiagram-v2
[*] --> ProcessingURL
ProcessingURL --> CrawlAttempt: Start crawl
CrawlAttempt --> Success: HTTP 200
CrawlAttempt --> NetworkError: Connection failed
CrawlAttempt --> RateLimit: HTTP 429
CrawlAttempt --> ServerError: HTTP 5xx
CrawlAttempt --> Timeout: Request timeout
Success --> [*]: Return result
NetworkError --> RetryCheck: Check retry count
RateLimit --> BackoffWait: Apply exponential backoff
ServerError --> RetryCheck: Check retry count
Timeout --> RetryCheck: Check retry count
BackoffWait --> RetryCheck: After delay
RetryCheck --> CrawlAttempt: retries < max_retries
RetryCheck --> Failed: retries >= max_retries
Failed --> ErrorLog: Log failure details
ErrorLog --> [*]: Return failed result
note right of BackoffWait: Exponential backoff for rate limits
note right of RetryCheck: Configurable max_retries
note right of ErrorLog: Detailed error tracking
```
### Resource Management Timeline
```mermaid
gantt
title Multi-URL Crawling Resource Management
dateFormat X
axisFormat %s
section Memory Usage
Initialize Dispatcher :0, 1
Memory Monitoring :1, 10
Peak Usage Period :3, 7
Memory Cleanup :7, 9
section Task Execution
URL Queue Setup :0, 2
Batch 1 Processing :2, 5
Batch 2 Processing :4, 7
Batch 3 Processing :6, 9
Final Results :9, 10
section Rate Limiting
Normal Delays :2, 4
Backoff Period :4, 6
Recovery Period :6, 8
section Monitoring
System Health Check :0, 10
Progress Updates :1, 9
Performance Metrics :2, 8
```
### Concurrent Processing Performance Matrix
```mermaid
graph TD
subgraph "Input Factors"
A[Number of URLs]
B[Concurrency Level]
C[Memory Threshold]
D[Rate Limiting]
end
subgraph "Processing Characteristics"
A --> E[Low 1-100 URLs]
A --> F[Medium 100-1k URLs]
A --> G[High 1k-10k URLs]
A --> H[Very High 10k+ URLs]
B --> I[Conservative 1-5]
B --> J[Moderate 5-15]
B --> K[Aggressive 15-30]
C --> L[Strict 60-70%]
C --> M[Balanced 70-80%]
C --> N[Relaxed 80-90%]
end
subgraph "Recommended Configurations"
E --> O[Simple Semaphore]
F --> P[Memory Adaptive Basic]
G --> Q[Memory Adaptive Advanced]
H --> R[Memory Adaptive + Monitoring]
I --> O
J --> P
K --> Q
K --> R
L --> Q
M --> P
N --> O
end
style O fill:#c8e6c9
style P fill:#fff3e0
style Q fill:#ffecb3
style R fill:#ffcdd2
```
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)

View File

@@ -0,0 +1,411 @@
## Simple Crawling Workflows and Data Flow
Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.
### Basic Crawling Sequence
```mermaid
sequenceDiagram
participant User
participant Crawler as AsyncWebCrawler
participant Browser as Browser Instance
participant Page as Web Page
participant Processor as Content Processor
User->>Crawler: Create with BrowserConfig
Crawler->>Browser: Launch browser instance
Browser-->>Crawler: Browser ready
User->>Crawler: arun(url, CrawlerRunConfig)
Crawler->>Browser: Create new page/context
Browser->>Page: Navigate to URL
Page-->>Browser: Page loaded
Browser->>Processor: Extract raw HTML
Processor->>Processor: Clean HTML
Processor->>Processor: Generate markdown
Processor->>Processor: Extract media/links
Processor-->>Crawler: CrawlResult created
Crawler-->>User: Return CrawlResult
Note over User,Processor: All processing happens asynchronously
```
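The sequence above corresponds to only a few lines of user code:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://example.com", config=CrawlerRunConfig())
        if result.success:
            print(result.markdown.raw_markdown[:300])
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```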
### Crawling Configuration Flow
```mermaid
flowchart TD
A[Start Crawling] --> B{Browser Config Set?}
B -->|No| B1[Use Default BrowserConfig]
B -->|Yes| B2[Custom BrowserConfig]
B1 --> C[Launch Browser]
B2 --> C
C --> D{Crawler Run Config Set?}
D -->|No| D1[Use Default CrawlerRunConfig]
D -->|Yes| D2[Custom CrawlerRunConfig]
D1 --> E[Navigate to URL]
D2 --> E
E --> F{Page Load Success?}
F -->|No| F1[Return Error Result]
F -->|Yes| G[Apply Content Filters]
G --> G1{excluded_tags set?}
G1 -->|Yes| G2[Remove specified tags]
G1 -->|No| G3[Keep all tags]
G2 --> G4{css_selector set?}
G3 --> G4
G4 -->|Yes| G5[Extract selected elements]
G4 -->|No| G6[Process full page]
G5 --> H[Generate Markdown]
G6 --> H
H --> H1{markdown_generator set?}
H1 -->|Yes| H2[Use custom generator]
H1 -->|No| H3[Use default generator]
H2 --> I[Extract Media and Links]
H3 --> I
I --> I1{process_iframes?}
I1 -->|Yes| I2[Include iframe content]
I1 -->|No| I3[Skip iframes]
I2 --> J[Create CrawlResult]
I3 --> J
J --> K[Return Result]
style A fill:#e1f5fe
style K fill:#c8e6c9
style F1 fill:#ffcdd2
```
### CrawlResult Data Structure
```mermaid
graph TB
subgraph "CrawlResult Object"
A[CrawlResult] --> B[Basic Info]
A --> C[Content Variants]
A --> D[Extracted Data]
A --> E[Media Assets]
A --> F[Optional Outputs]
B --> B1[url: Final URL]
B --> B2[success: Boolean]
B --> B3[status_code: HTTP Status]
B --> B4[error_message: Error Details]
C --> C1[html: Raw HTML]
C --> C2[cleaned_html: Sanitized HTML]
C --> C3[markdown: MarkdownGenerationResult]
C3 --> C3A[raw_markdown: Basic conversion]
C3 --> C3B[markdown_with_citations: With references]
C3 --> C3C[fit_markdown: Filtered content]
C3 --> C3D[references_markdown: Citation list]
D --> D1[links: Internal/External]
D --> D2[media: Images/Videos/Audio]
D --> D3[metadata: Page info]
D --> D4[extracted_content: JSON data]
D --> D5[tables: Structured table data]
E --> E1[screenshot: Base64 image]
E --> E2[pdf: PDF bytes]
E --> E3[mhtml: Archive file]
E --> E4[downloaded_files: File paths]
F --> F1[session_id: Browser session]
F --> F2[ssl_certificate: Security info]
F --> F3[response_headers: HTTP headers]
F --> F4[network_requests: Traffic log]
F --> F5[console_messages: Browser logs]
end
style A fill:#e3f2fd
style C3 fill:#f3e5f5
style D5 fill:#e8f5e8
```
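In practice these fields are plain attribute access; a sketch assuming a successful crawl with default config:
```python
# Assumes an open AsyncWebCrawler context from the example above
result = await crawler.arun("https://example.com")

# Basic info
print(result.url, result.status_code, result.success)

# Content variants
raw = result.html
markdown = result.markdown.raw_markdown
cited = result.markdown.markdown_with_citations

# Extracted data
internal_links = result.links.get("internal", [])
images = result.media.get("images", [])
tables = result.tables

# Optional outputs (populated only when requested in the config)
screenshot_b64 = result.screenshot
pdf_bytes = result.pdf
```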
### Content Processing Pipeline
```mermaid
flowchart LR
subgraph "Input Sources"
A1[Web URL]
A2[Raw HTML]
A3[Local File]
end
A1 --> B[Browser Navigation]
A2 --> C[Direct Processing]
A3 --> C
B --> D[Raw HTML Capture]
C --> D
D --> E{Content Filtering}
E --> E1[Remove Scripts/Styles]
E --> E2[Apply excluded_tags]
E --> E3[Apply css_selector]
E --> E4[Remove overlay elements]
E1 --> F[Cleaned HTML]
E2 --> F
E3 --> F
E4 --> F
F --> G{Markdown Generation}
G --> G1[HTML to Markdown]
G --> G2[Apply Content Filter]
G --> G3[Generate Citations]
G1 --> H[MarkdownGenerationResult]
G2 --> H
G3 --> H
F --> I{Media Extraction}
I --> I1[Find Images]
I --> I2[Find Videos/Audio]
I --> I3[Score Relevance]
I1 --> J[Media Dictionary]
I2 --> J
I3 --> J
F --> K{Link Extraction}
K --> K1[Internal Links]
K --> K2[External Links]
K --> K3[Apply Link Filters]
K1 --> L[Links Dictionary]
K2 --> L
K3 --> L
H --> M[Final CrawlResult]
J --> M
L --> M
style D fill:#e3f2fd
style F fill:#f3e5f5
style H fill:#e8f5e8
style M fill:#c8e6c9
```
### Table Extraction Workflow
```mermaid
stateDiagram-v2
[*] --> DetectTables
DetectTables --> ScoreTables: Find table elements
ScoreTables --> EvaluateThreshold: Calculate quality scores
EvaluateThreshold --> PassThreshold: score >= table_score_threshold
EvaluateThreshold --> RejectTable: score < threshold
PassThreshold --> ExtractHeaders: Parse table structure
ExtractHeaders --> ExtractRows: Get header cells
ExtractRows --> ExtractMetadata: Get data rows
ExtractMetadata --> CreateTableObject: Get caption/summary
CreateTableObject --> AddToResult: {headers, rows, caption, summary}
AddToResult --> [*]: Table extraction complete
RejectTable --> [*]: Table skipped
note right of ScoreTables : Factors: header presence, data density, structure quality
note right of EvaluateThreshold : Threshold 1-10, higher = stricter
```
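A short sketch of the threshold in use; table objects carry the `{headers, rows, caption, summary}` keys from the diagram:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_tables(url):
    # Threshold scale is 1-10; higher = stricter quality filtering
    config = CrawlerRunConfig(table_score_threshold=7)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        for table in result.tables:
            print(f"Caption: {table.get('caption', 'n/a')}")
            print(f"Headers: {table['headers']}")
            print(f"Rows:    {len(table['rows'])}")
```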
### Error Handling Decision Tree
```mermaid
flowchart TD
A[Start Crawl] --> B[Navigate to URL]
B --> C{Navigation Success?}
C -->|Network Error| C1[Set error_message: Network failure]
C -->|Timeout| C2[Set error_message: Page timeout]
C -->|Invalid URL| C3[Set error_message: Invalid URL format]
C -->|Success| D[Process Page Content]
C1 --> E[success = False]
C2 --> E
C3 --> E
D --> F{Content Processing OK?}
F -->|Parser Error| F1[Set error_message: HTML parsing failed]
F -->|Memory Error| F2[Set error_message: Insufficient memory]
F -->|Success| G[Generate Outputs]
F1 --> E
F2 --> E
G --> H{Output Generation OK?}
H -->|Markdown Error| H1[Partial success with warnings]
H -->|Extraction Error| H2[Partial success with warnings]
H -->|Success| I[success = True]
H1 --> I
H2 --> I
E --> J[Return Failed CrawlResult]
I --> K[Return Successful CrawlResult]
J --> L[User Error Handling]
K --> M[User Result Processing]
L --> L1{Check error_message}
L1 -->|Network| L2[Retry with different config]
L1 -->|Timeout| L3[Increase page_timeout]
L1 -->|Parser| L4[Try different scraping_strategy]
style E fill:#ffcdd2
style I fill:#c8e6c9
style J fill:#ffcdd2
style K fill:#c8e6c9
```
### Configuration Impact Matrix
```mermaid
graph TB
subgraph "Configuration Categories"
A[Content Processing]
B[Page Interaction]
C[Output Generation]
D[Performance]
end
subgraph "Configuration Options"
A --> A1[word_count_threshold]
A --> A2[excluded_tags]
A --> A3[css_selector]
A --> A4[exclude_external_links]
B --> B1[process_iframes]
B --> B2[remove_overlay_elements]
B --> B3[scan_full_page]
B --> B4[wait_for]
C --> C1[screenshot]
C --> C2[pdf]
C --> C3[markdown_generator]
C --> C4[table_score_threshold]
D --> D1[cache_mode]
D --> D2[verbose]
D --> D3[page_timeout]
D --> D4[semaphore_count]
end
subgraph "Result Impact"
A1 --> R1[Filters short text blocks]
A2 --> R2[Removes specified HTML tags]
A3 --> R3[Focuses on selected content]
A4 --> R4[Cleans links dictionary]
B1 --> R5[Includes iframe content]
B2 --> R6[Removes popups/modals]
B3 --> R7[Loads dynamic content]
B4 --> R8[Waits for specific elements]
C1 --> R9[Adds screenshot field]
C2 --> R10[Adds pdf field]
C3 --> R11[Custom markdown processing]
C4 --> R12[Filters table quality]
D1 --> R13[Controls caching behavior]
D2 --> R14[Detailed logging output]
D3 --> R15[Prevents timeout errors]
D4 --> R16[Limits concurrent operations]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
```
### Raw HTML and Local File Processing
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Processor
participant FileSystem
Note over User,FileSystem: Raw HTML Processing
User->>Crawler: arun("raw://html_content")
Crawler->>Processor: Parse raw HTML directly
Processor->>Processor: Apply same content filters
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Local File Processing
User->>Crawler: arun("file:///path/to/file.html")
Crawler->>FileSystem: Read local file
FileSystem-->>Crawler: File content
Crawler->>Processor: Process file HTML
Processor->>Processor: Apply content processing
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Both return identical CrawlResult structure
```
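The URL-scheme prefix selects the input path; a sketch (the local file path is a placeholder):
```python
# Assumes an open AsyncWebCrawler context
result_web  = await crawler.arun("https://example.com")
result_raw  = await crawler.arun("raw://<h1>Hello</h1><p>Inline HTML</p>")
result_file = await crawler.arun("file:///path/to/file.html")

# All three return the same CrawlResult structure
for r in (result_web, result_raw, result_file):
    print(r.success, len(r.markdown.raw_markdown))
```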
### Comprehensive Processing Example Flow
```mermaid
flowchart TD
A[Input: example.com] --> B[Create Configurations]
B --> B1[BrowserConfig verbose=True]
B --> B2[CrawlerRunConfig with filters]
B1 --> C[Launch AsyncWebCrawler]
B2 --> C
C --> D[Navigate and Process]
D --> E{Check Success}
E -->|Failed| E1[Print Error Message]
E -->|Success| F[Extract Content Summary]
F --> F1[Get Page Title]
F --> F2[Get Content Preview]
F --> F3[Process Media Items]
F --> F4[Process Links]
F3 --> F3A[Count Images]
F3 --> F3B[Show First 3 Images]
F4 --> F4A[Count Internal Links]
F4 --> F4B[Show First 3 Links]
F1 --> G[Display Results]
F2 --> G
F3A --> G
F3B --> G
F4A --> G
F4B --> G
E1 --> H[End with Error]
G --> I[End with Success]
style E1 fill:#ffcdd2
style G fill:#c8e6c9
style H fill:#ffcdd2
style I fill:#c8e6c9
```
**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)

View File

@@ -0,0 +1,441 @@
## URL Seeding Workflows and Architecture
Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.
### URL Seeding vs Deep Crawling Strategy Comparison
```mermaid
graph TB
subgraph "Deep Crawling Approach"
A1[Start URL] --> A2[Load Page]
A2 --> A3[Extract Links]
A3 --> A4{More Links?}
A4 -->|Yes| A5[Queue Next Page]
A5 --> A2
A4 -->|No| A6[Complete]
A7[⏱️ Real-time Discovery]
A8[🐌 Sequential Processing]
A9[🔍 Limited by Page Structure]
A10[💾 High Memory Usage]
end
subgraph "URL Seeding Approach"
B1[Domain Input] --> B2[Query Sitemap]
B1 --> B3[Query Common Crawl]
B2 --> B4[Merge Results]
B3 --> B4
B4 --> B5[Apply Filters]
B5 --> B6[Score Relevance]
B6 --> B7[Rank Results]
B7 --> B8[Select Top URLs]
B9[⚡ Instant Discovery]
B10[🚀 Parallel Processing]
B11[🎯 Pattern-based Filtering]
B12[💡 Smart Relevance Scoring]
end
style A1 fill:#ffecb3
style B1 fill:#e8f5e8
style A6 fill:#ffcdd2
style B8 fill:#c8e6c9
```
### URL Discovery Data Flow
```mermaid
sequenceDiagram
participant User
participant Seeder as AsyncUrlSeeder
participant SM as Sitemap
participant CC as Common Crawl
participant Filter as URL Filter
participant Scorer as BM25 Scorer
User->>Seeder: urls("example.com", config)
par Parallel Data Sources
Seeder->>SM: Fetch sitemap.xml
SM-->>Seeder: 500 URLs
and
Seeder->>CC: Query Common Crawl
CC-->>Seeder: 2000 URLs
end
Seeder->>Seeder: Merge and deduplicate
Note over Seeder: 2200 unique URLs
Seeder->>Filter: Apply pattern filter
Filter-->>Seeder: 800 matching URLs
alt extract_head=True
loop For each URL
Seeder->>Seeder: Extract <head> metadata
end
Note over Seeder: Title, description, keywords
end
alt query provided
Seeder->>Scorer: Calculate relevance scores
Scorer-->>Seeder: Scored URLs
Seeder->>Seeder: Filter by score_threshold
Note over Seeder: 200 relevant URLs
end
Seeder->>Seeder: Sort by relevance
Seeder->>Seeder: Apply max_urls limit
Seeder-->>User: Top 100 URLs ready for crawling
```
### SeedingConfig Decision Tree
```mermaid
flowchart TD
A[SeedingConfig Setup] --> B{Data Source Strategy?}
B -->|Fast & Official| C[source="sitemap"]
B -->|Comprehensive| D[source="cc"]
B -->|Maximum Coverage| E[source="sitemap+cc"]
C --> F{Need Filtering?}
D --> F
E --> F
F -->|Yes| G[Set URL Pattern]
F -->|No| H[pattern="*"]
G --> I{Pattern Examples}
I --> I1[pattern="*/blog/*"]
I --> I2[pattern="*/docs/api/*"]
I --> I3[pattern="*.pdf"]
I --> I4[pattern="*/product/*"]
H --> J{Need Metadata?}
I1 --> J
I2 --> J
I3 --> J
I4 --> J
J -->|Yes| K[extract_head=True]
J -->|No| L[extract_head=False]
K --> M{Need Validation?}
L --> M
M -->|Yes| N[live_check=True]
M -->|No| O[live_check=False]
N --> P{Need Relevance Scoring?}
O --> P
P -->|Yes| Q[Set Query + BM25]
P -->|No| R[Skip Scoring]
Q --> S[query="search terms"]
S --> T[scoring_method="bm25"]
T --> U[score_threshold=0.3]
R --> V[Performance Tuning]
U --> V
V --> W[Set max_urls]
W --> X[Set concurrency]
X --> Y[Set hits_per_sec]
Y --> Z[Configuration Complete]
style A fill:#e3f2fd
style Z fill:#c8e6c9
style K fill:#fff3e0
style N fill:#fff3e0
style Q fill:#f3e5f5
```
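Walking the decision tree end to end yields a configuration like the following sketch (domain and values are illustrative):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def discover_urls():
    config = SeedingConfig(
        source="sitemap+cc",            # maximum coverage
        pattern="*/blog/*",             # filter early by URL pattern
        extract_head=True,              # pull title/description/keywords
        query="python async tutorial",  # enables BM25 relevance scoring
        scoring_method="bm25",
        score_threshold=0.3,
        max_urls=100
    )
    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
    for entry in urls[:5]:
        print(entry["url"], entry.get("relevance_score"))

# asyncio.run(discover_urls())
```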
### BM25 Relevance Scoring Pipeline
```mermaid
graph TB
subgraph "Text Corpus Preparation"
A1[URL Collection] --> A2[Extract Metadata]
A2 --> A3[Title + Description + Keywords]
A3 --> A4[Tokenize Text]
A4 --> A5[Remove Stop Words]
A5 --> A6[Create Document Corpus]
end
subgraph "BM25 Algorithm"
B1[Query Terms] --> B2[Term Frequency Calculation]
A6 --> B2
B2 --> B3[Inverse Document Frequency]
B3 --> B4[BM25 Score Calculation]
B4 --> B5[Score = Σ(IDF × TF × K1+1)/(TF + K1×(1-b+b×|d|/avgdl))]
end
subgraph "Scoring Results"
B5 --> C1[URL Relevance Scores]
C1 --> C2{Score ≥ Threshold?}
C2 -->|Yes| C3[Include in Results]
C2 -->|No| C4[Filter Out]
C3 --> C5[Sort by Score DESC]
C5 --> C6[Return Top URLs]
end
subgraph "Example Scores"
D1["python async tutorial" → 0.85]
D2["python documentation" → 0.72]
D3["javascript guide" → 0.23]
D4["contact us page" → 0.05]
end
style B5 fill:#e3f2fd
style C6 fill:#c8e6c9
style D1 fill:#c8e6c9
style D2 fill:#c8e6c9
style D3 fill:#ffecb3
style D4 fill:#ffcdd2
```
### Multi-Domain Discovery Architecture
```mermaid
graph TB
subgraph "Input Layer"
A1[Domain List]
A2[SeedingConfig]
A3[Query Terms]
end
subgraph "Discovery Engine"
B1[AsyncUrlSeeder]
B2[Parallel Workers]
B3[Rate Limiter]
B4[Memory Manager]
end
subgraph "Data Sources"
C1[Sitemap Fetcher]
C2[Common Crawl API]
C3[Live URL Checker]
C4[Metadata Extractor]
end
subgraph "Processing Pipeline"
D1[URL Deduplication]
D2[Pattern Filtering]
D3[Relevance Scoring]
D4[Quality Assessment]
end
subgraph "Output Layer"
E1[Scored URL Lists]
E2[Domain Statistics]
E3[Performance Metrics]
E4[Cache Storage]
end
A1 --> B1
A2 --> B1
A3 --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B2 --> C1
B2 --> C2
B2 --> C3
B2 --> C4
C1 --> D1
C2 --> D1
C3 --> D2
C4 --> D3
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> E1
B4 --> E2
B3 --> E3
D1 --> E4
style B1 fill:#e3f2fd
style D3 fill:#f3e5f5
style E1 fill:#c8e6c9
```
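For several domains at once, `many_urls` fans discovery out in parallel; a sketch (the per-domain dict shape is assumed from the architecture above):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def discover_many():
    config = SeedingConfig(source="sitemap+cc", pattern="*/docs/*", max_urls=50)
    async with AsyncUrlSeeder() as seeder:
        # Results keyed by domain, one URL list each
        results = await seeder.many_urls(["example.com", "example.org"], config)
    for domain, urls in results.items():
        print(f"{domain}: {len(urls)} URLs discovered")

# asyncio.run(discover_many())
```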
### Complete Discovery-to-Crawl Pipeline
```mermaid
stateDiagram-v2
[*] --> Discovery
Discovery --> SourceSelection: Configure data sources
SourceSelection --> Sitemap: source="sitemap"
SourceSelection --> CommonCrawl: source="cc"
SourceSelection --> Both: source="sitemap+cc"
Sitemap --> URLCollection
CommonCrawl --> URLCollection
Both --> URLCollection
URLCollection --> Filtering: Apply patterns
Filtering --> MetadataExtraction: extract_head=True
Filtering --> LiveValidation: extract_head=False
MetadataExtraction --> LiveValidation: live_check=True
MetadataExtraction --> RelevanceScoring: live_check=False
LiveValidation --> RelevanceScoring
RelevanceScoring --> ResultRanking: query provided
RelevanceScoring --> ResultLimiting: no query
ResultRanking --> ResultLimiting: apply score_threshold
ResultLimiting --> URLSelection: apply max_urls
URLSelection --> CrawlPreparation: URLs ready
CrawlPreparation --> CrawlExecution: AsyncWebCrawler
CrawlExecution --> StreamProcessing: stream=True
CrawlExecution --> BatchProcessing: stream=False
StreamProcessing --> [*]
BatchProcessing --> [*]
note right of Discovery : 🔍 Smart URL Discovery
note right of URLCollection : 📚 Merge & Deduplicate
note right of RelevanceScoring : 🎯 BM25 Algorithm
note right of CrawlExecution : 🕷️ High-Performance Crawling
```
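Putting the pipeline together: discover first, then crawl only the selected URLs; a sketch combining the seeder and crawler APIs shown elsewhere in this guide:
```python
from crawl4ai import AsyncWebCrawler, AsyncUrlSeeder, SeedingConfig, CrawlerRunConfig

async def discover_then_crawl(domain: str, query: str):
    # Stage 1: discovery - select the best candidate URLs up front
    seed_config = SeedingConfig(
        source="sitemap+cc", query=query, scoring_method="bm25",
        score_threshold=0.3, extract_head=True, max_urls=50
    )
    async with AsyncUrlSeeder() as seeder:
        candidates = await seeder.urls(domain, seed_config)
    urls = [c["url"] for c in candidates]

    # Stage 2: crawl only the selected URLs, streaming results
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls, config=CrawlerRunConfig(stream=True)
        ):
            print(result.url, result.success)
```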
### Performance Optimization Strategies
```mermaid
graph LR
subgraph "Input Optimization"
A1[Smart Source Selection] --> A2[Sitemap First]
A2 --> A3[Add CC if Needed]
A3 --> A4[Pattern Filtering Early]
end
subgraph "Processing Optimization"
B1[Parallel Workers] --> B2[Bounded Queues]
B2 --> B3[Rate Limiting]
B3 --> B4[Memory Management]
B4 --> B5[Lazy Evaluation]
end
subgraph "Output Optimization"
C1[Relevance Threshold] --> C2[Max URL Limits]
C2 --> C3[Caching Strategy]
C3 --> C4[Streaming Results]
end
subgraph "Performance Metrics"
D1[URLs/Second: 100-1000]
D2[Memory Usage: Bounded]
D3[Network Efficiency: 95%+]
D4[Cache Hit Rate: 80%+]
end
A4 --> B1
B5 --> C1
C4 --> D1
style A2 fill:#e8f5e8
style B2 fill:#e3f2fd
style C3 fill:#f3e5f5
style D3 fill:#c8e6c9
```
### URL Discovery vs Traditional Crawling Comparison
```mermaid
graph TB
subgraph "Traditional Approach"
T1[Start URL] --> T2[Crawl Page]
T2 --> T3[Extract Links]
T3 --> T4[Queue New URLs]
T4 --> T2
T5[❌ Time: Hours/Days]
T6[❌ Resource Heavy]
T7[❌ Depth Limited]
T8[❌ Discovery Bias]
end
subgraph "URL Seeding Approach"
S1[Domain Input] --> S2[Query All Sources]
S2 --> S3[Pattern Filter]
S3 --> S4[Relevance Score]
S4 --> S5[Select Best URLs]
S5 --> S6[Ready to Crawl]
S7[✅ Time: Seconds/Minutes]
S8[✅ Resource Efficient]
S9[✅ Complete Coverage]
S10[✅ Quality Focused]
end
subgraph "Use Case Decision Matrix"
U1[Small Sites < 1000 pages] --> U2[Use Deep Crawling]
U3[Large Sites > 10000 pages] --> U4[Use URL Seeding]
U5[Unknown Structure] --> U6[Start with Seeding]
U7[Real-time Discovery] --> U8[Use Deep Crawling]
U9[Quality over Quantity] --> U10[Use URL Seeding]
end
style S6 fill:#c8e6c9
style S7 fill:#c8e6c9
style S8 fill:#c8e6c9
style S9 fill:#c8e6c9
style S10 fill:#c8e6c9
style T5 fill:#ffcdd2
style T6 fill:#ffcdd2
style T7 fill:#ffcdd2
style T8 fill:#ffcdd2
```
### Data Source Characteristics and Selection
```mermaid
graph TB
subgraph "Sitemap Source"
SM1[📋 Official URL List]
SM2[⚡ Fast Response]
SM3[📅 Recently Updated]
SM4[🎯 High Quality URLs]
SM5[❌ May Miss Some Pages]
end
subgraph "Common Crawl Source"
CC1[🌐 Comprehensive Coverage]
CC2[📚 Historical Data]
CC3[🔍 Deep Discovery]
CC4[⏳ Slower Response]
CC5[🧹 May Include Noise]
end
subgraph "Combined Strategy"
CB1[🚀 Best of Both]
CB2[📊 Maximum Coverage]
CB3[✨ Automatic Deduplication]
CB4[⚖️ Balanced Performance]
end
subgraph "Selection Guidelines"
G1[Speed Critical → Sitemap Only]
G2[Coverage Critical → Common Crawl]
G3[Best Quality → Combined]
G4[Unknown Domain → Combined]
end
style SM2 fill:#c8e6c9
style SM4 fill:#c8e6c9
style CC1 fill:#e3f2fd
style CC3 fill:#e3f2fd
style CB1 fill:#f3e5f5
style CB3 fill:#f3e5f5
```
**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

View File

@@ -0,0 +1,295 @@
## CLI & Identity-Based Browsing
Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.
### Basic CLI Usage
```bash
# Simple crawling
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache
# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```
### Profile Management Commands
```bash
# Launch interactive profile manager
crwl profiles
# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website
# Use a specific profile for crawling
crwl https://example.com -p my-profile-name
# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```
### CDP Browser Management
```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp
# Use specific profile and custom port
crwl cdp -p my-profile -P 9223
# Launch headless browser with CDP
crwl cdp --headless
# Launch in incognito mode (ignores profile)
crwl cdp --incognito
# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```
### Builtin Browser Management
```bash
# Start persistent browser instance
crwl browser start
# Check browser status
crwl browser status
# Open visible window to see the browser
crwl browser view --url https://example.com
# Stop the browser
crwl browser stop
# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless
# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```
### Authentication Workflow Examples
```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → login to LinkedIn in browser → press 'q' to save
# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown
# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
-p linkedin-profile \
-j "Extract people profiles with names, titles, and companies" \
-b "headless=false"
# GitHub authenticated crawling
crwl profiles # Create github-profile
crwl https://github.com/settings/profile -p github-profile
# Twitter/X authenticated access
crwl profiles # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```
### Advanced CLI Configuration
```bash
# Complex crawling with multiple configs
crwl https://example.com \
-B browser.yml \
-C crawler.yml \
-e extract_llm.yml \
-s llm_schema.json \
-p my-auth-profile \
-o json \
-v
# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
-p auth-profile \
-j "Extract user dashboard data including metrics and notifications" \
-b "headless=true,viewport_width=1920"
# Content filtering with authentication
crwl https://members-only-site.com \
-p member-profile \
-f filter_bm25.yml \
-c "css_selector=.member-content,scan_full_page=true" \
-o markdown-fit
```
### Configuration Files for Identity Browsing
```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true
# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```
### Global Configuration Management
```bash
# List all configuration settings
crwl config list
# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"
# Set browser defaults
crwl config set BROWSER_HEADLESS false # Always show browser
crwl config set USER_AGENT_MODE random # Random user agents
# Enable verbose mode globally
crwl config set VERBOSE true
```
### Q&A with Authenticated Content
```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
-q "What are the key metrics shown in my dashboard?"
# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown # View content
crwl https://company-intranet.com -p work-profile \
-q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
-q "What are the upcoming deadlines?"
```
### Profile Creation Programmatically
```python
# Create profiles via Python API
import asyncio
from crawl4ai import BrowserProfiler
async def create_auth_profile():
profiler = BrowserProfiler()
# Create profile interactively (opens browser)
profile_path = await profiler.create_profile("linkedin-auth")
print(f"Profile created at: {profile_path}")
# List all profiles
profiles = profiler.list_profiles()
for profile in profiles:
print(f"Profile: {profile['name']} at {profile['path']}")
# Use profile for crawling
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir=profile_path
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://linkedin.com/feed")
return result
# asyncio.run(create_auth_profile())
```
### Identity Browsing Best Practices
```bash
# 1. Create specific profiles for different sites
crwl profiles # Create "linkedin-work"
crwl profiles # Create "github-personal"
crwl profiles # Create "company-intranet"
# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account
# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
-p secure-profile \
-b "headless=false,simulate_user=true,magic=true" \
-c "wait_for=.logged-in-indicator,page_timeout=30000"
# 4. Test profile before automated crawling
crwl cdp -p test-profile # Manually verify login status
crwl https://test-url.com -p test-profile -v # Verbose test crawl
```
### Troubleshooting Authentication Issues
```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
-b "headless=false,verbose=true" \
-c "verbose=true,page_timeout=60000" \
-v
# Check profile status
crwl profiles # List profiles and check creation dates
# Recreate problematic profiles
crwl profiles # Delete old profile, create new one
# Test with visible browser
crwl https://problem-site.com -p profile-name \
-b "headless=false" \
-c "delay_before_return_html=5"
```
### Common Use Cases
```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
-j "Extract latest tweets with sentiment and engagement metrics"
# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
-j "Extract product prices, availability, and descriptions"
# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
-c "css_selector=.dashboard-content" \
-q "What alerts or notifications need attention?"
# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
-e extract_research.yml \
-s research_schema.json \
-o json
```
**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)

File diff suppressed because it is too large

View File

@@ -0,0 +1,446 @@
## Deep Crawling Filters & Scorers
Advanced URL filtering and scoring strategies for intelligent deep crawling with performance optimization.
### URL Filters - Content and Domain Control
```python
from crawl4ai.deep_crawling.filters import (
URLPatternFilter, DomainFilter, ContentTypeFilter,
FilterChain, ContentRelevanceFilter, SEOFilter
)
# Pattern-based filtering
pattern_filter = URLPatternFilter(
patterns=[
"*.html", # HTML pages only
"*/blog/*", # Blog posts
"*/articles/*", # Article pages
"*2024*", # Recent content
"^https://example.com/docs/.*" # Regex pattern
],
use_glob=True,
reverse=False # False = include matching, True = exclude matching
)
# Domain filtering with subdomains
domain_filter = DomainFilter(
allowed_domains=["example.com", "docs.example.com"],
blocked_domains=["ads.example.com", "tracker.com"]
)
# Content type filtering
content_filter = ContentTypeFilter(
allowed_types=["text/html", "application/pdf"],
check_extension=True
)
# Apply individual filters
url = "https://example.com/blog/2024/article.html"
print(f"Pattern filter: {pattern_filter.apply(url)}")
print(f"Domain filter: {domain_filter.apply(url)}")
print(f"Content filter: {content_filter.apply(url)}")
```
### Filter Chaining - Combine Multiple Filters
```python
# Create filter chain for comprehensive filtering
filter_chain = FilterChain([
DomainFilter(allowed_domains=["example.com"]),
URLPatternFilter(patterns=["*/blog/*", "*/docs/*"]),
ContentTypeFilter(allowed_types=["text/html"])
])
# Apply chain to URLs
urls = [
"https://example.com/blog/post1.html",
"https://spam.com/content.html",
"https://example.com/blog/image.jpg",
"https://example.com/docs/guide.html"
]
async def filter_urls(urls, filter_chain):
filtered = []
for url in urls:
if await filter_chain.apply(url):
filtered.append(url)
return filtered
# Usage
filtered_urls = await filter_urls(urls, filter_chain)
print(f"Filtered URLs: {filtered_urls}")
# Check filter statistics
for filter_obj in filter_chain.filters:
stats = filter_obj.stats
print(f"{filter_obj.name}: {stats.passed_urls}/{stats.total_urls} passed")
```
### Advanced Content Filters
```python
# BM25-based content relevance filtering
relevance_filter = ContentRelevanceFilter(
query="python machine learning tutorial",
threshold=0.5, # Minimum relevance score
k1=1.2, # TF saturation parameter
b=0.75, # Length normalization
avgdl=1000 # Average document length
)
# SEO quality filtering
seo_filter = SEOFilter(
threshold=0.65, # Minimum SEO score
keywords=["python", "tutorial", "guide"],
weights={
"title_length": 0.15,
"title_kw": 0.18,
"meta_description": 0.12,
"canonical": 0.10,
"robot_ok": 0.20,
"schema_org": 0.10,
"url_quality": 0.15
}
)
# Apply advanced filters (both return pass/fail booleans)
url = "https://example.com/python-ml-tutorial"
passes_relevance = await relevance_filter.apply(url)
passes_seo = await seo_filter.apply(url)
print(f"Relevance filter: {passes_relevance}, SEO filter: {passes_seo}")
```
### URL Scorers - Quality and Relevance Scoring
```python
from crawl4ai.deep_crawling.scorers import (
KeywordRelevanceScorer, PathDepthScorer, ContentTypeScorer,
FreshnessScorer, DomainAuthorityScorer, CompositeScorer
)
# Keyword relevance scoring
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "tutorial", "guide", "machine", "learning"],
weight=1.0,
case_sensitive=False
)
# Path depth scoring (optimal depth = 3)
depth_scorer = PathDepthScorer(
optimal_depth=3, # /category/subcategory/article
weight=0.8
)
# Content type scoring
content_type_scorer = ContentTypeScorer(
type_weights={
"html": 1.0, # Highest priority
"pdf": 0.8, # Medium priority
"txt": 0.6, # Lower priority
"doc": 0.4 # Lowest priority
},
weight=0.9
)
# Freshness scoring
freshness_scorer = FreshnessScorer(
weight=0.7,
current_year=2024
)
# Domain authority scoring
domain_scorer = DomainAuthorityScorer(
domain_weights={
"python.org": 1.0,
"github.com": 0.9,
"stackoverflow.com": 0.85,
"medium.com": 0.7,
"personal-blog.com": 0.3
},
default_weight=0.5,
weight=1.0
)
# Score individual URLs
url = "https://python.org/tutorial/2024/machine-learning.html"
scores = {
"keyword": keyword_scorer.score(url),
"depth": depth_scorer.score(url),
"content": content_type_scorer.score(url),
"freshness": freshness_scorer.score(url),
"domain": domain_scorer.score(url)
}
print(f"Individual scores: {scores}")
```
### Composite Scoring - Combine Multiple Scorers
```python
# Create composite scorer combining all strategies
composite_scorer = CompositeScorer(
scorers=[
KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
PathDepthScorer(optimal_depth=3, weight=1.0),
ContentTypeScorer({"html": 1.0, "pdf": 0.8}, weight=1.2),
FreshnessScorer(weight=0.8, current_year=2024),
DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9
}, weight=1.3)
],
normalize=True # Normalize by number of scorers
)
# Score multiple URLs
urls_to_score = [
"https://python.org/tutorial/2024/basics.html",
"https://github.com/user/python-guide/blob/main/README.md",
"https://random-blog.com/old/2018/python-stuff.html",
"https://python.org/docs/deep/nested/advanced/guide.html"
]
scored_urls = []
for url in urls_to_score:
score = composite_scorer.score(url)
scored_urls.append((url, score))
# Sort by score (highest first)
scored_urls.sort(key=lambda x: x[1], reverse=True)
for url, score in scored_urls:
print(f"Score: {score:.3f} - {url}")
# Check scorer statistics
print(f"\nScoring statistics:")
print(f"URLs scored: {composite_scorer.stats._urls_scored}")
print(f"Average score: {composite_scorer.stats.get_average():.3f}")
```
### Advanced Filter Patterns
```python
# Complex pattern matching
advanced_patterns = URLPatternFilter(
patterns=[
r"^https://docs\.python\.org/\d+/", # Python docs with version
r".*/tutorial/.*\.html$", # Tutorial pages
r".*/guide/(?!deprecated).*", # Guides but not deprecated
"*/blog/{2020,2021,2022,2023,2024}/*", # Recent blog posts
"**/{api,reference}/**/*.html" # API/reference docs
],
use_glob=True
)
# Exclude patterns (reverse=True)
exclude_filter = URLPatternFilter(
patterns=[
"*/admin/*",
"*/login/*",
"*/private/*",
"**/.*", # Hidden files
"*.{jpg,png,gif,css,js}$" # Media and assets
],
reverse=True # Exclude matching patterns
)
# Content type with extension mapping
detailed_content_filter = ContentTypeFilter(
allowed_types=["text", "application"],
check_extension=True,
ext_map={
"html": "text/html",
"htm": "text/html",
"md": "text/markdown",
"pdf": "application/pdf",
"doc": "application/msword",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
}
)
```
### Performance-Optimized Filtering
```python
# High-performance filter chain for large-scale crawling
class OptimizedFilterChain:
def __init__(self):
# Fast filters first (domain, patterns)
self.fast_filters = [
DomainFilter(
allowed_domains=["example.com", "docs.example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*.html", "*.pdf", "*/blog/*", "*/docs/*"
])
]
# Slower filters last (content analysis)
self.slow_filters = [
ContentRelevanceFilter(
query="important content",
threshold=0.3
)
]
async def apply_optimized(self, url: str) -> bool:
# Apply fast filters first
for filter_obj in self.fast_filters:
if not filter_obj.apply(url):
return False
# Only apply slow filters if fast filters pass
for filter_obj in self.slow_filters:
if not await filter_obj.apply(url):
return False
return True
# Batch filtering with concurrency
async def batch_filter_urls(urls, filter_chain, max_concurrent=50):
import asyncio
semaphore = asyncio.Semaphore(max_concurrent)
async def filter_single(url):
async with semaphore:
return await filter_chain.apply(url), url
tasks = [filter_single(url) for url in urls]
results = await asyncio.gather(*tasks)
return [url for passed, url in results if passed]
# Usage with 1000 URLs
large_url_list = [f"https://example.com/page{i}.html" for i in range(1000)]
optimized_chain = OptimizedFilterChain()
filtered = await batch_filter_urls(large_url_list, optimized_chain)
```
### Custom Filter Implementation
```python
from crawl4ai.deep_crawling.filters import URLFilter
import re
class CustomLanguageFilter(URLFilter):
"""Filter URLs by language indicators"""
def __init__(self, allowed_languages=["en"], weight=1.0):
super().__init__()
self.allowed_languages = set(allowed_languages)
self.lang_patterns = {
"en": re.compile(r"/en/|/english/|lang=en"),
"es": re.compile(r"/es/|/spanish/|lang=es"),
"fr": re.compile(r"/fr/|/french/|lang=fr"),
"de": re.compile(r"/de/|/german/|lang=de")
}
def apply(self, url: str) -> bool:
# Default to English if no language indicators
if not any(pattern.search(url) for pattern in self.lang_patterns.values()):
result = "en" in self.allowed_languages
self._update_stats(result)
return result
# Check for allowed languages
for lang in self.allowed_languages:
if lang in self.lang_patterns:
if self.lang_patterns[lang].search(url):
self._update_stats(True)
return True
self._update_stats(False)
return False
# Custom scorer implementation
from crawl4ai.deep_crawling.scorers import URLScorer
class CustomComplexityScorer(URLScorer):
"""Score URLs by content complexity indicators"""
def __init__(self, weight=1.0):
super().__init__(weight)
self.complexity_indicators = {
"tutorial": 0.9,
"guide": 0.8,
"example": 0.7,
"reference": 0.6,
"api": 0.5
}
def _calculate_score(self, url: str) -> float:
url_lower = url.lower()
max_score = 0.0
for indicator, score in self.complexity_indicators.items():
if indicator in url_lower:
max_score = max(max_score, score)
return max_score
# Use custom filters and scorers
custom_filter = CustomLanguageFilter(allowed_languages=["en", "es"])
custom_scorer = CustomComplexityScorer(weight=1.2)
url = "https://example.com/en/tutorial/advanced-guide.html"
passes_filter = custom_filter.apply(url)
complexity_score = custom_scorer.score(url)
print(f"Passes language filter: {passes_filter}")
print(f"Complexity score: {complexity_score}")
```
### Integration with Deep Crawling
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
async def deep_crawl_with_filtering():
# Create comprehensive filter chain
filter_chain = FilterChain([
DomainFilter(allowed_domains=["python.org"]),
URLPatternFilter(["*/tutorial/*", "*/guide/*", "*/docs/*"]),
ContentTypeFilter(["text/html"]),
SEOFilter(threshold=0.6, keywords=["python", "programming"])
])
# Create composite scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
FreshnessScorer(weight=0.8),
PathDepthScorer(optimal_depth=3, weight=1.0)
], normalize=True)
    # Configure deep crawl strategy with filters and scorers
    # (BFSDeepCrawlStrategy is a concrete strategy; DeepCrawlStrategy is the abstract base)
    deep_strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        filter_chain=filter_chain,
        url_scorer=scorer,
        score_threshold=0.6  # Only crawl URLs scoring above 0.6
    )
config = CrawlerRunConfig(
deep_crawl_strategy=deep_strategy,
cache_mode=CacheMode.BYPASS
)
    async with AsyncWebCrawler() as crawler:
        # In batch mode a deep crawl returns a list of CrawlResult objects
        results = await crawler.arun(
            url="https://python.org",
            config=config
        )
        print(f"Deep crawl completed: {len(results)} pages")
        for result in results[:5]:
            print(f"  depth={result.metadata.get('depth', 0)} | {result.url}")
# Run the deep crawl
await deep_crawl_with_filtering()
```
**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Custom Filter Development](https://docs.crawl4ai.com/advanced/custom-filters/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/)

View File

@@ -0,0 +1,348 @@
## Deep Crawling
Multi-level website exploration with intelligent filtering, scoring, and prioritization strategies.
### Basic Deep Crawl Setup
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
# Basic breadth-first deep crawling
async def basic_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2, # Initial page + 2 levels
include_external=False # Stay within same domain
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://docs.crawl4ai.com", config=config)
# Group results by depth
pages_by_depth = {}
for result in results:
depth = result.metadata.get("depth", 0)
if depth not in pages_by_depth:
pages_by_depth[depth] = []
pages_by_depth[depth].append(result.url)
print(f"Crawled {len(results)} pages total")
for depth, urls in sorted(pages_by_depth.items()):
print(f"Depth {depth}: {len(urls)} pages")
```
### Deep Crawl Strategies
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Breadth-First Search - explores all links at one depth before going deeper
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=50, # Limit total pages
score_threshold=0.3 # Minimum score for URLs
)
# Depth-First Search - explores as deep as possible before backtracking
dfs_strategy = DFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=30,
score_threshold=0.5
)
# Best-First - prioritizes highest scoring pages (recommended)
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration"],
weight=0.7
)
best_first_strategy = BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer,
max_pages=25 # No score_threshold needed - naturally prioritizes
)
# Usage
config = CrawlerRunConfig(
deep_crawl_strategy=best_first_strategy, # Choose your strategy
scraping_strategy=LXMLWebScrapingStrategy()
)
```
### Streaming vs Batch Processing
```python
# Batch mode - wait for all results
async def batch_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=False # Default - collect all results first
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://example.com", config=config)
# Process all results at once
for result in results:
print(f"Batch processed: {result.url}")
# Streaming mode - process results as they arrive
async def streaming_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=True # Process results immediately
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://example.com", config=config):
depth = result.metadata.get("depth", 0)
print(f"Stream processed depth {depth}: {result.url}")
```
### Filtering with Filter Chains
```python
from crawl4ai.deep_crawling.filters import (
FilterChain,
URLPatternFilter,
DomainFilter,
ContentTypeFilter,
SEOFilter,
ContentRelevanceFilter
)
# Single URL pattern filter
url_filter = URLPatternFilter(patterns=["*core*", "*guide*"])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([url_filter])
)
)
# Multiple filters in chain
advanced_filter_chain = FilterChain([
# Domain filtering
DomainFilter(
allowed_domains=["docs.example.com"],
blocked_domains=["old.docs.example.com", "staging.example.com"]
),
# URL pattern matching
URLPatternFilter(patterns=["*tutorial*", "*guide*", "*blog*"]),
# Content type filtering
ContentTypeFilter(allowed_types=["text/html"]),
# SEO quality filter
SEOFilter(
threshold=0.5,
keywords=["tutorial", "guide", "documentation"]
),
# Content relevance filter
ContentRelevanceFilter(
query="Web crawling and data extraction with Python",
threshold=0.7
)
])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
filter_chain=advanced_filter_chain
)
)
```
### Intelligent Crawling with Scorers
```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Keyword relevance scoring
async def scored_deep_crawl():
keyword_scorer = KeywordRelevanceScorer(
keywords=["browser", "crawler", "web", "automation"],
weight=1.0
)
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer
),
stream=True, # Recommended with BestFirst
verbose=True
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
```
### Limiting Crawl Size
```python
# Max pages limitation across strategies
async def limited_crawls():
# BFS with page limit
bfs_config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5, # Only crawl 5 pages total
url_scorer=KeywordRelevanceScorer(keywords=["browser", "crawler"], weight=1.0)
)
)
# DFS with score threshold
dfs_config = CrawlerRunConfig(
deep_crawl_strategy=DFSDeepCrawlStrategy(
max_depth=2,
score_threshold=0.7, # Only URLs with scores above 0.7
max_pages=10,
url_scorer=KeywordRelevanceScorer(keywords=["web", "automation"], weight=1.0)
)
)
# Best-First with both constraints
bf_config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
max_pages=7, # Automatically gets highest scored pages
url_scorer=KeywordRelevanceScorer(keywords=["crawl", "example"], weight=1.0)
),
stream=True
)
async with AsyncWebCrawler() as crawler:
# Use any of the configs
async for result in await crawler.arun("https://docs.crawl4ai.com", config=bf_config):
score = result.metadata.get("score", 0)
print(f"Score: {score:.2f} | {result.url}")
```
### Complete Advanced Deep Crawler
```python
import time
from crawl4ai import CacheMode

async def comprehensive_deep_crawl():
# Sophisticated filter chain
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["docs.crawl4ai.com"],
blocked_domains=["old.docs.crawl4ai.com"]
),
URLPatternFilter(patterns=["*core*", "*advanced*", "*blog*"]),
ContentTypeFilter(allowed_types=["text/html"]),
SEOFilter(threshold=0.4, keywords=["crawl", "tutorial", "guide"])
])
# Multi-keyword scorer
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration", "browser"],
weight=0.8
)
# Complete configuration
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
filter_chain=filter_chain,
url_scorer=keyword_scorer,
max_pages=20
),
scraping_strategy=LXMLWebScrapingStrategy(),
stream=True,
verbose=True,
cache_mode=CacheMode.BYPASS
)
# Execute and analyze
results = []
start_time = time.time()
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
results.append(result)
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f"→ Depth: {depth} | Score: {score:.2f} | {result.url}")
# Performance analysis
duration = time.time() - start_time
avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)
print(f"✅ Crawled {len(results)} pages in {duration:.2f}s")
print(f"✅ Average relevance score: {avg_score:.2f}")
# Depth distribution
depth_counts = {}
for result in results:
depth = result.metadata.get("depth", 0)
depth_counts[depth] = depth_counts.get(depth, 0) + 1
for depth, count in sorted(depth_counts.items()):
print(f"📊 Depth {depth}: {count} pages")
```
### Error Handling and Robustness
```python
async def robust_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
max_pages=15,
url_scorer=KeywordRelevanceScorer(keywords=["guide", "tutorial"])
),
stream=True,
page_timeout=30000 # 30 second timeout per page
)
successful_pages = []
failed_pages = []
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
if result.success:
successful_pages.append(result)
depth = result.metadata.get("depth", 0)
score = result.metadata.get("score", 0)
print(f"✅ Depth {depth} | Score: {score:.2f} | {result.url}")
else:
failed_pages.append({
'url': result.url,
'error': result.error_message,
'depth': result.metadata.get("depth", 0)
})
print(f"❌ Failed: {result.url} - {result.error_message}")
print(f"📊 Results: {len(successful_pages)} successful, {len(failed_pages)} failed")
# Analyze failures by depth
if failed_pages:
failure_by_depth = {}
for failure in failed_pages:
depth = failure['depth']
failure_by_depth[depth] = failure_by_depth.get(depth, 0) + 1
print("❌ Failures by depth:")
for depth, count in sorted(failure_by_depth.items()):
print(f" Depth {depth}: {count} failures")
```
**📖 Learn more:** [Deep Crawling Guide](https://docs.crawl4ai.com/core/deep-crawling/), [Filter Documentation](https://docs.crawl4ai.com/core/content-selection/), [Scoring Strategies](https://docs.crawl4ai.com/advanced/advanced-features/)

View File

@@ -0,0 +1,826 @@
## Docker Deployment
Complete Docker deployment guide with pre-built images, API endpoints, configuration, and MCP integration.
### Quick Start with Pre-built Images
```bash
# Pull latest image
docker pull unclecode/crawl4ai:latest
# Setup LLM API keys
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
GROQ_API_KEY=your-groq-key
GEMINI_API_TOKEN=your-gemini-token
EOL
# Run with LLM support
docker run -d \
-p 11235:11235 \
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:latest
# Basic run (no LLM)
docker run -d \
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:latest
# Check health
curl http://localhost:11235/health
```
### Docker Compose Deployment
```bash
# Clone and setup
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys
# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d
# Build locally
docker compose up --build -d
# Build with all features
INSTALL_TYPE=all docker compose up --build -d
# Build with GPU support
ENABLE_GPU=true docker compose up --build -d
# Stop service
docker compose down
```
### Manual Build with Multi-Architecture
```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build for current architecture
docker buildx build -t crawl4ai-local:latest --load .
# Build for multiple architectures
docker buildx build --platform linux/amd64,linux/arm64 \
-t crawl4ai-local:latest --load .
# Build with specific features
docker buildx build \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=false \
-t crawl4ai-local:latest --load .
# Run custom build
docker run -d \
-p 11235:11235 \
--name crawl4ai-custom \
--env-file .llm.env \
--shm-size=1g \
crawl4ai-local:latest
```
### Build Arguments
```bash
# Available build options:
#   INSTALL_TYPE: default | all | torch | transformer
#   ENABLE_GPU:   true | false
#   APP_HOME:     install path inside the image
#   USE_LOCAL:    build from local source (true | false)
#   GITHUB_REPO / GITHUB_BRANCH: used when USE_LOCAL=false
docker buildx build \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
--build-arg APP_HOME=/app \
--build-arg USE_LOCAL=true \
--build-arg GITHUB_REPO=https://github.com/unclecode/crawl4ai.git \
--build-arg GITHUB_BRANCH=main \
-t crawl4ai-custom:latest --load .
```
### Core API Endpoints
```python
# Main crawling endpoints
import requests
import json
# Basic crawl
payload = {
"urls": ["https://example.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
# Streaming crawl
payload["crawler_config"]["params"]["stream"] = True
response = requests.post("http://localhost:11235/crawl/stream", json=payload)
# Health check
response = requests.get("http://localhost:11235/health")
# API schema
response = requests.get("http://localhost:11235/schema")
# Metrics (Prometheus format)
response = requests.get("http://localhost:11235/metrics")
```
### Specialized Endpoints
```python
# HTML extraction (preprocessed for schema)
response = requests.post("http://localhost:11235/html",
json={"url": "https://example.com"})
# Screenshot capture
response = requests.post("http://localhost:11235/screenshot", json={
"url": "https://example.com",
"screenshot_wait_for": 2,
"output_path": "/path/to/save/screenshot.png"
})
# PDF generation
response = requests.post("http://localhost:11235/pdf", json={
"url": "https://example.com",
"output_path": "/path/to/save/document.pdf"
})
# JavaScript execution
response = requests.post("http://localhost:11235/execute_js", json={
"url": "https://example.com",
"scripts": [
"return document.title",
"return Array.from(document.querySelectorAll('a')).map(a => a.href)"
]
})
# Markdown generation
response = requests.post("http://localhost:11235/md", json={
"url": "https://example.com",
"f": "fit", # raw|fit|bm25|llm
"q": "extract main content", # query for filtering
"c": "0" # cache: 0=bypass, 1=use
})
# LLM Q&A
response = requests.get("http://localhost:11235/llm/https://example.com?q=What is this page about?")
# Library context (for AI assistants)
response = requests.get("http://localhost:11235/ask", params={
"context_type": "all", # code|doc|all
"query": "how to use extraction strategies",
"score_ratio": 0.5,
"max_results": 20
})
```
### Python SDK Usage
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
# Non-streaming crawl
results = await client.crawl(
["https://example.com"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
print(f"Content length: {len(result.markdown)}")
# Streaming crawl
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
async for result in await client.crawl(
["https://example.com", "https://python.org"],
browser_config=BrowserConfig(headless=True),
crawler_config=stream_config
):
print(f"Streamed: {result.url} - {result.success}")
# Get API schema
schema = await client.get_schema()
print(f"Schema available: {bool(schema)}")
asyncio.run(main())
```
### Advanced API Configuration
```python
# Complex extraction with LLM
payload = {
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {
"headless": True,
"viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
}
},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": "env:OPENAI_API_KEY"
}
},
"schema": {
"type": "dict",
"value": {
"type": "object",
"properties": {
"title": {"type": "string"},
"content": {"type": "string"}
}
}
},
"instruction": "Extract title and main content"
}
},
"markdown_generator": {
"type": "DefaultMarkdownGenerator",
"params": {
"content_filter": {
"type": "PruningContentFilter",
"params": {"threshold": 0.6}
}
}
}
}
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
### CSS Extraction Strategy
```python
# CSS-based structured extraction
schema = {
"name": "ProductList",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
payload = {
"urls": ["https://example-shop.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {"type": "dict", "value": schema}
}
}
}
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
data = response.json()
extracted = json.loads(data["results"][0]["extracted_content"])
```
### MCP (Model Context Protocol) Integration
```bash
# Add Crawl4AI as MCP provider to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
# List MCP providers
claude mcp list
# Test MCP connection
python tests/mcp/test_mcp_socket.py
# Available MCP endpoints
# SSE: http://localhost:11235/mcp/sse
# WebSocket: ws://localhost:11235/mcp/ws
# Schema: http://localhost:11235/mcp/schema
```
Available MCP tools:
- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query Crawl4AI library context
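To see how these tools are exposed before wiring up a full MCP client, the schema endpoint listed above can be fetched directly. A minimal discovery sketch (the response layout is not documented here, so inspect the raw JSON first):

```python
import requests

# Fetch the MCP schema to discover available tools and their parameters
schema = requests.get("http://localhost:11235/mcp/schema").json()
print(schema)  # look for tool names such as "md", "screenshot", "crawl"
```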
### Configuration Management
```yaml
# config.yml structure
app:
title: "Crawl4AI API"
version: "1.0.0"
host: "0.0.0.0"
port: 11235
timeout_keep_alive: 300
llm:
provider: "openai/gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
security:
enabled: false
jwt_enabled: false
trusted_hosts: ["*"]
crawler:
memory_threshold_percent: 95.0
rate_limiter:
base_delay: [1.0, 2.0]
timeouts:
stream_init: 30.0
batch_process: 300.0
pool:
max_pages: 40
idle_ttl_sec: 1800
rate_limiting:
enabled: true
default_limit: "1000/minute"
storage_uri: "memory://"
logging:
level: "INFO"
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```
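Before mounting a custom config, a quick load-and-check in Python catches YAML typos early. A sketch using PyYAML, spot-checking key paths that mirror the structure above:

```python
import yaml

with open("my-config.yml") as f:
    cfg = yaml.safe_load(f)

# Spot-check a few keys against the documented structure
assert cfg["app"]["port"] == 11235
assert isinstance(cfg["crawler"]["pool"]["max_pages"], int)
print("config.yml looks structurally sane")
```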
### Custom Configuration Deployment
```bash
# Method 1: Mount custom config
docker run -d -p 11235:11235 \
--name crawl4ai-custom \
--env-file .llm.env \
--shm-size=1g \
-v $(pwd)/my-config.yml:/app/config.yml \
unclecode/crawl4ai:latest
# Method 2: Build with custom config
# Edit deploy/docker/config.yml then build
docker buildx build -t crawl4ai-custom:latest --load .
```
### Monitoring and Health Checks
```bash
# Health endpoint
curl http://localhost:11235/health
# Prometheus metrics
curl http://localhost:11235/metrics
# Configuration validation
curl -X POST http://localhost:11235/config/dump \
-H "Content-Type: application/json" \
-d '{"code": "CrawlerRunConfig(cache_mode=\"BYPASS\", screenshot=True)"}'
```
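For scripted deployments, the same `/health` endpoint can gate startup. A minimal readiness poll:

```python
import time
import requests

def wait_until_healthy(base_url="http://localhost:11235", timeout=60):
    """Poll /health until the server responds OK or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).ok:
                return True
        except requests.RequestException:
            pass  # container still starting
        time.sleep(2)
    return False

print("ready" if wait_until_healthy() else "server never became healthy")
```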
### Playground Interface
Access the interactive playground at `http://localhost:11235/playground` for:
- Testing configurations with visual interface
- Generating JSON payloads for REST API
- Converting Python config to JSON format
- Testing crawl operations directly in browser
### Async Job Processing
```python
# Submit job for async processing
import time
import requests

# Submit crawl job (reuses a `payload` dict from the examples above)
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()["task_id"]
# Poll for completion
while True:
result = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
status = result.json()
if status["status"] in ["COMPLETED", "FAILED"]:
break
time.sleep(1.5)
print("Final result:", status)
```
### Production Deployment
```bash
# Production-ready deployment
docker run -d \
--name crawl4ai-prod \
--restart unless-stopped \
-p 11235:11235 \
--env-file .llm.env \
--shm-size=2g \
--memory=8g \
--cpus=4 \
-v /path/to/custom-config.yml:/app/config.yml \
unclecode/crawl4ai:latest
```

With Docker Compose for production:

```yaml
version: '3.8'
services:
crawl4ai:
image: unclecode/crawl4ai:latest
ports:
- "11235:11235"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
volumes:
- ./config.yml:/app/config.yml
shm_size: 2g
deploy:
resources:
limits:
memory: 8G
cpus: '4'
restart: unless-stopped
```
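A container healthcheck pairs well with `restart: unless-stopped`. A sketch, assuming `curl` is available inside the image:

```yaml
# Add under the crawl4ai service in docker-compose.yml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
  interval: 30s
  timeout: 5s
  retries: 3
```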
### Configuration Validation and JSON Structure
```python
# Method 1: Create config objects and dump to see expected JSON structure
from crawl4ai import BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy, LLMExtractionStrategy
import json
# Create browser config and see JSON structure
browser_config = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@proxy:8080"
)
# Get JSON structure
browser_json = browser_config.dump()
print("BrowserConfig JSON structure:")
print(json.dumps(browser_json, indent=2))
# Create crawler config with extraction strategy
schema = {
"name": "Articles",
"baseSelector": ".article",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"}
]
}
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=["window.scrollTo(0, document.body.scrollHeight);"],
wait_for="css:.loaded"
)
crawler_json = crawler_config.dump()
print("\nCrawlerRunConfig JSON structure:")
print(json.dumps(crawler_json, indent=2))
```
### Reverse Validation - JSON to Objects
```python
# Method 2: Load JSON back to config objects for validation
from crawl4ai.async_configs import from_serializable_dict
# Test JSON structure by converting back to objects
test_browser_json = {
"type": "BrowserConfig",
"params": {
"headless": True,
"viewport_width": 1280,
"proxy": "http://user:pass@proxy:8080"
}
}
try:
# Convert JSON back to object
restored_browser = from_serializable_dict(test_browser_json)
print(f"✅ Valid BrowserConfig: {type(restored_browser)}")
print(f"Headless: {restored_browser.headless}")
print(f"Proxy: {restored_browser.proxy}")
except Exception as e:
print(f"❌ Invalid BrowserConfig JSON: {e}")
# Test complex crawler config JSON
test_crawler_json = {
"type": "CrawlerRunConfig",
"params": {
"cache_mode": "bypass",
"screenshot": True,
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {
"type": "dict",
"value": {
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h3", "type": "text"}
]
}
}
}
}
}
}
try:
restored_crawler = from_serializable_dict(test_crawler_json)
print(f"✅ Valid CrawlerRunConfig: {type(restored_crawler)}")
print(f"Cache mode: {restored_crawler.cache_mode}")
print(f"Has extraction strategy: {restored_crawler.extraction_strategy is not None}")
except Exception as e:
print(f"❌ Invalid CrawlerRunConfig JSON: {e}")
```
### Using Server's /config/dump Endpoint for Validation
```python
import requests
import json
# Method 3: Use server endpoint to validate configuration syntax
def validate_config_with_server(config_code: str) -> dict:
"""Validate configuration using server's /config/dump endpoint"""
response = requests.post(
"http://localhost:11235/config/dump",
json={"code": config_code}
)
if response.status_code == 200:
print("✅ Valid configuration syntax")
return response.json()
else:
print(f"❌ Invalid configuration: {response.status_code}")
print(response.json())
return None
# Test valid configuration
valid_config = """
CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
js_code=["window.scrollTo(0, document.body.scrollHeight);"],
wait_for="css:.content-loaded"
)
"""
result = validate_config_with_server(valid_config)
if result:
print("Generated JSON structure:")
print(json.dumps(result, indent=2))
# Test invalid configuration (should fail)
invalid_config = """
CrawlerRunConfig(
cache_mode="invalid_mode",
screenshot=True,
js_code=some_function() # This will fail
)
"""
validate_config_with_server(invalid_config)
```
### Configuration Builder Helper
```python
import json
import requests
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.async_configs import from_serializable_dict

def build_and_validate_request(urls, browser_params=None, crawler_params=None):
"""Helper to build and validate complete request payload"""
# Create configurations
browser_config = BrowserConfig(**(browser_params or {}))
crawler_config = CrawlerRunConfig(**(crawler_params or {}))
# Build complete request payload
payload = {
"urls": urls if isinstance(urls, list) else [urls],
"browser_config": browser_config.dump(),
"crawler_config": crawler_config.dump()
}
print("✅ Complete request payload:")
print(json.dumps(payload, indent=2))
# Validate by attempting to reconstruct
try:
test_browser = from_serializable_dict(payload["browser_config"])
test_crawler = from_serializable_dict(payload["crawler_config"])
print("✅ Payload validation successful")
return payload
except Exception as e:
print(f"❌ Payload validation failed: {e}")
return None
# Example usage
payload = build_and_validate_request(
urls=["https://example.com"],
browser_params={"headless": True, "viewport_width": 1280},
crawler_params={
"cache_mode": CacheMode.BYPASS,
"screenshot": True,
"word_count_threshold": 10
}
)
if payload:
# Send to server
response = requests.post("http://localhost:11235/crawl", json=payload)
print(f"Server response: {response.status_code}")
```
### Common JSON Structure Patterns
```python
# Pattern 1: Simple primitive values
simple_config = {
"type": "CrawlerRunConfig",
"params": {
"cache_mode": "bypass", # String enum value
"screenshot": True, # Boolean
"page_timeout": 60000 # Integer
}
}
# Pattern 2: Nested objects
nested_config = {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": "env:OPENAI_API_KEY"
}
},
"instruction": "Extract main content"
}
}
}
}
# Pattern 3: Dictionary values (must use type: dict wrapper)
dict_config = {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {
"type": "dict", # Required wrapper
"value": { # Actual dictionary content
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"}
]
}
}
}
}
}
}
# Pattern 4: Lists and arrays
list_config = {
"type": "CrawlerRunConfig",
"params": {
"js_code": [ # Lists are handled directly
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more')?.click();"
],
"excluded_tags": ["script", "style", "nav"]
}
}
```
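Pattern 3's wrapper is the one most often forgotten when building payloads by hand. A tiny helper (a sketch, not part of the library) keeps it consistent:

```python
def wrap_dict(value: dict) -> dict:
    """Wrap a raw dictionary in the {"type": "dict", "value": ...} envelope."""
    return {"type": "dict", "value": value}

# Usage: schema dictionaries inside strategy params
params = {"schema": wrap_dict({"name": "Products", "baseSelector": ".product"})}
```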
### Troubleshooting Common JSON Errors
```python
def diagnose_json_errors():
"""Common JSON structure errors and fixes"""
# ❌ WRONG: Missing type wrapper for objects
wrong_config = {
"browser_config": {
"headless": True # Missing type wrapper
}
}
# ✅ CORRECT: Proper type wrapper
correct_config = {
"browser_config": {
"type": "BrowserConfig",
"params": {
"headless": True
}
}
}
# ❌ WRONG: Dictionary without type: dict wrapper
wrong_dict = {
"schema": {
"name": "Products" # Raw dict, should be wrapped
}
}
# ✅ CORRECT: Dictionary with proper wrapper
correct_dict = {
"schema": {
"type": "dict",
"value": {
"name": "Products"
}
}
}
# ❌ WRONG: Invalid enum string
wrong_enum = {
"cache_mode": "DISABLED" # Wrong case/value
}
# ✅ CORRECT: Valid enum string
correct_enum = {
"cache_mode": "bypass" # or "enabled", "disabled", etc.
}
print("Common error patterns documented above")
# Validate your JSON structure before sending
def pre_flight_check(payload):
"""Run checks before sending to server"""
required_keys = ["urls", "browser_config", "crawler_config"]
for key in required_keys:
if key not in payload:
print(f"❌ Missing required key: {key}")
return False
# Check type wrappers
for config_key in ["browser_config", "crawler_config"]:
config = payload[config_key]
if not isinstance(config, dict) or "type" not in config:
print(f"❌ {config_key} missing type wrapper")
return False
if "params" not in config:
print(f"❌ {config_key} missing params")
return False
print("✅ Pre-flight check passed")
return True
# Example usage
payload = {
"urls": ["https://example.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
if pre_flight_check(payload):
# Safe to send to server
pass
```
**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Configuration Options](https://docs.crawl4ai.com/core/docker-deployment/#server-configuration)

View File

@@ -0,0 +1,903 @@
## LLM Extraction Strategies - The Last Resort
**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).
### ⚠️ STOP: Are You Sure You Need LLM?
**99% of developers who think they need LLM extraction are wrong.** Before reading further:
### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**
### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context
### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000
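The gap compounds quickly at scale; a back-of-envelope calculator makes the trade-off concrete (rates are the rough figures above, not exact provider pricing):

```python
def crawl_cost(pages: int, cost_per_page: float) -> float:
    """Rough total cost for a crawl at a flat per-page rate."""
    return pages * cost_per_page

pages = 10_000
print(f"non-LLM:  ${crawl_cost(pages, 0.000001):,.2f}")  # ~$0.01
print(f"LLM low:  ${crawl_cost(pages, 0.01):,.2f}")      # ~$100
print(f"LLM high: ${crawl_cost(pages, 0.10):,.2f}")      # ~$1,000
```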
---
## 1. When LLM Extraction is Justified
### Scenario 1: Truly Unstructured Content Analysis
```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
class SentimentAnalysis(BaseModel):
"""Use LLM when you need semantic understanding"""
overall_sentiment: str = Field(description="positive, negative, or neutral")
confidence_score: float = Field(description="Confidence from 0-1")
key_themes: List[str] = Field(description="Main topics discussed")
emotional_indicators: List[str] = Field(description="Words indicating emotion")
summary: str = Field(description="Brief summary of the content")
llm_config = LLMConfig(
provider="openai/gpt-4o-mini", # Use cheapest model
api_token="env:OPENAI_API_KEY",
temperature=0.1, # Low temperature for consistency
max_tokens=1000
)
sentiment_strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=SentimentAnalysis.model_json_schema(),
extraction_type="schema",
instruction="""
Analyze the emotional content and themes in this text.
Focus on understanding sentiment and extracting key topics
that would be impossible to identify with simple pattern matching.
""",
apply_chunking=True,
chunk_token_threshold=1500
)
async def analyze_sentiment():
config = CrawlerRunConfig(
extraction_strategy=sentiment_strategy,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/customer-reviews",
config=config
)
if result.success:
analysis = json.loads(result.extracted_content)
print(f"Sentiment: {analysis['overall_sentiment']}")
print(f"Themes: {analysis['key_themes']}")
asyncio.run(analyze_sentiment())
```
### Scenario 2: Complex Knowledge Extraction
```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
name: str = Field(description="Entity name")
type: str = Field(description="person, organization, location, concept")
description: str = Field(description="Brief description")
class Relationship(BaseModel):
source: str = Field(description="Source entity")
target: str = Field(description="Target entity")
relationship: str = Field(description="Type of relationship")
confidence: float = Field(description="Confidence score 0-1")
class KnowledgeGraph(BaseModel):
entities: List[Entity] = Field(description="All entities found")
relationships: List[Relationship] = Field(description="Relationships between entities")
main_topic: str = Field(description="Primary topic of the content")
knowledge_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620", # Better for complex reasoning
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000
),
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""
Extract entities and their relationships from the content.
Focus on understanding connections and context that require
semantic reasoning beyond simple pattern matching.
""",
input_format="html", # Preserve structure
apply_chunking=True
)
```
### Scenario 3: Content Summarization and Insights
```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
title: str = Field(description="Paper title")
abstract_summary: str = Field(description="Summary of abstract")
key_findings: List[str] = Field(description="Main research findings")
methodology: str = Field(description="Research methodology used")
limitations: List[str] = Field(description="Study limitations")
practical_applications: List[str] = Field(description="Real-world applications")
citations_count: int = Field(description="Number of citations", default=0)
research_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="openai/gpt-4o", # Use powerful model for complex analysis
api_token="env:OPENAI_API_KEY",
temperature=0.2,
max_tokens=2000
),
schema=ResearchInsights.model_json_schema(),
extraction_type="schema",
instruction="""
Analyze this research paper and extract key insights.
Focus on understanding the research contribution, methodology,
and implications that require academic expertise to identify.
""",
apply_chunking=True,
chunk_token_threshold=2000,
overlap_rate=0.15 # More overlap for academic content
)
```
---
## 2. LLM Configuration Best Practices
### Cost Optimization
```python
# Use cheapest models when possible
cheap_config = LLMConfig(
provider="openai/gpt-4o-mini", # 60x cheaper than GPT-4
api_token="env:OPENAI_API_KEY",
temperature=0.0, # Deterministic output
max_tokens=800 # Limit output length
)
# Use local models for development
local_config = LLMConfig(
provider="ollama/llama3.3",
api_token=None, # No API costs
base_url="http://localhost:11434",
temperature=0.1
)
# Use powerful models only when necessary
powerful_config = LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620",
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000,
temperature=0.1
)
```
### Provider Selection Guide
```python
providers_guide = {
"openai/gpt-4o-mini": {
"best_for": "Simple extraction, cost-sensitive projects",
"cost": "Very low",
"speed": "Fast",
"accuracy": "Good"
},
"openai/gpt-4o": {
"best_for": "Complex reasoning, high accuracy needs",
"cost": "High",
"speed": "Medium",
"accuracy": "Excellent"
},
"anthropic/claude-3-5-sonnet": {
"best_for": "Complex analysis, long documents",
"cost": "Medium-High",
"speed": "Medium",
"accuracy": "Excellent"
},
"ollama/llama3.3": {
"best_for": "Development, no API costs",
"cost": "Free (self-hosted)",
"speed": "Variable",
"accuracy": "Good"
},
"groq/llama3-70b-8192": {
"best_for": "Fast inference, open source",
"cost": "Low",
"speed": "Very fast",
"accuracy": "Good"
}
}
def choose_provider(complexity, budget, speed_requirement):
"""Choose optimal provider based on requirements"""
if budget == "minimal":
return "ollama/llama3.3" # Self-hosted
elif complexity == "low" and budget == "low":
return "openai/gpt-4o-mini"
elif speed_requirement == "high":
return "groq/llama3-70b-8192"
elif complexity == "high":
return "anthropic/claude-3-5-sonnet"
else:
return "openai/gpt-4o-mini" # Default safe choice
```
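A quick usage example:

```python
# Modest accuracy needs on a tight budget
print(choose_provider(complexity="low", budget="low", speed_requirement="normal"))
# -> openai/gpt-4o-mini
```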
---
## 3. Advanced LLM Extraction Patterns
### Block-Based Extraction (Unstructured Content)
```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block", # Extract free-form content blocks
instruction="""
Extract meaningful content blocks from this page.
Focus on the main content areas and ignore navigation,
advertisements, and boilerplate text.
""",
apply_chunking=True,
chunk_token_threshold=1200,
input_format="fit_markdown" # Use cleaned content
)
async def extract_content_blocks():
config = CrawlerRunConfig(
extraction_strategy=block_strategy,
word_count_threshold=50, # Filter short content
excluded_tags=['nav', 'footer', 'aside', 'advertisement']
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/article",
config=config
)
if result.success:
blocks = json.loads(result.extracted_content)
for block in blocks:
print(f"Block: {block['content'][:100]}...")
```
### Chunked Processing for Large Content
```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(
provider="openai/gpt-4o-mini",
api_token="env:OPENAI_API_KEY"
),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract structured data from this content section...",
# Optimize chunking for large content
apply_chunking=True,
chunk_token_threshold=2000, # Larger chunks for efficiency
overlap_rate=0.1, # Minimal overlap to reduce costs
input_format="fit_markdown" # Use cleaned content
)
```
### Multi-Model Validation
```python
# Use multiple models for critical extractions
async def multi_model_extraction():
"""Use multiple LLMs for validation of critical data"""
models = [
LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
LLMConfig(provider="ollama/llama3.3", api_token=None)
]
results = []
for i, llm_config in enumerate(models):
strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data consistently..."
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success:
data = json.loads(result.extracted_content)
results.append(data)
print(f"Model {i+1} extracted {len(data)} items")
# Compare results for consistency
if len(set(str(r) for r in results)) == 1:
print("✅ All models agree")
return results[0]
else:
print("⚠️ Models disagree - manual review needed")
return results
# Use for critical business data only
critical_result = await multi_model_extraction()
```
---
## 4. Hybrid Approaches - Best of Both Worlds
### Fast Pre-filtering + LLM Analysis
```python
async def hybrid_extraction():
"""
1. Use fast non-LLM strategies for basic extraction
2. Use LLM only for complex analysis of filtered content
"""
# Step 1: Fast extraction of structured data
basic_schema = {
"name": "Articles",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1, h2", "type": "text"},
{"name": "content", "selector": ".content", "type": "text"},
{"name": "author", "selector": ".author", "type": "text"}
]
}
basic_strategy = JsonCssExtractionStrategy(basic_schema)
basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)
# Step 2: LLM analysis only on filtered content
analysis_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
schema={
"type": "object",
"properties": {
"sentiment": {"type": "string"},
"key_topics": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"}
}
},
extraction_type="schema",
instruction="Analyze sentiment and extract key topics from this article"
)
async with AsyncWebCrawler() as crawler:
# Fast extraction first
basic_result = await crawler.arun(
url="https://example.com/articles",
config=basic_config
)
articles = json.loads(basic_result.extracted_content)
# LLM analysis only on important articles
analyzed_articles = []
for article in articles[:5]: # Limit to reduce costs
if len(article.get('content', '')) > 500: # Only analyze substantial content
analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)
# Analyze individual article content
raw_url = f"raw://{article['content']}"
analysis_result = await crawler.arun(url=raw_url, config=analysis_config)
if analysis_result.success:
analysis = json.loads(analysis_result.extracted_content)
article.update(analysis)
analyzed_articles.append(article)
return analyzed_articles
# Hybrid approach: fast + smart
result = await hybrid_extraction()
```
### Schema Generation + LLM Fallback
```python
import json
from pathlib import Path

async def smart_fallback_extraction():
"""
1. Try generate_schema() first (one-time LLM cost)
2. Use generated schema for fast extraction
3. Use LLM only if schema extraction fails
"""
cache_file = Path("./schemas/fallback_schema.json")
# Try cached schema first
if cache_file.exists():
schema = json.load(cache_file.open())
schema_strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=schema_strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
if data: # Schema worked
print("✅ Schema extraction successful (fast & cheap)")
return data
# Fallback to LLM if schema failed
print("⚠️ Schema failed, falling back to LLM (slow & expensive)")
llm_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="Extract all meaningful data from this page"
)
    fallback_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=fallback_run_config)
if result.success:
print("✅ LLM extraction successful")
return json.loads(result.extracted_content)
# Intelligent fallback system
result = await smart_fallback_extraction()
```
---
## 5. Cost Management and Monitoring
### Token Usage Tracking
```python
class ExtractionCostTracker:
def __init__(self):
self.total_cost = 0.0
self.total_tokens = 0
self.extractions = 0
def track_llm_extraction(self, strategy, result):
"""Track costs from LLM extraction"""
if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
usage = strategy.usage_tracker
# Estimate costs (approximate rates)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
            }
            # provider strings look like "vendor/model", e.g. "openai/gpt-4o-mini"
            vendor, model = strategy.llm_config.provider.split('/', 1)
            # self-hosted Ollama is free; unknown models fall back to a rough default
            rate = 0.0 if vendor == "ollama" else cost_per_1k_tokens.get(model, 0.01)
tokens = usage.total_tokens
cost = (tokens / 1000) * rate
self.total_cost += cost
self.total_tokens += tokens
self.extractions += 1
print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")
def get_summary(self):
avg_cost = self.total_cost / max(self.extractions, 1)
return {
"total_cost": self.total_cost,
"total_tokens": self.total_tokens,
"extractions": self.extractions,
"avg_cost_per_extraction": avg_cost
}
# Usage
tracker = ExtractionCostTracker()
async def cost_aware_extraction():
strategy = LLMExtractionStrategy(
llm_config=cheap_config,
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data...",
verbose=True # Enable usage tracking
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
# Track costs
tracker.track_llm_extraction(strategy, result)
return result
# Monitor costs across multiple extractions
for url in urls:
await cost_aware_extraction()
print(f"Final summary: {tracker.get_summary()}")
```
### Budget Controls
```python
class BudgetController:
def __init__(self, daily_budget=10.0):
self.daily_budget = daily_budget
self.current_spend = 0.0
self.extraction_count = 0
def can_extract(self, estimated_cost=0.01):
"""Check if extraction is within budget"""
if self.current_spend + estimated_cost > self.daily_budget:
print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
return False
return True
def record_extraction(self, actual_cost):
"""Record actual extraction cost"""
self.current_spend += actual_cost
self.extraction_count += 1
remaining = self.daily_budget - self.current_spend
print(f"💰 Budget remaining: ${remaining:.2f}")
budget = BudgetController(daily_budget=5.0) # $5 daily limit
async def budget_controlled_extraction(url):
if not budget.can_extract():
print("⏸️ Extraction paused due to budget limit")
return None
    # Proceed with extraction (extract_with_strategy and calculate_cost
    # are app-specific placeholders, not library functions)
    strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
    result = await extract_with_strategy(url, strategy)

    # Record actual cost
    actual_cost = calculate_cost(strategy.usage_tracker)
    budget.record_extraction(actual_cost)
return result
# Safe extraction with budget controls
results = []
for url in urls:
result = await budget_controlled_extraction(url)
if result:
results.append(result)
```
---
## 6. Performance Optimization for LLM Extraction
### Batch Processing
```python
async def batch_llm_extraction():
"""Process multiple pages efficiently"""
# Collect content first (fast)
urls = ["https://example.com/page1", "https://example.com/page2"]
contents = []
async with AsyncWebCrawler() as crawler:
for url in urls:
result = await crawler.arun(url=url)
if result.success:
contents.append({
"url": url,
"content": result.fit_markdown[:2000] # Limit content
})
# Process in batches (reduce LLM calls)
batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
f"URL: {c['url']}\n{c['content']}" for c in contents
])
strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="""
Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
Return results for each page in order.
""",
apply_chunking=True
)
    # Single LLM call for multiple pages (re-open the crawler: the earlier
    # context manager has already closed it)
    raw_url = f"raw://{batch_content}"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=CrawlerRunConfig(extraction_strategy=strategy)
        )
    return json.loads(result.extracted_content)
# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```
### Caching LLM Results
```python
import hashlib
from pathlib import Path
class LLMResultCache:
def __init__(self, cache_dir="./llm_cache"):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
def get_cache_key(self, url, instruction, schema):
"""Generate cache key from extraction parameters"""
content = f"{url}:{instruction}:{str(schema)}"
return hashlib.md5(content.encode()).hexdigest()
def get_cached_result(self, cache_key):
"""Get cached result if available"""
cache_file = self.cache_dir / f"{cache_key}.json"
if cache_file.exists():
return json.load(cache_file.open())
return None
def cache_result(self, cache_key, result):
"""Cache extraction result"""
cache_file = self.cache_dir / f"{cache_key}.json"
json.dump(result, cache_file.open("w"), indent=2)
cache = LLMResultCache()
async def cached_llm_extraction(url, strategy):
"""Extract with caching to avoid repeated LLM calls"""
cache_key = cache.get_cache_key(
url,
strategy.instruction,
str(strategy.schema)
)
# Check cache first
cached_result = cache.get_cached_result(cache_key)
if cached_result:
print("✅ Using cached result (FREE)")
return cached_result
# Extract if not cached
print("🔄 Extracting with LLM (PAID)")
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url, config=config)
if result.success:
data = json.loads(result.extracted_content)
cache.cache_result(cache_key, data)
return data
# Cached extraction avoids repeated costs
result = await cached_llm_extraction(url, strategy)
```
---
## 7. Error Handling and Quality Control
### Validation and Retry Logic
```python
async def robust_llm_extraction():
"""Implement validation and retry for LLM extraction"""
max_retries = 3
strategies = [
# Try cheap model first
LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data accurately..."
),
# Fallback to better model
LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract data with high accuracy..."
)
]
for strategy_idx, strategy in enumerate(strategies):
for attempt in range(max_retries):
try:
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
# Validate result quality
if validate_extraction_quality(data):
print(f"✅ Success with strategy {strategy_idx+1}, attempt {attempt+1}")
return data
else:
print(f"⚠️ Poor quality result, retrying...")
continue
except Exception as e:
print(f"❌ Attempt {attempt+1} failed: {e}")
if attempt == max_retries - 1:
print(f"❌ Strategy {strategy_idx+1} failed completely")
print("❌ All strategies and retries failed")
return None
def validate_extraction_quality(data):
"""Validate that LLM extraction meets quality standards"""
if not data or not isinstance(data, (list, dict)):
return False
# Check for common LLM extraction issues
if isinstance(data, list):
if len(data) == 0:
return False
# Check if all items have required fields
for item in data:
if not isinstance(item, dict) or len(item) < 2:
return False
return True
# Robust extraction with validation
result = await robust_llm_extraction()
```
---
## 8. Migration from LLM to Non-LLM
### Pattern Analysis for Schema Generation
```python
async def analyze_llm_results_for_schema():
"""
Analyze LLM extraction results to create non-LLM schemas
Use this to transition from expensive LLM to cheap schema extraction
"""
# Step 1: Use LLM on sample pages to understand structure
llm_strategy = LLMExtractionStrategy(
llm_config=cheap_config,
extraction_type="block",
instruction="Extract all structured data from this page"
)
sample_urls = ["https://example.com/page1", "https://example.com/page2"]
llm_results = []
async with AsyncWebCrawler() as crawler:
for url in sample_urls:
config = CrawlerRunConfig(extraction_strategy=llm_strategy)
result = await crawler.arun(url=url, config=config)
if result.success:
llm_results.append({
"url": url,
"html": result.cleaned_html,
"extracted": json.loads(result.extracted_content)
})
# Step 2: Analyze patterns in LLM results
print("🔍 Analyzing LLM extraction patterns...")
# Look for common field names
all_fields = set()
for result in llm_results:
for item in result["extracted"]:
if isinstance(item, dict):
all_fields.update(item.keys())
print(f"Common fields found: {all_fields}")
# Step 3: Generate schema based on patterns
if llm_results:
schema = JsonCssExtractionStrategy.generate_schema(
html=llm_results[0]["html"],
target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
llm_config=cheap_config
)
# Save schema for future use
with open("generated_schema.json", "w") as f:
json.dump(schema, f, indent=2)
print("✅ Schema generated from LLM analysis")
return schema
# Generate schema from LLM patterns, then use schema for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```
---
## 9. Summary: When LLM is Actually Needed
### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities
### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information
### 💡 Decision Framework:
```python
def should_use_llm(extraction_task):
# Ask these questions in order:
questions = [
"Can I identify repeating HTML patterns?", # No → Consider LLM
"Am I extracting simple data types?", # Yes → Use Regex
"Does the structure vary dramatically?", # No → Use CSS/XPath
"Do I need semantic understanding?", # Yes → Maybe LLM
"Have I tried generate_schema()?" # No → Try that first
]
    # Only use LLM if all three hold (placeholder predicates;
    # implement these checks for your own task):
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```
### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try generate_schema()** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction
Remember: **LLM extraction should be your last resort, not your first choice.**
---
**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient
