Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update
docs/md_v2/advanced/adaptive-strategies.md
# Advanced Adaptive Strategies

## Overview

While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.

## The Three-Layer Scoring System

### 1. Coverage Score

Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.

#### Mathematical Foundation

```python
Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|

where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
```

#### Components

- **Document Coverage**: Percentage of documents containing the term
- **Frequency Boost**: Logarithmic bonus for term frequency
- **Query Decomposition**: Handles multi-word queries intelligently
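The formula above can be sketched in plain Python. Note that `doc_coverage` and the logarithmic `freq_boost` below are illustrative stand-ins for the library's internal scoring, not its exact implementation:

```python
import math

def coverage_score(docs: list[str], query: str) -> float:
    """Sketch of Coverage(K, Q): average per-term score over query terms."""
    terms = query.lower().split()
    if not docs or not terms:
        return 0.0
    total = 0.0
    for t in terms:
        containing = sum(1 for d in docs if t in d.lower())
        doc_coverage = containing / len(docs)   # fraction of docs containing the term
        freq = sum(d.lower().count(t) for d in docs)
        freq_boost = math.log1p(freq) / 10      # small logarithmic bonus
        total += doc_coverage * (1 + freq_boost)
    return total / len(terms)

docs = ["adaptive crawling tutorial", "crawling strategies overview"]
print(coverage_score(docs, "adaptive crawling"))
```

A term found in every document with high frequency pushes the score toward (and slightly above) 1.0 per term; a missing term contributes 0.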
#### Tuning Coverage

```python
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,  # Lower threshold
    top_k_links=2              # More focused
)
```

### 2. Consistency Score

Consistency evaluates whether the information across pages is coherent and non-contradictory.

#### How It Works

1. Extracts key statements from each document
2. Compares statements across documents
3. Measures agreement vs. contradiction
4. Returns a normalized score (0-1)
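The statement-level comparison is internal to the library; as a rough sketch of the agreement idea, average pairwise term overlap (Jaccard similarity) works as a stand-in proxy:

```python
def consistency_score(docs: list[str]) -> float:
    """Sketch: average pairwise Jaccard overlap as an agreement proxy."""
    sets = [set(d.lower().split()) for d in docs]
    if len(sets) < 2:
        return 1.0  # a single document cannot contradict itself
    scores = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            if union:
                scores.append(len(sets[i] & sets[j]) / len(union))
    return sum(scores) / len(scores)

print(consistency_score(["the api returns json", "the api returns json data"]))
```

Documents that largely repeat each other score near 1; disjoint documents score near 0. The real implementation compares extracted statements rather than raw term sets.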
#### Practical Impact

- **High consistency (>0.8)**: Information is reliable and coherent
- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
- **Low consistency (<0.5)**: Conflicting information; gather more sources

### 3. Saturation Score

Saturation detects when new pages stop providing novel information.

#### Detection Algorithm

```python
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30  # 60% of first
new_terms_page_3 = 15  # 50% of second
new_terms_page_4 = 5   # 33% of third
# Saturation detected: rapidly diminishing returns
```
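The trend above maps to a simple stopping rule. The gain metric below is an illustrative simplification of how `min_gain_threshold` cuts off a crawl, not the library's exact formula:

```python
def is_saturated(new_terms_history: list[int], min_gain: float = 0.1) -> bool:
    """Stop when the latest page added fewer than `min_gain` of all known terms."""
    if len(new_terms_history) < 2:
        return False  # need at least two pages to judge a trend
    total_terms = sum(new_terms_history)
    gain = new_terms_history[-1] / total_terms if total_terms else 0.0
    return gain < min_gain

history = [50, 30, 15, 5]
print(is_saturated(history))
```

With the example history, page 4 contributes 5 of 100 known terms (5% gain), which falls below the 10% threshold, so the crawl stops.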
#### Configuration

```python
config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
)
```

## Link Ranking Algorithm

### Expected Information Gain

Each uncrawled link is scored based on:

```python
ExpectedGain(link) = Relevance × Novelty × Authority
```

#### 1. Relevance Scoring

Uses the BM25 algorithm on link preview text:

```python
relevance = BM25(link.preview_text, query)
```

Factors:
- Term frequency in preview
- Inverse document frequency
- Preview length normalization
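The three factors above are exactly what a textbook BM25 scorer combines. A compact sketch over link previews (parameter values `k1`/`b` are the conventional defaults; the crawler's internal scorer may differ):

```python
import math

def bm25_score(preview: str, query: str, corpus: list[str],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Sketch of BM25: score one preview against a query, given the preview corpus."""
    docs = [d.lower().split() for d in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    words = preview.lower().split()
    score = 0.0
    for term in query.lower().split():
        df = sum(1 for d in docs if term in d)              # document frequency
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        tf = words.count(term)                               # term frequency in preview
        denom = tf + k1 * (1 - b + b * len(words) / avg_len) # length normalization
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = ["adaptive crawling guide", "contact page", "pricing page"]
print(bm25_score("adaptive crawling guide", "adaptive crawling", corpus))
```

Previews that contain none of the query terms score 0; rare terms (high IDF) that appear in a short preview score highest.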
#### 2. Novelty Estimation

Measures how different the link appears from already-crawled content:

```python
novelty = 1 - max_similarity(preview, knowledge_base)
```

Prevents crawling duplicate or highly similar pages.
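A minimal sketch of the novelty idea, using Jaccard similarity as a stand-in for whatever similarity measure the crawler uses internally:

```python
def novelty(preview: str, knowledge_base: list[str]) -> float:
    """1 minus the maximum similarity to any already-crawled document."""
    p = set(preview.lower().split())
    best = 0.0
    for doc in knowledge_base:
        d = set(doc.lower().split())
        union = p | d
        if union:
            best = max(best, len(p & d) / len(union))
    return 1.0 - best

kb = ["install guide for the crawler", "api reference for crawling"]
print(novelty("completely unrelated preview text", kb))
```

A preview identical to a known document scores 0 (skip it); a preview sharing nothing with the knowledge base scores 1 (worth crawling).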
#### 3. Authority Calculation

URL structure and domain analysis:

```python
authority = f(domain_rank, url_depth, url_structure)
```

Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
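Putting the three factors together, ranking multiplies them per link. A sketch with a crude depth-based authority and hypothetical `relevance`/`novelty` callables (the real implementations live inside the crawler):

```python
from urllib.parse import urlparse

def authority(url: str) -> float:
    """Crude depth-based authority: fewer path segments -> higher score."""
    depth = len([p for p in urlparse(url).path.split("/") if p])
    return 1.0 / (1 + depth)

def rank_links(links, relevance, novelty):
    """Sort candidate links by ExpectedGain = Relevance x Novelty x Authority."""
    scored = [(relevance(u) * novelty(u) * authority(u), u) for u in links]
    return [u for _, u in sorted(scored, reverse=True)]

links = ["https://ex.com/docs/a/b/c", "https://ex.com/docs"]
print(rank_links(links, relevance=lambda u: 1.0, novelty=lambda u: 1.0))
```

With equal relevance and novelty, the shallower `/docs` URL outranks the deeper `/docs/a/b/c` URL, matching the "fewer slashes = higher authority" heuristic.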
### Custom Link Scoring

```python
class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score

        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%

        # Default scoring
        return 1.0

# Use with adaptive crawler
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
)
```
## Domain-Specific Configurations

### Technical Documentation

```python
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
```

Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth

### News & Articles

```python
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
```

Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoid duplicate stories)
- More links per page (explore different perspectives)

### E-commerce

```python
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
```

Rationale:
- Balanced threshold for product variations
- Focused link following (avoid endless product listings)
- Standard gain threshold

### Research & Academic

```python
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
```

Rationale:
- Very high threshold for completeness
- Many pages allowed for thorough research
- Very low gain threshold to capture references
## Performance Optimization

### Memory Management

```python
# For large crawls, persist state so runs can resume
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically prune the knowledge base
if len(state.knowledge_base) > 1000:
    # Keep only the most relevant documents
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
```

### Parallel Processing

```python
# Use multiple start points
start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
]

# Crawl in parallel
tasks = [
    adaptive.digest(url, query)
    for url in start_urls
]
results = await asyncio.gather(*tasks)
```
### Caching Strategy

```python
# Enable caching for repeated crawls; cache_mode is a CrawlerRunConfig
# option rather than a BrowserConfig one
run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
```
## Debugging & Analysis

### Enable Verbose Logging

```python
import logging

logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
```

### Analyze Crawl Patterns

```python
# After crawling
state = await adaptive.digest(start_url, query)

# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")

# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
```

### Export for Analysis

```python
# Export detailed metrics
import json

metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
}

with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
```
## Custom Strategies

### Implementing a Custom Strategy

```python
from typing import List

from crawl4ai.adaptive_crawler import BaseStrategy

class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
```

### Combining Strategies

```python
class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
```
## Best Practices

### 1. Start Conservative

Begin with default settings and adjust based on results:

```python
# Start with defaults
result = await adaptive.digest(url, query)

# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
```

### 2. Monitor Resource Usage

```python
import psutil

# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
```

### 3. Use Domain Knowledge

```python
# For API documentation
if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure

# For blogs
if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
```
### 4. Validate Results

```python
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)

# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()

for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)

coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")
```

## Next Steps

- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
- See [Performance Benchmarks](../benchmarks/adaptive-performance.md)
Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can capture both in a single run:

```python
import asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            config=run_config
        )

        if result.success:
            print(f"Screenshot data present: {result.screenshot is not None}")
            print(f"PDF data present: {result.pdf is not None}")

            # Save screenshot
            if result.screenshot:
                print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            else:
                print("[WARN] Screenshot data is None.")

            # Save PDF
            if result.pdf:
                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
            else:
                print("[WARN] PDF data is None.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
Many websites now load images **lazily** as you scroll. If you need to ensure they appear in the crawled output, consider:

2. **`scan_full_page`** – Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`** – Add small delays between scroll steps.

**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md). For sites with virtual scrolling (Twitter/Instagram style), see the [Virtual Scroll docs](virtual-scroll.md).

### Example: Ensuring Lazy Images Appear
3. **`max_session_permit`** (`int`, default: `10`)
   The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.

4. **`memory_wait_timeout`** (`float`, default: `600.0`)
   Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.

5. **`rate_limiter`** (`RateLimiter`, default: `None`)
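The parameters above plug directly into the dispatcher's constructor. A minimal configuration sketch (values are illustrative, and the exact import path and `RateLimiter` arguments should be checked against the dispatcher reference):

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # back off when system memory exceeds 90%
    max_session_permit=10,          # at most 10 concurrent crawl tasks
    memory_wait_timeout=600.0,      # raise MemoryError after 10 min over threshold
    rate_limiter=RateLimiter(base_delay=(1.0, 2.0)),  # optional politeness delays
)
```

The dispatcher is then passed to `arun_many` so batch crawls respect these limits.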
docs/md_v2/advanced/pdf-parsing.md
# PDF Processing Strategies

Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.

## `PDFCrawlerStrategy`

### Overview

`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.

### When to Use

Use `PDFCrawlerStrategy` when you need to:
- Process PDF files using the `AsyncWebCrawler`.
- Handle PDFs from both web URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
- Integrate PDF content extraction into a unified `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.

### Key Methods and Their Behavior

- **`__init__(self, logger: AsyncLogger = None)`**:
  - Initializes the strategy.
  - `logger`: An optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
- **`async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`**:
  - This method is called by the `AsyncWebCrawler` during the `arun` process.
  - It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`.
  - The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to the `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy).
  - It sets `response_headers` to indicate "application/pdf" and `status_code` to 200.
- **`async close(self)`**:
  - A method for cleaning up any resources used by the strategy. For `PDFCrawlerStrategy`, this is usually minimal.
- **`async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`**:
  - Enables asynchronous context management for the strategy, allowing it to be used with `async with`.
### Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    # Initialize the PDF crawler strategy
    pdf_crawler_strategy = PDFCrawlerStrategy()

    # PDFCrawlerStrategy is typically used in conjunction with PDFContentScrapingStrategy
    # The scraping strategy handles the actual PDF content extraction
    pdf_scraping_strategy = PDFContentScrapingStrategy()
    run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)

    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
        # Example with a remote PDF URL
        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"  # A public PDF from arXiv

        print(f"Attempting to process PDF: {pdf_url}")
        result = await crawler.arun(url=pdf_url, config=run_config)

        if result.success:
            print(f"Successfully processed PDF: {result.url}")
            print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
            # Further processing of result.markdown, result.media, etc.
            # would be done here, based on what PDFContentScrapingStrategy extracts.
            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
                print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
            else:
                print("No markdown (text) content extracted.")
        else:
            print(f"Failed to process PDF: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

### Pros and Cons

**Pros:**
- Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.
- Provides a consistent interface for specifying PDF sources (URLs or local paths).
- Abstracts the source handling, allowing a separate scraping strategy to focus on PDF content parsing.

**Cons:**
- Does not perform any PDF data extraction itself; it strictly relies on a compatible scraping strategy (like `PDFContentScrapingStrategy`) to process the PDF.
- Has limited utility on its own; most of its value comes from being paired with a PDF-specific content scraping strategy.

---
## `PDFContentScrapingStrategy`

### Overview

`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. This strategy uses the `NaivePDFProcessorStrategy` internally to perform the low-level PDF parsing.

### When to Use

Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
- Extract textual content page by page from a PDF document.
- Retrieve standard metadata embedded within the PDF (e.g., title, author, subject, creation date, page count).
- Optionally, extract images contained within the PDF pages. These images can be saved to a local directory or made available for further processing.
- Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).

### Key Configuration Attributes

When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:
- **`extract_images: bool = False`**: If `True`, the strategy will attempt to extract images from the PDF.
- **`save_images_locally: bool = False`**: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in the `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
- **`image_save_dir: str = None`**: Specifies the directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
- **`batch_size: int = 4`**: Defines how many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
- **`logger: AsyncLogger = None`**: An optional `AsyncLogger` instance for logging.

### Key Methods and Their Behavior

- **`__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`**:
  - Initializes the strategy with configurations for image handling, batch processing, and logging. It sets up an internal `NaivePDFProcessorStrategy` instance which performs the actual PDF parsing.
- **`scrap(self, url: str, html: str, **params) -> ScrapingResult`**:
  - This is the primary synchronous method called by the crawler (via `ascrap`) to process the PDF.
  - `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
  - `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
  - It first ensures the PDF is accessible locally (downloads it to a temporary file if `url` is remote).
  - It then uses its internal PDF processor to extract text, metadata, and images (if configured).
  - The extracted information is compiled into a `ScrapingResult` object:
    - `cleaned_html`: Contains an HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
    - `media`: A dictionary where `media["images"]` will contain information about extracted images if `extract_images` was `True`.
    - `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
    - `metadata`: A dictionary holding PDF metadata (e.g., title, author, num_pages).
- **`async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`**:
  - The asynchronous version of `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
- **`_get_pdf_path(self, url: str) -> str`**:
  - A private helper method to manage PDF file access. If the `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
|
||||
### Example Usage
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
|
||||
import os # For creating image directory
|
||||
|
||||
async def main():
|
||||
# Define the directory for saving extracted images
|
||||
image_output_dir = "./my_pdf_images"
|
||||
os.makedirs(image_output_dir, exist_ok=True)
|
||||
|
||||
# Configure the PDF content scraping strategy
|
||||
# Enable image extraction and specify where to save them
|
||||
pdf_scraping_cfg = PDFContentScrapingStrategy(
|
||||
extract_images=True,
|
||||
save_images_locally=True,
|
||||
image_save_dir=image_output_dir,
|
||||
batch_size=2 # Process 2 pages at a time for demonstration
|
||||
)
|
||||
|
||||
# The PDFCrawlerStrategy is needed to tell AsyncWebCrawler how to "crawl" a PDF
|
||||
pdf_crawler_cfg = PDFCrawlerStrategy()
|
||||
|
||||
# Configure the overall crawl run
|
||||
run_cfg = CrawlerRunConfig(
|
||||
scraping_strategy=pdf_scraping_cfg # Use our PDF scraping strategy
|
||||
)
|
||||
|
||||
# Initialize the crawler with the PDF-specific crawler strategy
|
||||
async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
|
||||
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # Example PDF
|
||||
|
||||
print(f"Starting PDF processing for: {pdf_url}")
|
||||
result = await crawler.arun(url=pdf_url, config=run_cfg)
|
||||
|
||||
if result.success:
|
||||
print("\n--- PDF Processing Successful ---")
|
||||
print(f"Processed URL: {result.url}")
|
||||
|
||||
print("\n--- Metadata ---")
|
||||
for key, value in result.metadata.items():
|
||||
print(f" {key.replace('_', ' ').title()}: {value}")
|
||||
|
||||
if result.markdown and hasattr(result.markdown, 'raw_markdown'):
|
||||
print(f"\n--- Extracted Text (Markdown Snippet) ---")
|
||||
print(result.markdown.raw_markdown[:500].strip() + "...")
|
||||
else:
|
||||
print("\nNo text (markdown) content extracted.")
|
||||
|
||||
if result.media and result.media.get("images"):
|
||||
print(f"\n--- Image Extraction ---")
|
||||
print(f"Extracted {len(result.media['images'])} image(s).")
|
||||
for i, img_info in enumerate(result.media["images"][:2]): # Show info for first 2 images
|
||||
print(f" Image {i+1}:")
|
||||
print(f" Page: {img_info.get('page')}")
|
||||
print(f" Format: {img_info.get('format', 'N/A')}")
|
||||
if img_info.get('path'):
|
||||
print(f" Saved at: {img_info.get('path')}")
|
||||
else:
|
||||
print("\nNo images were extracted (or extract_images was False).")
|
||||
else:
|
||||
print(f"\n--- PDF Processing Failed ---")
|
||||
print(f"Error: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
### Pros and Cons

**Pros:**
- Provides a comprehensive way to extract text, metadata, and (optionally) images from PDF documents.
- Handles both remote PDFs (via URL) and local PDF files.
- Configurable image extraction allows saving images to disk or accessing their data.
- Integrates smoothly with the `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.

**Cons:**
- Extraction quality and performance can vary significantly depending on the PDF's complexity, encoding, and whether it's image-based (scanned) or text-based.
- Image extraction can be resource-intensive (both CPU and disk space if `save_images_locally` is true).
- Relies on `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
- Link extraction from PDFs can be basic and depends on how hyperlinks are embedded in the document.
Use an authenticated proxy with `BrowserConfig`:

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(proxy="http://[username]:[password]@[host]:[port]")

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Rotating Proxies
|
||||
|
||||
Example using a proxy rotation service dynamically:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def get_next_proxy():
|
||||
# Your proxy rotation logic here
|
||||
return {"server": "http://next.proxy.com:8080"}
|
||||
|
||||
import re
|
||||
from crawl4ai import (
|
||||
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    ProxyConfig,
    RoundRobinProxyStrategy,
)
import asyncio
import re

async def main():
    # Load proxies and create rotation strategy
    proxies = ProxyConfig.from_env()
    # e.g.: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    if not proxies:
        print("No proxies found in environment. Set PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy
    )

    print("\n📈 Initializing crawler with proxy rotation...")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print("\n🚀 Starting batch crawl with proxy rotation...")
        results = await crawler.arun_many(
            urls=urls,
            config=run_config
        )
        for result in results:
            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None

                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                print("---")

if __name__ == "__main__":
    asyncio.run(main())
```

Alternatively, rotate proxies manually by cloning the run config per URL (this snippet assumes a `urls` list and a `get_next_proxy()` helper defined elsewhere):

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    # For each URL, create a new run config with a different proxy
    for url in urls:
        proxy = await get_next_proxy()
        # Clone the config and update proxy - this creates a new browser context
        current_config = run_config.clone(proxy_config=proxy)
        result = await crawler.arun(url=url, config=current_config)
```
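`ProxyConfig.from_env()` reads the `PROXIES` string in the format shown in the comment above. As a rough stdlib illustration of that parsing (the dict field names here are illustrative, not crawl4ai's actual attributes):

```python
def parse_proxies(env_value: str):
    """Parse "ip1:port1:user1:pass1,ip2:port2:user2:pass2" into proxy dicts."""
    proxies = []
    for entry in filter(None, env_value.split(",")):
        ip, port, username, password = entry.split(":")
        proxies.append({
            "server": f"http://{ip}:{port}",
            "username": username,
            "password": password,
            "ip": ip,  # kept so the response IP can be verified later
        })
    return proxies

print(parse_proxies("10.0.0.1:8080:alice:secret")[0]["server"])  # → http://10.0.0.1:8080
```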

@@ -45,7 +45,7 @@ Here's an example of crawling GitHub commits across multiple pages while preserv
```python
from crawl4ai.async_configs import CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
```
310
docs/md_v2/advanced/virtual-scroll.md
Normal file
@@ -0,0 +1,310 @@
# Virtual Scroll

Modern websites increasingly use **virtual scrolling** (also called windowed rendering or viewport rendering) to handle large datasets efficiently. This technique only renders visible items in the DOM, replacing content as users scroll. Popular examples include Twitter's timeline, Instagram's feed, and many data tables.

Crawl4AI's Virtual Scroll feature automatically detects and handles these scenarios, ensuring you capture **all content**, not just what's initially visible.

## Understanding Virtual Scroll

### The Problem

Traditional infinite scroll **appends** new content to existing content. Virtual scroll **replaces** content to maintain performance:

```
Traditional Scroll:          Virtual Scroll:
┌─────────────┐             ┌─────────────┐
│ Item 1      │             │ Item 11     │ <- Items 1-10 removed
│ Item 2      │             │ Item 12     │ <- Only visible items
│ ...         │             │ Item 13     │    in DOM
│ Item 10     │             │ Item 14     │
│ Item 11 NEW │             │ Item 15     │
│ Item 12 NEW │             └─────────────┘
└─────────────┘
DOM keeps growing            DOM size stays constant
```

Without proper handling, crawlers only capture the currently visible items, missing the rest of the content.

### Three Scrolling Scenarios

Crawl4AI's Virtual Scroll detects and handles three scenarios:

1. **No Change** - Content doesn't update on scroll (static page or end reached)
2. **Content Appended** - New items added to existing ones (traditional infinite scroll)
3. **Content Replaced** - Items replaced with new ones (true virtual scroll)

Only scenario 3 requires special handling, which Virtual Scroll automates.
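The three-way check can be sketched as a comparison of the item lists captured before and after one scroll (a minimal illustration, not Crawl4AI's internal implementation):

```python
def classify_scroll(before: list, after: list) -> str:
    """Classify what one scroll did to the container's items."""
    if after == before:
        return "no_change"   # scenario 1: static page or end reached
    if len(after) >= len(before) and after[:len(before)] == before:
        return "appended"    # scenario 2: traditional infinite scroll
    return "replaced"        # scenario 3: true virtual scroll

print(classify_scroll(["a", "b"], ["a", "b", "c"]))  # → appended
print(classify_scroll(["a", "b"], ["c", "d"]))       # → replaced
```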
## Basic Usage

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

# Configure virtual scroll
virtual_config = VirtualScrollConfig(
    container_selector="#feed",    # CSS selector for scrollable container
    scroll_count=20,               # Number of scrolls to perform
    scroll_by="container_height",  # How much to scroll each time
    wait_after_scroll=0.5          # Wait time (seconds) after each scroll
)

# Use in crawler configuration
config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    # result.html contains ALL items from the virtual scroll
```

## Configuration Parameters

### VirtualScrollConfig

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `container_selector` | `str` | Required | CSS selector for the scrollable container |
| `scroll_count` | `int` | `10` | Maximum number of scrolls to perform |
| `scroll_by` | `str` or `int` | `"container_height"` | Scroll amount per step |
| `wait_after_scroll` | `float` | `0.5` | Seconds to wait after each scroll |

### Scroll By Options

- `"container_height"` - Scroll by the container's visible height
- `"page_height"` - Scroll by the viewport height
- `500` (integer) - Scroll by exact pixel amount

## Real-World Examples

### Twitter-like Timeline

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, BrowserConfig

async def crawl_twitter_timeline():
    # Twitter replaces tweets as you scroll
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",
        scroll_count=30,
        scroll_by="container_height",
        wait_after_scroll=1.0  # Twitter needs time to load
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config,
        # Optional: Set headless=False to watch it work
        # browser_config=BrowserConfig(headless=False)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )

        # Extract tweet count
        import re
        tweets = re.findall(r'data-testid="tweet"', result.html)
        print(f"Captured {len(tweets)} tweets")
```

### Instagram Grid

```python
async def crawl_instagram_grid():
    # Instagram uses a virtualized grid for performance
    virtual_config = VirtualScrollConfig(
        container_selector="article",  # Main feed container
        scroll_count=50,               # More scrolls for grid layout
        scroll_by=800,                 # Fixed pixel scrolling
        wait_after_scroll=0.8
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config,
        screenshot=True  # Capture final state
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.instagram.com/explore/tags/photography/",
            config=config
        )

        # Count posts
        posts = result.html.count('class="post"')
        print(f"Captured {posts} posts from virtualized grid")
```

### Mixed Content (News Feed)

Some sites mix static and virtualized content:

```python
async def crawl_mixed_feed():
    # Featured articles stay, regular articles virtualize
    virtual_config = VirtualScrollConfig(
        container_selector=".main-feed",
        scroll_count=25,
        scroll_by="container_height",
        wait_after_scroll=0.5
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com",
            config=config
        )

        # Featured articles remain throughout
        featured = result.html.count('class="featured-article"')
        regular = result.html.count('class="regular-article"')

        print(f"Featured (static): {featured}")
        print(f"Regular (virtualized): {regular}")
```

## Virtual Scroll vs scan_full_page

Both features handle dynamic content, but serve different purposes:

| Feature | Virtual Scroll | scan_full_page |
|---------|---------------|----------------|
| **Purpose** | Capture content that's replaced during scroll | Load content that's appended during scroll |
| **Use Case** | Twitter, Instagram, virtual tables | Traditional infinite scroll, lazy-loaded images |
| **DOM Behavior** | Replaces elements | Adds elements |
| **Memory Usage** | Efficient (merges content) | Can grow large |
| **Configuration** | Requires container selector | Works on full page |

### When to Use Which?

Use **Virtual Scroll** when:
- Content disappears as you scroll (Twitter timeline)
- DOM element count stays relatively constant
- You need ALL items from a virtualized list
- Container-based scrolling (not full page)

Use **scan_full_page** when:
- Content accumulates as you scroll
- Images load lazily
- Simple "load more" behavior
- Full page scrolling
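For the append case, the configuration is a flag on `CrawlerRunConfig` rather than a separate config object. A minimal sketch (parameter values are illustrative):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Traditional infinite scroll / lazy loading: scan the full page instead
config = CrawlerRunConfig(
    scan_full_page=True,  # scroll through the whole page so appended content loads
    scroll_delay=0.5      # pause (seconds) between scroll steps
)
```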
## Combining with Extraction

Virtual Scroll works seamlessly with extraction strategies:

```python
from crawl4ai import LLMExtractionStrategy

# Define extraction schema
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "author": {"type": "string"},
            "content": {"type": "string"},
            "timestamp": {"type": "string"}
        }
    }
}

# Configure both virtual scroll and extraction
config = CrawlerRunConfig(
    virtual_scroll_config=VirtualScrollConfig(
        container_selector="#timeline",
        scroll_count=20
    ),
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        schema=schema
    )
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="...", config=config)

    # Extracted data from ALL scrolled content
    import json
    posts = json.loads(result.extracted_content)
    print(f"Extracted {len(posts)} posts from virtual scroll")
```

## Performance Tips

1. **Container Selection**: Be specific with selectors. Using the correct container improves performance.

2. **Scroll Count**: Start conservative and increase as needed:
   ```python
   # Start with fewer scrolls
   virtual_config = VirtualScrollConfig(
       container_selector="#feed",
       scroll_count=10  # Test with 10, increase if needed
   )
   ```

3. **Wait Times**: Adjust based on site speed:
   ```python
   # Fast sites
   wait_after_scroll=0.2

   # Slower sites or heavy content
   wait_after_scroll=1.5
   ```

4. **Debug Mode**: Set `headless=False` to watch scrolling:
   ```python
   browser_config = BrowserConfig(headless=False)
   async with AsyncWebCrawler(config=browser_config) as crawler:
       ...  # Watch the scrolling happen
   ```

## How It Works Internally

1. **Detection Phase**: Scrolls and compares HTML to detect behavior
2. **Capture Phase**: For replaced content, stores HTML chunks at each position
3. **Merge Phase**: Combines all chunks, removing duplicates based on text content
4. **Result**: Complete HTML with all unique items

The deduplication uses normalized text (lowercase, no spaces/symbols) to ensure accurate merging without false positives.
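A toy version of that merge, assuming each captured chunk has already been reduced to a list of item texts (illustrative only, not the library's internal code):

```python
import re

def normalize(text: str) -> str:
    # Lowercase and drop spaces/symbols so cosmetic differences don't block matching
    return re.sub(r"[^a-z0-9]", "", text.lower())

def merge_chunks(chunks):
    """Merge item lists captured at each scroll position, keeping first occurrences."""
    seen, merged = set(), []
    for chunk in chunks:
        for item in chunk:
            key = normalize(item)
            if key and key not in seen:
                seen.add(key)
                merged.append(item)
    return merged

chunks = [["Item 1", "Item 2"], ["Item 2", "Item 3"], ["item  3!", "Item 4"]]
print(merge_chunks(chunks))  # → ['Item 1', 'Item 2', 'Item 3', 'Item 4']
```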
## Error Handling

Virtual Scroll handles errors gracefully:

```python
# If container not found or scrolling fails
result = await crawler.arun(url="...", config=config)

if result.success:
    # Virtual scroll worked or wasn't needed
    print(f"Captured {len(result.html)} characters")
else:
    # Crawl failed entirely
    print(f"Error: {result.error_message}")
```

If the container isn't found, crawling continues normally without virtual scroll.

## Complete Example

See our [comprehensive example](/docs/examples/virtual_scroll_example.py) that demonstrates:
- Twitter-like feeds
- Instagram grids
- Traditional infinite scroll
- Mixed content scenarios
- Performance comparisons

```bash
# Run the examples
cd docs/examples
python virtual_scroll_example.py
```

The example includes a local test server with different scrolling behaviors for experimentation.

244
docs/md_v2/api/adaptive-crawler.md
Normal file
@@ -0,0 +1,244 @@
# AdaptiveCrawler

The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.

## Constructor

```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```

### Parameters

- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.

## Primary Method

### digest()

The main method that performs adaptive crawling starting from a URL with a specific query.

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```

#### Parameters

- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from

#### Returns

- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics

#### Example

```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```

## Properties

### confidence

Current confidence score (0-1) indicating information sufficiency.

```python
@property
def confidence(self) -> float
```

### coverage_stats

Dictionary containing detailed coverage statistics.

```python
@property
def coverage_stats(self) -> Dict[str, float]
```

Returns:
- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score

### is_sufficient

Boolean indicating whether sufficient information has been gathered.

```python
@property
def is_sufficient(self) -> bool
```

### state

Access to the current crawl state.

```python
@property
def state(self) -> CrawlState
```

## Methods

### get_relevant_content()

Retrieve the most relevant content from the knowledge base.

```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```

#### Parameters

- **top_k** (`int`): Number of top relevant documents to return (default: 5)

#### Returns

List of dictionaries containing:
- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata

### print_stats()

Display crawl statistics in formatted output.

```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```

#### Parameters

- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.

### export_knowledge_base()

Export the collected knowledge base to a JSONL file.

```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Output file path for JSONL export

#### Example

```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```

### import_knowledge_base()

Import a previously exported knowledge base.

```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Path to JSONL file to import

## Configuration

The `AdaptiveConfig` class controls the behavior of adaptive crawling:

```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
```

### Example with Custom Config

```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)

adaptive = AdaptiveCrawler(crawler, config=config)
```

## Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```

## See Also

- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
@@ -215,7 +215,7 @@ Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Example schema
```
@@ -217,7 +217,7 @@ Below is an example hooking it all together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
```

992
docs/md_v2/api/c4a-script-reference.md
Normal file
@@ -0,0 +1,992 @@
# C4A-Script API Reference

Complete reference for all C4A-Script commands, syntax, and advanced features.

## Command Categories

### 🧭 Navigation Commands

Navigate between pages and manage browser history.

#### `GO <url>`
Navigate to a specific URL.

**Syntax:**
```c4a
GO <url>
```

**Parameters:**
- `url` - Target URL (string)

**Examples:**
```c4a
GO https://example.com
GO https://api.example.com/login
GO /relative/path
```

**Notes:**
- Supports both absolute and relative URLs
- Automatically handles protocol detection
- Waits for page load to complete

---

#### `RELOAD`
Refresh the current page.

**Syntax:**
```c4a
RELOAD
```

**Examples:**
```c4a
RELOAD
```

**Notes:**
- Equivalent to pressing F5 or clicking browser refresh
- Waits for page reload to complete
- Preserves current URL

---

#### `BACK`
Navigate back in browser history.

**Syntax:**
```c4a
BACK
```

**Examples:**
```c4a
BACK
```

**Notes:**
- Equivalent to clicking browser back button
- Does nothing if no previous page exists
- Waits for navigation to complete

---

#### `FORWARD`
Navigate forward in browser history.

**Syntax:**
```c4a
FORWARD
```

**Examples:**
```c4a
FORWARD
```

**Notes:**
- Equivalent to clicking browser forward button
- Does nothing if no next page exists
- Waits for navigation to complete
### ⏱️ Wait Commands

Control timing and synchronization with page elements.

#### `WAIT <time>`
Wait for a specified number of seconds.

**Syntax:**
```c4a
WAIT <seconds>
```

**Parameters:**
- `seconds` - Number of seconds to wait (number)

**Examples:**
```c4a
WAIT 3
WAIT 1.5
WAIT 10
```

**Notes:**
- Accepts decimal values
- Useful for giving dynamic content time to load
- Non-blocking for other browser operations

---

#### `WAIT <selector> <timeout>`
Wait for an element to appear on the page.

**Syntax:**
```c4a
WAIT `<selector>` <timeout>
```

**Parameters:**
- `selector` - CSS selector for the element (string in backticks)
- `timeout` - Maximum seconds to wait (number)

**Examples:**
```c4a
WAIT `#content` 10
WAIT `.loading-spinner` 5
WAIT `button[type="submit"]` 15
WAIT `.results .item:first-child` 8
```

**Notes:**
- Fails if element doesn't appear within timeout
- More reliable than fixed time waits
- Supports complex CSS selectors

---

#### `WAIT "<text>" <timeout>`
Wait for specific text to appear anywhere on the page.

**Syntax:**
```c4a
WAIT "<text>" <timeout>
```

**Parameters:**
- `text` - Text content to wait for (string in quotes)
- `timeout` - Maximum seconds to wait (number)

**Examples:**
```c4a
WAIT "Loading complete" 10
WAIT "Welcome back" 5
WAIT "Search results" 15
```

**Notes:**
- Case-sensitive text matching
- Searches entire page content
- Useful for dynamic status messages
### 🖱️ Mouse Commands

Simulate mouse interactions and movements.

#### `CLICK <selector>`
Click on an element specified by CSS selector.

**Syntax:**
```c4a
CLICK `<selector>`
```

**Parameters:**
- `selector` - CSS selector for the element (string in backticks)

**Examples:**
```c4a
CLICK `#submit-button`
CLICK `.menu-item:first-child`
CLICK `button[data-action="save"]`
CLICK `a[href="/dashboard"]`
```

**Notes:**
- Waits for element to be clickable
- Scrolls element into view if necessary
- Handles overlapping elements intelligently

---

#### `CLICK <x> <y>`
Click at specific coordinates on the page.

**Syntax:**
```c4a
CLICK <x> <y>
```

**Parameters:**
- `x` - X coordinate in pixels (number)
- `y` - Y coordinate in pixels (number)

**Examples:**
```c4a
CLICK 100 200
CLICK 500 300
CLICK 0 0
```

**Notes:**
- Coordinates are relative to viewport
- Useful when element selectors are unreliable
- Consider responsive design implications

---

#### `DOUBLE_CLICK <selector>`
Double-click on an element.

**Syntax:**
```c4a
DOUBLE_CLICK `<selector>`
```

**Parameters:**
- `selector` - CSS selector for the element (string in backticks)

**Examples:**
```c4a
DOUBLE_CLICK `.file-icon`
DOUBLE_CLICK `#editable-cell`
DOUBLE_CLICK `.expandable-item`
```

**Notes:**
- Triggers dblclick event
- Common for opening files or editing inline content
- Timing between clicks is automatically handled

---

#### `RIGHT_CLICK <selector>`
Right-click on an element to open context menu.

**Syntax:**
```c4a
RIGHT_CLICK `<selector>`
```

**Parameters:**
- `selector` - CSS selector for the element (string in backticks)

**Examples:**
```c4a
RIGHT_CLICK `#context-target`
RIGHT_CLICK `.menu-trigger`
RIGHT_CLICK `img.thumbnail`
```

**Notes:**
- Opens browser/application context menu
- Useful for testing context menu interactions
- May be blocked by some applications

---

#### `SCROLL <direction> <amount>`
Scroll the page in a specified direction.

**Syntax:**
```c4a
SCROLL <direction> <amount>
```

**Parameters:**
- `direction` - Direction to scroll: `UP`, `DOWN`, `LEFT`, `RIGHT`
- `amount` - Number of pixels to scroll (number)

**Examples:**
```c4a
SCROLL DOWN 500
SCROLL UP 200
SCROLL LEFT 100
SCROLL RIGHT 300
```

**Notes:**
- Smooth scrolling animation
- Useful for infinite scroll pages
- Amount can be larger than viewport

---

#### `MOVE <x> <y>`
Move mouse cursor to specific coordinates.

**Syntax:**
```c4a
MOVE <x> <y>
```

**Parameters:**
- `x` - X coordinate in pixels (number)
- `y` - Y coordinate in pixels (number)

**Examples:**
```c4a
MOVE 200 100
MOVE 500 400
```

**Notes:**
- Triggers hover effects
- Useful for testing mouseover interactions
- Does not click, only moves cursor

---

#### `DRAG <x1> <y1> <x2> <y2>`
Drag from one point to another.

**Syntax:**
```c4a
DRAG <x1> <y1> <x2> <y2>
```

**Parameters:**
- `x1`, `y1` - Starting coordinates (numbers)
- `x2`, `y2` - Ending coordinates (numbers)

**Examples:**
```c4a
DRAG 100 100 500 300
DRAG 0 200 400 200
```

**Notes:**
- Simulates click, drag, and release
- Useful for sliders, resizing, reordering
- Smooth drag animation
### ⌨️ Keyboard Commands

Simulate keyboard input and key presses.

#### `TYPE "<text>"`
Type text into the currently focused element.

**Syntax:**
```c4a
TYPE "<text>"
```

**Parameters:**
- `text` - Text to type (string in quotes)

**Examples:**
```c4a
TYPE "Hello, World!"
TYPE "user@example.com"
TYPE "Password123!"
```

**Notes:**
- Requires an input element to be focused
- Types character by character with realistic timing
- Supports special characters and Unicode

---

#### `TYPE $<variable>`
Type the value of a variable.

**Syntax:**
```c4a
TYPE $<variable>
```

**Parameters:**
- `variable` - Variable name (without quotes)

**Examples:**
```c4a
SETVAR email = "user@example.com"
TYPE $email
```

**Notes:**
- Variable must be defined with SETVAR first
- Variable values are strings
- Useful for reusable credentials or data

---

#### `PRESS <key>`
Press and release a special key.

**Syntax:**
```c4a
PRESS <key>
```

**Parameters:**
- `key` - Key name (see supported keys below)

**Supported Keys:**
- `Tab`, `Enter`, `Escape`, `Space`
- `ArrowUp`, `ArrowDown`, `ArrowLeft`, `ArrowRight`
- `Delete`, `Backspace`
- `Home`, `End`, `PageUp`, `PageDown`

**Examples:**
```c4a
PRESS Tab
PRESS Enter
PRESS Escape
PRESS ArrowDown
```

**Notes:**
- Simulates actual key press and release
- Useful for form navigation and shortcuts
- Case-sensitive key names

---

#### `KEY_DOWN <key>`
Hold down a modifier key.

**Syntax:**
```c4a
KEY_DOWN <key>
```

**Parameters:**
- `key` - Modifier key: `Shift`, `Control`, `Alt`, `Meta`

**Examples:**
```c4a
KEY_DOWN Shift
KEY_DOWN Control
```

**Notes:**
- Must be paired with KEY_UP
- Useful for key combinations
- Meta key is Cmd on Mac, Windows key on PC

---

#### `KEY_UP <key>`
Release a modifier key.

**Syntax:**
```c4a
KEY_UP <key>
```

**Parameters:**
- `key` - Modifier key: `Shift`, `Control`, `Alt`, `Meta`

**Examples:**
```c4a
KEY_UP Shift
KEY_UP Control
```

**Notes:**
- Must be paired with KEY_DOWN
- Releases the specified modifier key
- Good practice to always release held keys
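Held modifiers combine with other key commands. For example, a sketch of Shift+Tab to move focus backwards through a form (assumes, as with keyboard input generally, that the held modifier applies to the intervening `PRESS`):

```c4a
KEY_DOWN Shift
PRESS Tab
KEY_UP Shift
```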
|
||||
|
||||
---

#### `CLEAR <selector>`
Clear the content of an input field.

**Syntax:**
```c4a
CLEAR `<selector>`
```

**Parameters:**
- `selector` - CSS selector for input element (string in backticks)

**Examples:**
```c4a
CLEAR `#search-box`
CLEAR `input[name="email"]`
CLEAR `.form-input:first-child`
```

**Notes:**
- Works with input and textarea elements
- Faster than selecting all and deleting
- Triggers appropriate change events

---

#### `SET <selector> "<value>"`
Set the value of an input field directly.

**Syntax:**
```c4a
SET `<selector>` "<value>"
```

**Parameters:**
- `selector` - CSS selector for input element (string in backticks)
- `value` - Value to set (string in quotes)

**Examples:**
```c4a
SET `#email` "user@example.com"
SET `#age` "25"
SET `textarea#message` "Hello, this is a test message."
```

**Notes:**
- Directly sets the value without a typing animation
- Faster than TYPE for long text
- Triggers change and input events

### 🔀 Control Flow Commands

Add conditional logic and loops to your scripts.

#### `IF (EXISTS <selector>) THEN <command>`
Execute a command if an element exists.

**Syntax:**
```c4a
IF (EXISTS `<selector>`) THEN <command>
```

**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- `command` - Command to execute if the condition is true

**Examples:**
```c4a
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
IF (EXISTS `#popup-modal`) THEN CLICK `.close-button`
IF (EXISTS `.error-message`) THEN RELOAD
```

**Notes:**
- Checks for element existence at time of execution
- Does not wait for the element to appear
- Can be combined with ELSE

---

#### `IF (EXISTS <selector>) THEN <command> ELSE <command>`
Execute a command based on element existence.

**Syntax:**
```c4a
IF (EXISTS `<selector>`) THEN <command> ELSE <command>
```

**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- First `command` - Executed if the condition is true
- Second `command` - Executed if the condition is false

**Examples:**
```c4a
IF (EXISTS `.user-menu`) THEN CLICK `.logout` ELSE CLICK `.login`
IF (EXISTS `.loading`) THEN WAIT 5 ELSE CLICK `#continue`
```

**Notes:**
- Exactly one command will be executed
- Useful for handling different page states
- Commands must be on the same line

---

#### `IF (NOT EXISTS <selector>) THEN <command>`
Execute a command if an element does not exist.

**Syntax:**
```c4a
IF (NOT EXISTS `<selector>`) THEN <command>
```

**Parameters:**
- `selector` - CSS selector to check (string in backticks)
- `command` - Command to execute if the element doesn't exist

**Examples:**
```c4a
IF (NOT EXISTS `.logged-in`) THEN GO /login
IF (NOT EXISTS `.results`) THEN CLICK `#search-button`
```

**Notes:**
- Inverse of the EXISTS condition
- Useful for error handling
- Can check for missing required elements

---

#### `IF (<javascript>) THEN <command>`
Execute a command based on a JavaScript condition.

**Syntax:**
```c4a
IF (`<javascript>`) THEN <command>
```

**Parameters:**
- `javascript` - JavaScript expression evaluated for truthiness (string in backticks)
- `command` - Command to execute if the condition is true

**Examples:**
```c4a
IF (`window.innerWidth < 768`) THEN CLICK `.mobile-menu`
IF (`document.readyState === "complete"`) THEN CLICK `#start`
IF (`localStorage.getItem("user")`) THEN GO /dashboard
```

**Notes:**
- The JavaScript executes in the browser context
- Any truthy result triggers the command
- Has access to all browser APIs and globals

---

#### `REPEAT (<command>, <count>)`
Repeat a command a specific number of times.

**Syntax:**
```c4a
REPEAT (<command>, <count>)
```

**Parameters:**
- `command` - Command to repeat
- `count` - Number of times to repeat (number)

**Examples:**
```c4a
REPEAT (SCROLL DOWN 300, 5)
REPEAT (PRESS Tab, 3)
REPEAT (CLICK `.load-more`, 10)
```

**Notes:**
- Executes the command exactly `count` times
- Useful for pagination, scrolling, and navigation
- No delay between repetitions (add WAIT if needed)

---

#### `REPEAT (<command>, <condition>)`
Repeat a command while a condition is true.

**Syntax:**
```c4a
REPEAT (<command>, `<condition>`)
```

**Parameters:**
- `command` - Command to repeat
- `condition` - JavaScript condition to check (string in backticks)

**Examples:**
```c4a
REPEAT (SCROLL DOWN 500, `document.querySelector(".load-more")`)
REPEAT (PRESS ArrowDown, `window.scrollY < document.body.scrollHeight`)
```

**Notes:**
- The condition is checked before each iteration
- The JavaScript condition must evaluate to a boolean
- Be careful to avoid infinite loops

### 💾 Variables and Data

Store and manipulate data within scripts.

#### `SETVAR <name> = "<value>"`
Create or update a variable.

**Syntax:**
```c4a
SETVAR <name> = "<value>"
```

**Parameters:**
- `name` - Variable name (alphanumeric, underscore)
- `value` - Variable value (string in quotes)

**Examples:**
```c4a
SETVAR username = "john@example.com"
SETVAR password = "secret123"
SETVAR base_url = "https://api.example.com"
SETVAR counter = "0"
```

**Notes:**
- Variables are global within the script scope
- Values are always strings
- Can be used with the TYPE command via the `$variable` syntax

---
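The `$variable` syntax mentioned in the notes above looks like this in practice (a short sketch; the `#comment` selector is hypothetical):

```c4a
SETVAR greeting = "Hello from C4A-Script"
CLICK `#comment`
TYPE $greeting
```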

#### `EVAL <javascript>`
Execute arbitrary JavaScript code.

**Syntax:**
```c4a
EVAL `<javascript>`
```

**Parameters:**
- `javascript` - JavaScript code to execute (string in backticks)

**Examples:**
```c4a
EVAL `console.log("Script started")`
EVAL `window.scrollTo(0, 0)`
EVAL `localStorage.setItem("test", "value")`
EVAL `document.title = "Automated Test"`
```

**Notes:**
- Full access to browser JavaScript APIs
- Useful for custom logic and debugging
- Return values are not captured
- Be careful with security implications

### 📝 Comments and Documentation

#### `# <comment>`
Add comments to scripts for documentation.

**Syntax:**
```c4a
# <comment text>
```

**Examples:**
```c4a
# This script logs into the application
# Step 1: Navigate to login page
GO /login

# Step 2: Fill credentials
TYPE "user@example.com"
```

**Notes:**
- Comments are ignored during execution
- Useful for documentation and debugging
- Can appear anywhere in a script
- Supports multi-line documentation blocks

### 🔧 Procedures (Advanced)

Define reusable command sequences.

#### `PROC <name> ... ENDPROC`
Define a reusable procedure.

**Syntax:**
```c4a
PROC <name>
<commands>
ENDPROC
```

**Parameters:**
- `name` - Procedure name (alphanumeric, underscore)
- `commands` - Commands to include in the procedure

**Examples:**
```c4a
PROC login
CLICK `#email`
TYPE $email
CLICK `#password`
TYPE $password
CLICK `#submit`
ENDPROC

PROC handle_popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-modal`) THEN CLICK `.close`
ENDPROC
```

**Notes:**
- Procedures must be defined before use
- Support nested command structures
- Variables are shared with the main script scope

---

#### `<procedure_name>`
Call a defined procedure.

**Syntax:**
```c4a
<procedure_name>
```

**Examples:**
```c4a
# Define procedure first
PROC setup
GO /login
WAIT `#form` 5
ENDPROC

# Call procedure
setup
login
```

**Notes:**
- The procedure must be defined before it is called
- Can be called multiple times
- No parameters supported (use variables instead)

## Error Handling Best Practices

### 1. Always Use Waits
```c4a
# Bad - element might not be ready
CLICK `#button`

# Good - wait for element first
WAIT `#button` 5
CLICK `#button`
```

### 2. Handle Optional Elements
```c4a
# Check before interacting
IF (EXISTS `.popup`) THEN CLICK `.close`
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`

# Then proceed with main flow
CLICK `#main-action`
```

### 3. Use Descriptive Variables
```c4a
# Set up reusable data
SETVAR admin_email = "admin@company.com"
SETVAR test_password = "TestPass123!"
SETVAR staging_url = "https://staging.example.com"

# Use throughout script
GO $staging_url
TYPE $admin_email
```

### 4. Add Debugging Information
```c4a
# Log progress
EVAL `console.log("Starting login process")`
GO /login

# Verify page state
IF (`document.title.includes("Login")`) THEN EVAL `console.log("On login page")`

# Continue with login
TYPE $username
```

## Common Patterns

### Login Flow
```c4a
# Complete login automation
SETVAR email = "user@example.com"
SETVAR password = "mypassword"

GO /login
WAIT `#login-form` 5

# Handle optional cookie banner
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`

# Fill and submit form
CLICK `#email`
TYPE $email
PRESS Tab
TYPE $password
CLICK `button[type="submit"]`

# Wait for redirect
WAIT `.dashboard` 10
```

### Infinite Scroll
```c4a
# Load all content with infinite scroll
GO /products

# Scroll and load more content
REPEAT (SCROLL DOWN 500, `document.querySelector(".load-more")`)

# Alternative: Fixed number of scrolls
REPEAT (SCROLL DOWN 800, 10)
WAIT 2
```

### Form Validation
```c4a
# Handle form with validation
SET `#email` "invalid-email"
CLICK `#submit`

# Check for validation error
IF (EXISTS `.error-email`) THEN SET `#email` "valid@example.com"

# Retry submission
CLICK `#submit`
WAIT `.success-message` 5
```

### Multi-step Process
```c4a
# Complex multi-step workflow
PROC navigate_to_step
CLICK `.next-button`
WAIT `.step-content` 5
ENDPROC

# Step 1
WAIT `.step-1` 5
SET `#name` "John Doe"
navigate_to_step

# Step 2
SET `#email` "john@example.com"
navigate_to_step

# Step 3
CLICK `#submit-final`
WAIT `.confirmation` 10
```

## Integration with Crawl4AI

Use C4A-Script with Crawl4AI for dynamic content interaction:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Define interaction script
script = """
# Handle dynamic content loading
WAIT `.content` 5
IF (EXISTS `.load-more-button`) THEN CLICK `.load-more-button`
WAIT `.additional-content` 5

# Accept cookies if needed
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-all`
"""

config = CrawlerRunConfig(
    c4a_script=script,
    wait_for=".content",
    screenshot=True
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
    print(result.markdown)
```

This reference covers all available C4A-Script commands and patterns. For interactive learning, try the [tutorial](../examples/c4a_script/tutorial/) or [live demo](https://docs.crawl4ai.com/c4a-script/demo).

181
docs/md_v2/api/digest.md
Normal file
@@ -0,0 +1,181 @@
# digest()

The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.

## Method Signature

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```

## Parameters

### start_url
- **Type**: `str`
- **Required**: Yes
- **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.

### query
- **Type**: `str`
- **Required**: Yes
- **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.

### resume_from
- **Type**: `Optional[Union[str, Path]]`
- **Default**: `None`
- **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.

## Return Value

Returns a `CrawlState` object containing:

- **crawled_urls** (`Set[str]`): All URLs that have been crawled
- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
- **pending_links** (`List[Link]`): Links discovered but not yet crawled
- **metrics** (`Dict[str, float]`): Performance and quality metrics
- **query** (`str`): The original query
- Additional statistical information for scoring
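The field list above can be pictured as a simple record. This is not the actual `CrawlState` class, just a minimal dataclass sketch of the documented fields, useful for reasoning about what a crawl returns:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CrawlStateSketch:
    # Illustrative stand-in for CrawlState; field names mirror the list above
    query: str
    crawled_urls: set[str] = field(default_factory=set)
    knowledge_base: list[Any] = field(default_factory=list)  # CrawlResult objects in practice
    pending_links: list[Any] = field(default_factory=list)   # Link objects in practice
    metrics: dict[str, float] = field(default_factory=dict)

state = CrawlStateSketch(query="async await context managers")
state.crawled_urls.add("https://docs.python.org/3/")
state.metrics["confidence"] = 0.82
```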

## How It Works

The `digest()` method implements an intelligent crawling algorithm:

1. **Initial Crawl**: Starts from the provided URL
2. **Link Analysis**: Evaluates all discovered links for relevance
3. **Scoring**: Uses three metrics to assess information sufficiency:
   - **Coverage**: How well the query terms are covered
   - **Consistency**: Information coherence across pages
   - **Saturation**: Diminishing returns detection
4. **Adaptive Selection**: Chooses the most promising links to follow
5. **Stopping Decision**: Automatically stops when confidence threshold is reached
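The loop those five steps describe can be sketched in plain Python. This is a toy illustration with caller-supplied stand-ins for scoring and fetching, not the library's implementation:

```python
def adaptive_crawl_sketch(start_url, score_link, fetch_links,
                          confidence_of, threshold=0.8, max_pages=20, top_k=3):
    """Toy version of the digest() loop; all callables are illustrative stand-ins."""
    frontier, crawled = [start_url], set()
    while frontier and len(crawled) < max_pages:
        url = frontier.pop(0)
        crawled.add(url)                              # 1. crawl the page
        if confidence_of(crawled) >= threshold:       # 5. stopping decision
            break
        links = [l for l in fetch_links(url) if l not in crawled]  # 2. link analysis
        links.sort(key=score_link, reverse=True)      # 3-4. score and select
        frontier.extend(links[:top_k])
    return crawled

# Tiny fake site: each page links onward
pages = {"a": ["b", "c"], "b": ["c"], "c": []}
crawled = adaptive_crawl_sketch(
    "a",
    score_link=len,                                  # stand-in relevance score
    fetch_links=lambda u: pages[u],
    confidence_of=lambda seen: len(seen) / 3,        # stand-in confidence metric
)
```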

## Examples

### Basic Usage

```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)

    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )

    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")
```

### With Configuration

```python
config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,              # Allow more pages
    top_k_links=3              # Follow top 3 links per page
)

adaptive = AdaptiveCrawler(crawler, config=config)

state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```

### Resuming a Previous Crawl

```python
# First crawl - may be interrupted
state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
)

# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")

# Later, resume from saved state
state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
)
```

### With Progress Monitoring

```python
state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")

# View detailed statistics
adaptive.print_stats(detailed=True)
```

## Query Best Practices

1. **Be Specific**: Use descriptive terms that appear in target content
   ```python
   # Good
   query = "python async context managers implementation"

   # Too broad
   query = "python programming"
   ```

2. **Include Key Terms**: Add technical terms you expect to find
   ```python
   query = "oauth2 jwt refresh tokens authorization"
   ```

3. **Multiple Concepts**: Combine related concepts for comprehensive coverage
   ```python
   query = "rest api pagination sorting filtering"
   ```

## Performance Considerations

- **Initial URL**: Choose a page with good navigation (e.g., documentation index)
- **Query Length**: 3-8 terms typically work best
- **Link Density**: Sites with clear navigation crawl more efficiently
- **Caching**: Enable caching for repeated crawls of the same domain

## Error Handling

```python
try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config
```

## Stopping Conditions

The crawl stops when any of these conditions are met:

1. **Confidence Threshold**: Reached the configured confidence level
2. **Page Limit**: Crawled the maximum number of pages
3. **Diminishing Returns**: Expected information gain below threshold
4. **No Relevant Links**: No promising links remain to follow
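The four conditions combine as a simple disjunction. A hedged sketch of that check (parameter names here are illustrative, not the library's):

```python
def should_stop(confidence, pages_crawled, expected_gain, pending_links,
                confidence_threshold=0.8, max_pages=20, min_gain=0.01):
    """True when any documented stopping condition is met (illustrative only)."""
    return (
        confidence >= confidence_threshold   # 1. confidence threshold reached
        or pages_crawled >= max_pages        # 2. page limit reached
        or expected_gain < min_gain          # 3. diminishing returns
        or not pending_links                 # 4. no relevant links remain
    )
```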

## See Also

- [AdaptiveCrawler Class](adaptive-crawler.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Configuration Options](../core/adaptive-crawling.md#configuration-options)
@@ -169,7 +169,46 @@ Use these for link-level content filtering (often to keep crawls “internal”

---

## 2.2 Helper Methods

### H) **Virtual Scroll Configuration**

| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |

When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:

```python
from crawl4ai import VirtualScrollConfig

virtual_config = VirtualScrollConfig(
    container_selector="#timeline",  # CSS selector for scrollable container
    scroll_count=30,                 # Number of times to scroll
    scroll_by="container_height",    # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
    wait_after_scroll=0.5            # Seconds to wait after each scroll for content to load
)

config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)
```

**VirtualScrollConfig Parameters:**

| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) |
| **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform |
| **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) |
| **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load |

**When to use Virtual Scroll vs scan_full_page:**
- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)

See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detailed examples.
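Why "replaced" content needs special handling: earlier items leave the DOM as you scroll, so each snapshot must be captured and merged. A toy sketch of that merge, with plain strings standing in for captured HTML chunks:

```python
def merge_scroll_snapshots(snapshots):
    """Deduplicate items seen across virtual-scroll snapshots, preserving order."""
    seen, merged = set(), []
    for snapshot in snapshots:       # one snapshot per scroll step
        for item in snapshot:
            if item not in seen:     # adjacent snapshots usually overlap
                seen.add(item)
                merged.append(item)
    return merged

timeline = merge_scroll_snapshots([
    ["post-1", "post-2", "post-3"],
    ["post-3", "post-4", "post-5"],  # post-1/2 already recycled out of the DOM
])
```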

---

## 2.2 Helper Methods

Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
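A minimal sketch of the clone-with-overrides pattern this describes (a stand-in dataclass, not the real config classes; field names are illustrative):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass
class RunConfigSketch:
    # Stand-in for CrawlerRunConfig; fields are illustrative
    screenshot: bool = False
    wait_for: Optional[str] = None

    def clone(self, **overrides):
        """Return a copy with the given fields replaced, like config.clone(...)."""
        return replace(self, **overrides)

base = RunConfigSketch(wait_for=".content")
with_shot = base.clone(screenshot=True)  # base itself is unchanged
```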

@@ -259,7 +298,7 @@ LLMConfig is useful to pass LLM provider config to strategies and functions that
## 3.1 Parameters
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provoder to use.
| **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use.
| **`api_token`** |1.Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables <br/> 2. API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` <br/> 3. Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"` | API token to use for the given provider
| **`base_url`** |Optional. Custom API endpoint | If your provider has a custom endpoint
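The `api_token` behaviors listed in the table (provider-default environment variable, literal token, `env:` prefix) can be sketched as a small resolver. This illustrates the documented lookup order only; it is not the library's code, and the env-var mapping here is an assumption:

```python
import os
from typing import Optional

# Illustrative mapping; the real library derives the env var per provider
PROVIDER_ENV_VARS = {"gemini": "GEMINI_API_KEY", "groq": "GROQ_API_KEY"}

def resolve_api_token(provider: str, api_token: Optional[str]) -> Optional[str]:
    """Mimic the documented api_token resolution for LLMConfig."""
    if api_token is None:                      # 1. fall back to the provider's env var
        env_var = PROVIDER_ENV_VARS.get(provider.split("/")[0])
        return os.environ.get(env_var) if env_var else None
    if api_token.startswith("env:"):           # 3. explicit "env:" prefix
        return os.environ.get(api_token[4:].strip())
    return api_token                           # 2. literal token

os.environ["GROQ_API_KEY"] = "demo-token"
token = resolve_api_token("groq/llama3-70b-8192", "env: GROQ_API_KEY")
```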

@@ -169,7 +169,7 @@ OverlappingWindowChunking(

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy
from crawl4ai import LLMConfig

# Define schema
```
@@ -247,7 +247,7 @@ async with AsyncWebCrawler() as crawler:
### CSS Extraction

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy

# Define schema
schema = {
```

BIN
docs/md_v2/apps/assets/DankMono-Bold.woff2
Normal file
BIN
docs/md_v2/apps/assets/DankMono-Italic.woff2
Normal file
BIN
docs/md_v2/apps/assets/DankMono-Regular.woff2
Normal file
396
docs/md_v2/apps/c4a-script/README.md
Normal file
@@ -0,0 +1,396 @@

# C4A-Script Interactive Tutorial

A comprehensive web-based tutorial for learning and experimenting with C4A-Script - Crawl4AI's visual web automation language.

## 🚀 Quick Start

### Prerequisites
- Python 3.7+
- Modern web browser (Chrome, Firefox, Safari, Edge)

### Running the Tutorial

1. **Clone and Navigate**
   ```bash
   git clone https://github.com/unclecode/crawl4ai.git
   cd crawl4ai/docs/examples/c4a_script/tutorial/
   ```

2. **Install Dependencies**
   ```bash
   pip install flask
   ```

3. **Launch the Server**
   ```bash
   python server.py
   ```

4. **Open in Browser**
   ```
   http://localhost:8080
   ```

**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)

### Try Your First Script

```c4a
# Basic interaction
GO playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
CLICK `#start-tutorial`
```

## 🎯 What You'll Learn

### Core Features
- **📝 Text Editor**: Write C4A-Script with syntax highlighting
- **🧩 Visual Editor**: Build scripts using the drag-and-drop Blockly interface
- **🎬 Recording Mode**: Capture browser actions and auto-generate scripts
- **⚡ Live Execution**: Run scripts in real-time with instant feedback
- **📊 Timeline View**: Visualize and edit automation steps

## 📚 Tutorial Content

### Basic Commands
- **Navigation**: `GO url`
- **Waiting**: `WAIT selector timeout` or `WAIT seconds`
- **Clicking**: `CLICK selector`
- **Typing**: `TYPE "text"`
- **Scrolling**: `SCROLL DOWN/UP amount`

### Control Flow
- **Conditionals**: `IF (condition) THEN action`
- **Loops**: `REPEAT (action, condition)`
- **Procedures**: Define reusable command sequences

### Advanced Features
- **JavaScript evaluation**: `EVAL code`
- **Variables**: `SET name = "value"`
- **Complex selectors**: CSS selectors in backticks

## 🎮 Interactive Playground Features

The tutorial includes a fully interactive web app with:

### 1. **Authentication System**
- Login form with validation
- Session management
- Protected content

### 2. **Dynamic Content**
- Infinite scroll products
- Pagination controls
- Load more buttons

### 3. **Complex Forms**
- Multi-step wizards
- Dynamic field visibility
- Form validation

### 4. **Interactive Elements**
- Tabs and accordions
- Modals and popups
- Expandable content

### 5. **Data Tables**
- Sortable columns
- Search functionality
- Export options

## 🛠️ Tutorial Features

### Live Code Editor
- Syntax highlighting
- Real-time compilation
- Error messages with suggestions

### JavaScript Output Viewer
- See generated JavaScript code
- Edit and test JS directly
- Understand the compilation

### Visual Execution
- Step-by-step progress
- Element highlighting
- Console output

### Example Scripts
Load pre-written examples demonstrating:
- Cookie banner handling
- Login workflows
- Infinite scroll automation
- Multi-step form completion
- Complex interaction sequences

## 📖 Tutorial Sections

### 1. Getting Started
Learn basic commands and syntax:
```c4a
GO https://example.com
WAIT `.content` 5
CLICK `.button`
```

### 2. Handling Dynamic Content
Master waiting strategies and conditionals:
```c4a
IF (EXISTS `.popup`) THEN CLICK `.close`
WAIT `.results` 10
```

### 3. Form Automation
Fill and submit forms:
```c4a
CLICK `#email`
TYPE "user@example.com"
CLICK `button[type="submit"]`
```

### 4. Advanced Workflows
Build complex automation flows:
```c4a
PROC login
CLICK `#username`
TYPE $username
CLICK `#password`
TYPE $password
CLICK `#login-btn`
ENDPROC

SET username = "demo"
SET password = "pass123"
login
```
|
||||
|
||||
## 🎯 Practice Challenges
|
||||
|
||||
### Challenge 1: Cookie & Popups
|
||||
Handle the cookie banner and newsletter popup that appear on page load.
|
||||
|
||||
### Challenge 2: Complete Login
|
||||
Successfully log into the application using the demo credentials.
|
||||
|
||||
### Challenge 3: Load All Products
|
||||
Use infinite scroll to load all 100 products in the catalog.
|
||||
|
||||
### Challenge 4: Multi-step Survey
|
||||
Complete the entire multi-step survey form.
|
||||
|
||||
### Challenge 5: Full Workflow
|
||||
Create a script that logs in, browses products, and exports data.
|
||||
|
||||
## 💡 Tips & Tricks
|
||||
|
||||
### 1. Use Specific Selectors
|
||||
```c4a
|
||||
# Good - specific
|
||||
CLICK `button.submit-order`
|
||||
|
||||
# Bad - too generic
|
||||
CLICK `button`
|
||||
```
|
||||
|
||||
### 2. Always Handle Popups
|
||||
```c4a
|
||||
# Check for common popups
|
||||
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
|
||||
IF (EXISTS `.newsletter-modal`) THEN CLICK `.close`
|
||||
```
|
||||
|
||||
### 3. Add Appropriate Waits
|
||||
```c4a
|
||||
# Wait for elements before interacting
|
||||
WAIT `.form` 5
|
||||
CLICK `#submit`
|
||||
```
|
||||
|
||||
### 4. Use Procedures for Reusability
|
||||
```c4a
|
||||
PROC handle_popups
|
||||
IF (EXISTS `.popup`) THEN CLICK `.close`
|
||||
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
|
||||
ENDPROC
|
||||
|
||||
# Use anywhere
|
||||
handle_popups
|
||||
```
|
||||
|
||||
## 🔧 Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **"Element not found"**
|
||||
- Add a WAIT before clicking
|
||||
- Check selector specificity
|
||||
- Verify element exists with IF
|
||||
|
||||
2. **"Timeout waiting for selector"**
|
||||
- Increase timeout value
|
||||
- Check if element is dynamically loaded
|
||||
- Verify selector is correct
|
||||
|
||||
3. **"Missing THEN keyword"**
|
||||
- All IF statements need THEN
|
||||
- Format: `IF (condition) THEN action`
|
||||
|
||||
## 🚀 Using with Crawl4AI
|
||||
|
||||
Once you've mastered C4A-Script in the tutorial, use it with Crawl4AI:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
url="https://example.com",
|
||||
c4a_script="""
|
||||
WAIT `.content` 5
|
||||
IF (EXISTS `.load-more`) THEN CLICK `.load-more`
|
||||
WAIT `.new-content` 3
|
||||
"""
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(config=config)
|
||||
```
|
||||
|
||||
## 📝 Example Scripts

Check the `scripts/` folder for complete examples:
- `01-basic-interaction.c4a` - Getting started
- `02-login-flow.c4a` - Authentication
- `03-infinite-scroll.c4a` - Dynamic content
- `04-multi-step-form.c4a` - Complex forms
- `05-complex-workflow.c4a` - Full automation

## 🏗️ Developer Guide

### Project Architecture

```
tutorial/
├── server.py              # Flask application server
├── assets/                # Tutorial-specific assets
│   ├── app.js             # Main application logic
│   ├── c4a-blocks.js      # Custom Blockly blocks
│   ├── c4a-generator.js   # Code generation
│   ├── blockly-manager.js # Blockly integration
│   └── styles.css         # Main styling
├── playground/            # Interactive demo environment
│   ├── index.html         # Demo web application
│   ├── app.js             # Demo app logic
│   └── styles.css         # Demo styling
├── scripts/               # Example C4A scripts
└── index.html             # Main tutorial interface
```

### Key Components

#### 1. TutorialApp (`assets/app.js`)
Main application controller managing:
- Code editor integration (CodeMirror)
- Script execution and browser preview
- Tutorial navigation and lessons
- State management and persistence

#### 2. BlocklyManager (`assets/blockly-manager.js`)
Visual programming interface:
- Custom C4A-Script block definitions
- Bidirectional sync between visual blocks and text
- Real-time code generation
- Dark theme integration

#### 3. Recording System
Powers the recording functionality:
- Browser event capture
- Smart event grouping and filtering
- Automatic C4A-Script generation
- Timeline visualization

### Customization

#### Adding New Commands
1. **Define Block** (`assets/c4a-blocks.js`)
2. **Add Generator** (`assets/c4a-generator.js`)
3. **Update Parser** (`assets/blockly-manager.js`)

#### Themes and Styling
- Main styles: `assets/styles.css`
- Theme variables: CSS custom properties
- Dark mode: Auto-applied based on system preference

### Configuration

```python
# server.py configuration
PORT = 8080
DEBUG = True
THREADED = True
```

### API Endpoints

- `GET /` - Main tutorial interface
- `GET /playground/` - Interactive demo environment
- `POST /execute` - Script execution endpoint
- `GET /examples/<script>` - Load example scripts
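
The endpoints above can be exercised from Python. A minimal sketch follows; note that the `{"script": ...}` JSON payload shape is an assumption for illustration (the endpoint list does not document the request body), so check `server.py` for the field name the handler actually reads:

```python
import json

BASE_URL = "http://localhost:8080"  # default PORT from server.py

def build_execute_request(script: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for a POST /execute call.

    The {"script": ...} payload shape is an assumption, not a documented
    contract; verify it against the /execute handler in server.py.
    """
    body = json.dumps({"script": script}).encode("utf-8")
    return f"{BASE_URL}/execute", body

url, body = build_execute_request('CLICK `#submit`')
```

Once the tutorial server is running, the built request can be sent with `urllib.request` or `requests`.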

## 🔧 Troubleshooting

### Common Issues

**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9

# Or use different port
python server.py --port 8081
```

**Blockly Not Loading**
- Check browser console for JavaScript errors
- Verify all static files are served correctly
- Ensure proper script loading order

**Recording Issues**
- Verify iframe permissions
- Check cross-origin communication
- Ensure event listeners are attached

### Debug Mode
Enable detailed logging by setting `DEBUG = True` in `assets/app.js`

## 📚 Additional Resources

- **[C4A-Script Documentation](../../md_v2/core/c4a-script.md)** - Complete language guide
- **[API Reference](../../md_v2/api/c4a-script-reference.md)** - Detailed command documentation
- **[Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try without installation
- **[Example Scripts](../)** - More automation examples

## 🤝 Contributing

### Bug Reports
1. Check existing issues on GitHub
2. Provide minimal reproduction steps
3. Include browser and system information
4. Add relevant console logs

### Feature Requests
1. Fork the repository
2. Create feature branch: `git checkout -b feature/my-feature`
3. Test thoroughly with different browsers
4. Update documentation
5. Submit pull request

### Code Style
- Use consistent indentation (2 spaces for JS, 4 for Python)
- Add comments for complex logic
- Follow existing naming conventions
- Test with multiple browsers

---

**Happy Automating!** 🎉

Need help? Check our [documentation](https://docs.crawl4ai.com) or open an issue on [GitHub](https://github.com/unclecode/crawl4ai).
BIN docs/md_v2/apps/c4a-script/assets/DankMono-Bold.woff2 (new file)
BIN docs/md_v2/apps/c4a-script/assets/DankMono-Italic.woff2 (new file)
BIN docs/md_v2/apps/c4a-script/assets/DankMono-Regular.woff2 (new file)
906 docs/md_v2/apps/c4a-script/assets/app.css (new file)
@@ -0,0 +1,906 @@
/* ================================================================
   C4A-Script Tutorial - App Layout CSS
   Terminal theme with Dank Mono font
   ================================================================ */

/* CSS Variables */
:root {
  --bg-primary: #070708;
  --bg-secondary: #0e0e10;
  --bg-tertiary: #1a1a1b;
  --border-color: #2a2a2c;
  --border-hover: #3a3a3c;
  --text-primary: #e0e0e0;
  --text-secondary: #8b8b8d;
  --text-muted: #606065;
  --primary-color: #0fbbaa;
  --primary-hover: #0da89a;
  --primary-dim: #0a8577;
  --error-color: #ff5555;
  --warning-color: #ffb86c;
  --success-color: #50fa7b;
  --info-color: #8be9fd;
  --code-bg: #1e1e20;
  --modal-overlay: rgba(0, 0, 0, 0.8);
}

/* Base Reset */
* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

/* Fonts */
@font-face {
  font-family: 'Dank Mono';
  src: url('DankMono-Regular.woff2') format('woff2');
  font-weight: 400;
  font-style: normal;
}

@font-face {
  font-family: 'Dank Mono';
  src: url('DankMono-Bold.woff2') format('woff2');
  font-weight: 700;
  font-style: normal;
}

@font-face {
  font-family: 'Dank Mono';
  src: url('DankMono-Italic.woff2') format('woff2');
  font-weight: 400;
  font-style: italic;
}

/* Body & App Container */
body {
  font-family: 'Dank Mono', 'Monaco', 'Consolas', monospace;
  background: var(--bg-primary);
  color: var(--text-primary);
  font-size: 14px;
  line-height: 1.6;
  overflow: hidden;
}

.app-container {
  display: flex;
  height: 100vh;
  width: 100vw;
  overflow: hidden;
}

/* Panels */
.editor-panel,
.playground-panel {
  display: flex;
  flex-direction: column;
  height: 100%;
  overflow: hidden;
}

.editor-panel {
  flex: 1;
  background: var(--bg-secondary);
  border-right: 1px solid var(--border-color);
  min-width: 400px;
}

.playground-panel {
  flex: 1;
  background: var(--bg-primary);
  min-width: 400px;
}

/* Panel Headers */
.panel-header {
  display: flex;
  justify-content: space-between;
  align-items: center;
  padding: 12px 16px;
  background: var(--bg-tertiary);
  border-bottom: 1px solid var(--border-color);
  flex-shrink: 0;
}

.panel-header h2 {
  font-size: 16px;
  font-weight: 600;
  color: var(--primary-color);
  margin: 0;
}

.header-actions {
  display: flex;
  gap: 8px;
}

/* Action Buttons */
.action-btn {
  display: flex;
  align-items: center;
  gap: 6px;
  padding: 6px 12px;
  background: var(--bg-secondary);
  color: var(--text-secondary);
  border: 1px solid var(--border-color);
  border-radius: 4px;
  font-family: inherit;
  font-size: 13px;
  cursor: pointer;
  transition: all 0.2s;
}

.action-btn:hover {
  background: var(--bg-tertiary);
  color: var(--text-primary);
  border-color: var(--border-hover);
}

.action-btn.primary {
  background: var(--primary-color);
  color: var(--bg-primary);
  border-color: var(--primary-color);
}

.action-btn.primary:hover {
  background: var(--primary-hover);
  border-color: var(--primary-hover);
}

.action-btn .icon {
  font-size: 16px;
}

/* Editor Wrapper */
.editor-wrapper {
  flex: 1;
  display: flex;
  overflow: hidden;
  position: relative;
  z-index: 1; /* Ensure it's above any potential overlays */
}

.editor-wrapper .CodeMirror {
  flex: 1;
  height: 100%;
  font-family: 'Dank Mono', monospace;
  font-size: 14px;
  line-height: 1.5;
}

/* Ensure CodeMirror is interactive */
.CodeMirror {
  background: var(--bg-primary) !important;
}

.CodeMirror-scroll {
  overflow: auto !important;
}

/* Make cursor more visible */
.CodeMirror-cursor {
  border-left: 2px solid var(--primary-color) !important;
  border-left-width: 2px !important;
  opacity: 1 !important;
  visibility: visible !important;
}

/* Ensure cursor is visible when focused */
.CodeMirror-focused .CodeMirror-cursor {
  visibility: visible !important;
}

/* Fix for CodeMirror in flex container */
.CodeMirror-sizer {
  min-height: auto !important;
}

/* Remove aggressive pointer-events override */
.CodeMirror-code {
  cursor: text;
}

.editor-wrapper textarea {
  display: none;
}

/* Output Section (Bottom of Editor) */
.output-section {
  height: 250px;
  border-top: 1px solid var(--border-color);
  display: flex;
  flex-direction: column;
  flex-shrink: 0;
}

/* Tabs */
.tabs {
  display: flex;
  background: var(--bg-tertiary);
  border-bottom: 1px solid var(--border-color);
  flex-shrink: 0;
}

.tab {
  padding: 8px 20px;
  background: transparent;
  color: var(--text-secondary);
  border: none;
  border-bottom: 2px solid transparent;
  font-family: inherit;
  font-size: 13px;
  cursor: pointer;
  transition: all 0.2s;
}

.tab:hover {
  color: var(--text-primary);
  background: var(--bg-secondary);
}

.tab.active {
  color: var(--primary-color);
  border-bottom-color: var(--primary-color);
}

/* Tab Content */
.tab-content {
  flex: 1;
  overflow: hidden;
}

.tab-pane {
  display: none;
  height: 100%;
  overflow-y: auto;
}

.tab-pane.active {
  display: block;
}

/* Console */
.console {
  padding: 12px;
  background: var(--bg-primary);
  font-size: 13px;
  min-height: 100%;
}

.console-line {
  margin-bottom: 8px;
  display: flex;
  align-items: flex-start;
  gap: 8px;
}

.console-prompt {
  color: var(--primary-color);
  flex-shrink: 0;
}

.console-text {
  color: var(--text-primary);
}

.console-error {
  color: var(--error-color);
}

.console-warning {
  color: var(--warning-color);
}

.console-success {
  color: var(--success-color);
}

/* JavaScript Output */
.js-output-header {
  display: flex;
  justify-content: flex-end;
  padding: 8px 12px;
  background: var(--bg-tertiary);
  border-bottom: 1px solid var(--border-color);
}

.js-actions {
  display: flex;
  gap: 8px;
}

.mini-btn {
  padding: 4px 8px;
  background: var(--bg-secondary);
  color: var(--text-secondary);
  border: 1px solid var(--border-color);
  border-radius: 3px;
  font-size: 12px;
  cursor: pointer;
  transition: all 0.2s;
}

.mini-btn:hover {
  background: var(--bg-primary);
  color: var(--text-primary);
}

.js-output {
  padding: 12px;
  background: var(--code-bg);
  color: var(--text-primary);
  font-family: 'Dank Mono', monospace;
  font-size: 13px;
  line-height: 1.5;
  white-space: pre-wrap;
  margin: 0;
  min-height: calc(100% - 44px);
}

/* Execution Progress */
.execution-progress {
  padding: 12px;
  background: var(--bg-primary);
}

.progress-item {
  display: flex;
  align-items: center;
  gap: 8px;
  margin-bottom: 8px;
  font-size: 13px;
}

.progress-icon {
  color: var(--text-muted);
}

.progress-item.active .progress-icon {
  color: var(--info-color);
  animation: pulse 1s infinite;
}

.progress-item.completed .progress-icon {
  color: var(--success-color);
}

.progress-item.error .progress-icon {
  color: var(--error-color);
}

/* Playground */
.playground-wrapper {
  flex: 1;
  overflow: hidden;
}

#playground-frame {
  width: 100%;
  height: 100%;
  border: none;
  background: var(--bg-secondary);
}

/* Tutorial Intro Modal */
.tutorial-intro-modal {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  background: var(--modal-overlay);
  display: flex;
  align-items: center;
  justify-content: center;
  z-index: 2000;
  transition: opacity 0.3s;
}

.tutorial-intro-modal.hidden {
  display: none;
}

.intro-content {
  background: var(--bg-tertiary);
  border: 1px solid var(--border-color);
  border-radius: 8px;
  padding: 32px;
  max-width: 500px;
  box-shadow: 0 16px 48px rgba(0, 0, 0, 0.6);
}

.intro-content h2 {
  color: var(--primary-color);
  margin-bottom: 16px;
  font-size: 24px;
}

.intro-content p {
  color: var(--text-primary);
  margin-bottom: 16px;
  line-height: 1.6;
}

.intro-content ul {
  list-style: none;
  margin-bottom: 24px;
}

.intro-content li {
  color: var(--text-secondary);
  margin-bottom: 8px;
  padding-left: 20px;
  position: relative;
}

.intro-content li:before {
  content: "▸";
  position: absolute;
  left: 0;
  color: var(--primary-color);
}

.intro-actions {
  display: flex;
  gap: 12px;
  justify-content: flex-end;
}

.intro-btn {
  padding: 10px 24px;
  background: var(--bg-secondary);
  color: var(--text-primary);
  border: 1px solid var(--border-color);
  border-radius: 4px;
  font-family: inherit;
  font-size: 14px;
  cursor: pointer;
  transition: all 0.2s;
}

.intro-btn:hover {
  background: var(--bg-primary);
  border-color: var(--border-hover);
}

.intro-btn.primary {
  background: var(--primary-color);
  color: var(--bg-primary);
  border-color: var(--primary-color);
}

.intro-btn.primary:hover {
  background: var(--primary-hover);
  border-color: var(--primary-hover);
}

/* Tutorial Navigation Bar */
.tutorial-nav {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  background: var(--bg-tertiary);
  border-bottom: 1px solid var(--primary-color);
  z-index: 1000;
  transition: transform 0.3s;
}

.tutorial-nav.hidden {
  transform: translateY(-100%);
}

.tutorial-nav-content {
  display: flex;
  align-items: center;
  justify-content: space-between;
  padding: 16px 24px;
}

.tutorial-left {
  flex: 1;
}

.tutorial-step-title {
  display: flex;
  align-items: center;
  gap: 16px;
  margin-bottom: 8px;
}

.tutorial-step-title span:first-child {
  color: var(--text-secondary);
  font-size: 12px;
  text-transform: uppercase;
}

.tutorial-step-title span:last-child {
  color: var(--primary-color);
  font-weight: 600;
  font-size: 16px;
}

.tutorial-description {
  color: var(--text-primary);
  margin: 0;
  font-size: 14px;
  max-width: 600px;
}

.tutorial-right {
  display: flex;
  align-items: center;
}

.tutorial-progress-bar {
  height: 3px;
  background: var(--bg-secondary);
  position: absolute;
  bottom: 0;
  left: 0;
  right: 0;
}

.tutorial-progress-bar .progress-fill {
  height: 100%;
  background: var(--primary-color);
  transition: width 0.3s;
}

/* Adjust app container when tutorial is active */
.app-container.tutorial-active {
  padding-top: 80px;
}

.tutorial-controls {
  display: flex;
  gap: 12px;
}

.nav-btn {
  padding: 8px 16px;
  background: var(--bg-secondary);
  color: var(--text-primary);
  border: 1px solid var(--border-color);
  border-radius: 4px;
  font-family: inherit;
  font-size: 13px;
  cursor: pointer;
  transition: all 0.2s;
}

.nav-btn:hover:not(:disabled) {
  background: var(--bg-primary);
  border-color: var(--border-hover);
}

.nav-btn:disabled {
  opacity: 0.5;
  cursor: not-allowed;
}

.nav-btn.primary {
  background: var(--primary-color);
  color: var(--bg-primary);
  border-color: var(--primary-color);
}

.nav-btn.primary:hover {
  background: var(--primary-hover);
  border-color: var(--primary-hover);
}

.exit-btn {
  width: 32px;
  height: 32px;
  background: transparent;
  color: var(--text-secondary);
  border: none;
  font-size: 20px;
  cursor: pointer;
  border-radius: 4px;
  transition: all 0.2s;
  margin-left: 16px;
}

.exit-btn:hover {
  background: var(--bg-secondary);
  color: var(--text-primary);
}

/* Fullscreen Mode */
.playground-panel.fullscreen {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  z-index: 1500;
}

/* Animations */
@keyframes pulse {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.5; }
}

/* Scrollbar Styling */
::-webkit-scrollbar {
  width: 10px;
  height: 10px;
}

::-webkit-scrollbar-track {
  background: var(--bg-secondary);
}

::-webkit-scrollbar-thumb {
  background: var(--border-color);
  border-radius: 5px;
}

::-webkit-scrollbar-thumb:hover {
  background: var(--border-hover);
}

/* Responsive */
@media (max-width: 768px) {
  .app-container {
    flex-direction: column;
  }

  .editor-panel,
  .playground-panel {
    min-width: auto;
    width: 100%;
  }

  .editor-panel {
    border-right: none;
    border-bottom: 1px solid var(--border-color);
  }

  .output-section {
    height: 200px;
  }
}

/* ================================================================
   Recording Timeline Styles
   ================================================================ */

.action-btn.record {
  background: var(--bg-tertiary);
  border-color: var(--error-color);
}

.action-btn.record:hover {
  background: var(--error-color);
  border-color: var(--error-color);
}

.action-btn.record.recording {
  background: var(--error-color);
  animation: pulse-record 1.5s infinite;
}

.action-btn.record.recording .icon {
  animation: blink 1s infinite;
}

/* Renamed from `pulse` so this definition does not silently override
   the `pulse` keyframes declared earlier in this file */
@keyframes pulse-record {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.8; }
}

@keyframes blink {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.3; }
}

.editor-container {
  flex: 1;
  display: flex;
  flex-direction: column;
  overflow: hidden;
}

#editor-view,
#timeline-view {
  flex: 1;
  display: flex;
  flex-direction: column;
  overflow: hidden;
}

.recording-timeline {
  background: var(--bg-secondary);
  display: flex;
  flex-direction: column;
  height: 100%;
}

.timeline-header {
  display: flex;
  justify-content: space-between;
  align-items: center;
  padding: 10px 15px;
  border-bottom: 1px solid var(--border-color);
  background: var(--bg-tertiary);
}

.timeline-header h3 {
  font-size: 14px;
  font-weight: 600;
  color: var(--text-primary);
  margin: 0;
}

.timeline-actions {
  display: flex;
  gap: 8px;
}

.timeline-events {
  flex: 1;
  overflow-y: auto;
  padding: 10px;
}

.timeline-event {
  display: flex;
  align-items: center;
  padding: 8px 10px;
  margin-bottom: 6px;
  background: var(--bg-tertiary);
  border: 1px solid var(--border-color);
  border-radius: 4px;
  transition: all 0.2s;
  cursor: pointer;
}

.timeline-event:hover {
  border-color: var(--border-hover);
  background: var(--code-bg);
}

.timeline-event.selected {
  border-color: var(--primary-color);
  background: rgba(15, 187, 170, 0.1);
}

.event-checkbox {
  margin-right: 10px;
  width: 16px;
  height: 16px;
  cursor: pointer;
}

.event-time {
  font-size: 11px;
  color: var(--text-muted);
  margin-right: 10px;
  font-family: 'Dank Mono', monospace;
  min-width: 45px;
}

.event-command {
  flex: 1;
  font-family: 'Dank Mono', monospace;
  font-size: 13px;
  color: var(--text-primary);
}

.event-command .cmd-name {
  color: var(--primary-color);
  font-weight: 600;
}

.event-command .cmd-selector {
  color: var(--info-color);
}

.event-command .cmd-value {
  color: var(--warning-color);
}

.event-command .cmd-detail {
  color: var(--text-secondary);
  font-size: 11px;
  margin-left: 5px;
}

.event-edit {
  margin-left: 10px;
  padding: 2px 8px;
  font-size: 11px;
  background: var(--bg-secondary);
  border: 1px solid var(--border-color);
  color: var(--text-secondary);
  cursor: pointer;
  border-radius: 3px;
  transition: all 0.2s;
}

.event-edit:hover {
  border-color: var(--primary-color);
  color: var(--primary-color);
}

/* Event Editor Modal */
.modal-overlay {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  background: var(--modal-overlay);
  z-index: 999;
}

.event-editor-modal {
  position: fixed;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
  background: var(--bg-secondary);
  border: 1px solid var(--border-color);
  border-radius: 8px;
  padding: 20px;
  z-index: 1000;
  min-width: 400px;
}

.event-editor-modal h4 {
  margin: 0 0 15px 0;
  color: var(--text-primary);
  font-family: 'Dank Mono', monospace;
}

.editor-field {
  margin-bottom: 15px;
}

.editor-field label {
  display: block;
  margin-bottom: 5px;
  font-size: 12px;
  color: var(--text-secondary);
  font-family: 'Dank Mono', monospace;
}

.editor-field input,
.editor-field select {
  width: 100%;
  padding: 8px;
  background: var(--bg-tertiary);
  border: 1px solid var(--border-color);
  color: var(--text-primary);
  border-radius: 4px;
  font-family: 'Dank Mono', monospace;
  font-size: 13px;
}

.editor-field input:focus,
.editor-field select:focus {
  outline: none;
  border-color: var(--primary-color);
}

.editor-actions {
  display: flex;
  justify-content: flex-end;
  gap: 10px;
  margin-top: 20px;
}

/* Blockly Button */
#blockly-btn .icon {
  font-size: 16px;
}

/* Hidden State */
.hidden {
  display: none !important;
}
1485 docs/md_v2/apps/c4a-script/assets/app.js (new file)
591 docs/md_v2/apps/c4a-script/assets/blockly-manager.js (new file)
@@ -0,0 +1,591 @@
|
||||
// Blockly Manager for C4A-Script
|
||||
// Handles Blockly workspace, code generation, and synchronization with text editor
|
||||
|
||||
class BlocklyManager {
|
||||
constructor(tutorialApp) {
|
||||
this.app = tutorialApp;
|
||||
this.workspace = null;
|
||||
this.isUpdating = false; // Prevent circular updates
|
||||
this.blocklyVisible = false;
|
||||
this.toolboxXml = this.generateToolbox();
|
||||
|
||||
this.init();
|
||||
}
|
||||
|
||||
init() {
|
||||
this.setupBlocklyContainer();
|
||||
this.initializeWorkspace();
|
||||
this.setupEventHandlers();
|
||||
this.setupSynchronization();
|
||||
}
|
||||
|
||||
setupBlocklyContainer() {
|
||||
// Create blockly container div
|
||||
const editorContainer = document.querySelector('.editor-container');
|
||||
const blocklyDiv = document.createElement('div');
|
||||
blocklyDiv.id = 'blockly-view';
|
||||
blocklyDiv.className = 'blockly-workspace hidden';
|
||||
blocklyDiv.style.height = '100%';
|
||||
blocklyDiv.style.width = '100%';
|
||||
editorContainer.appendChild(blocklyDiv);
|
||||
}
|
||||
|
||||
generateToolbox() {
|
||||
return `
|
||||
<xml id="toolbox" style="display: none">
|
||||
<category name="Navigation" colour="${BlockColors.NAVIGATION}">
|
||||
<block type="c4a_go"></block>
|
||||
<block type="c4a_reload"></block>
|
||||
<block type="c4a_back"></block>
|
||||
<block type="c4a_forward"></block>
|
||||
</category>
|
||||
|
||||
<category name="Wait" colour="${BlockColors.WAIT}">
|
||||
<block type="c4a_wait_time">
|
||||
<field name="SECONDS">3</field>
|
||||
</block>
|
||||
<block type="c4a_wait_selector">
|
||||
<field name="SELECTOR">#content</field>
|
||||
<field name="TIMEOUT">10</field>
|
||||
</block>
|
||||
<block type="c4a_wait_text">
|
||||
<field name="TEXT">Loading complete</field>
|
||||
<field name="TIMEOUT">5</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Mouse Actions" colour="${BlockColors.ACTIONS}">
|
||||
<block type="c4a_click">
|
||||
<field name="SELECTOR">button.submit</field>
|
||||
</block>
|
||||
<block type="c4a_click_xy"></block>
|
||||
<block type="c4a_double_click"></block>
|
||||
<block type="c4a_right_click"></block>
|
||||
<block type="c4a_move"></block>
|
||||
<block type="c4a_drag"></block>
|
||||
<block type="c4a_scroll">
|
||||
<field name="DIRECTION">DOWN</field>
|
||||
<field name="AMOUNT">500</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Keyboard" colour="${BlockColors.KEYBOARD}">
|
||||
<block type="c4a_type">
|
||||
<field name="TEXT">hello@example.com</field>
|
||||
</block>
|
||||
<block type="c4a_type_var">
|
||||
<field name="VAR">email</field>
|
||||
</block>
|
||||
<block type="c4a_clear"></block>
|
||||
<block type="c4a_set">
|
||||
<field name="SELECTOR">#email</field>
|
||||
<field name="VALUE">user@example.com</field>
|
||||
</block>
|
||||
<block type="c4a_press">
|
||||
<field name="KEY">Tab</field>
|
||||
</block>
|
||||
<block type="c4a_key_down">
|
||||
<field name="KEY">Shift</field>
|
||||
</block>
|
||||
<block type="c4a_key_up">
|
||||
<field name="KEY">Shift</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Control Flow" colour="${BlockColors.CONTROL}">
|
||||
<block type="c4a_if_exists">
|
||||
<field name="SELECTOR">.cookie-banner</field>
|
||||
</block>
|
||||
<block type="c4a_if_exists_else">
|
||||
<field name="SELECTOR">#user</field>
|
||||
</block>
|
||||
<block type="c4a_if_not_exists">
|
||||
<field name="SELECTOR">.modal</field>
|
||||
</block>
|
||||
<block type="c4a_if_js">
|
||||
<field name="CONDITION">window.innerWidth < 768</field>
|
||||
</block>
|
||||
<block type="c4a_repeat_times">
|
||||
<field name="TIMES">5</field>
|
||||
</block>
|
||||
<block type="c4a_repeat_while">
|
||||
<field name="CONDITION">document.querySelector('.load-more')</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Variables" colour="${BlockColors.VARIABLES}">
|
||||
<block type="c4a_setvar">
|
||||
<field name="NAME">username</field>
|
||||
<field name="VALUE">john@example.com</field>
|
||||
</block>
|
||||
<block type="c4a_eval">
|
||||
<field name="CODE">console.log('Hello')</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Procedures" colour="${BlockColors.PROCEDURES}">
|
||||
<block type="c4a_proc_def">
|
||||
<field name="NAME">login</field>
|
||||
</block>
|
||||
<block type="c4a_proc_call">
|
||||
<field name="NAME">login</field>
|
||||
</block>
|
||||
</category>
|
||||
|
||||
<category name="Comments" colour="#9E9E9E">
|
||||
<block type="c4a_comment">
|
||||
<field name="TEXT">Add comment here</field>
|
||||
</block>
|
||||
</category>
|
||||
</xml>`;
|
||||
}
|
||||
|
||||
  initializeWorkspace() {
    const blocklyDiv = document.getElementById('blockly-view');

    // Dark theme configuration
    const theme = Blockly.Theme.defineTheme('c4a-dark', {
      'base': Blockly.Themes.Classic,
      'componentStyles': {
        'workspaceBackgroundColour': '#0e0e10',
        'toolboxBackgroundColour': '#1a1a1b',
        'toolboxForegroundColour': '#e0e0e0',
        'flyoutBackgroundColour': '#1a1a1b',
        'flyoutForegroundColour': '#e0e0e0',
        'flyoutOpacity': 0.9,
        'scrollbarColour': '#2a2a2c',
        'scrollbarOpacity': 0.5,
        'insertionMarkerColour': '#0fbbaa',
        'insertionMarkerOpacity': 0.3,
        'markerColour': '#0fbbaa',
        'cursorColour': '#0fbbaa',
        'selectedGlowColour': '#0fbbaa',
        'selectedGlowOpacity': 0.4,
        'replacementGlowColour': '#0fbbaa',
        'replacementGlowOpacity': 0.5
      },
      'fontStyle': {
        'family': 'Dank Mono, Monaco, Consolas, monospace',
        'weight': 'normal',
        'size': 13
      }
    });

    this.workspace = Blockly.inject(blocklyDiv, {
      toolbox: this.toolboxXml,
      theme: theme,
      grid: {
        spacing: 20,
        length: 3,
        colour: '#2a2a2c',
        snap: true
      },
      zoom: {
        controls: true,
        wheel: true,
        startScale: 1.0,
        maxScale: 3,
        minScale: 0.3,
        scaleSpeed: 1.2
      },
      trashcan: true,
      sounds: false,
      media: 'https://unpkg.com/blockly/media/'
    });

    // Add workspace change listener
    this.workspace.addChangeListener((event) => {
      if (!this.isUpdating && event.type !== Blockly.Events.UI) {
        this.syncBlocksToCode();
      }
    });
  }
  setupEventHandlers() {
    // Add blockly toggle button
    const headerActions = document.querySelector('.editor-panel .header-actions');
    const blocklyBtn = document.createElement('button');
    blocklyBtn.id = 'blockly-btn';
    blocklyBtn.className = 'action-btn';
    blocklyBtn.title = 'Toggle Blockly Mode';
    blocklyBtn.innerHTML = '<span class="icon">🧩</span>';

    // Insert before the Run button
    const runBtn = document.getElementById('run-btn');
    headerActions.insertBefore(blocklyBtn, runBtn);

    blocklyBtn.addEventListener('click', () => this.toggleBlocklyView());
  }

  setupSynchronization() {
    // Listen to CodeMirror changes
    this.app.editor.on('change', (instance, changeObj) => {
      if (!this.isUpdating && this.blocklyVisible && changeObj.origin !== 'setValue') {
        this.syncCodeToBlocks();
      }
    });
  }
  toggleBlocklyView() {
    const editorView = document.getElementById('editor-view');
    const blocklyView = document.getElementById('blockly-view');
    const timelineView = document.getElementById('timeline-view');
    const blocklyBtn = document.getElementById('blockly-btn');

    this.blocklyVisible = !this.blocklyVisible;

    if (this.blocklyVisible) {
      // Show Blockly
      editorView.classList.add('hidden');
      timelineView.classList.add('hidden');
      blocklyView.classList.remove('hidden');
      blocklyBtn.classList.add('active');

      // Resize workspace
      Blockly.svgResize(this.workspace);

      // Sync current code to blocks
      this.syncCodeToBlocks();
    } else {
      // Show editor
      blocklyView.classList.add('hidden');
      editorView.classList.remove('hidden');
      blocklyBtn.classList.remove('active');

      // Refresh CodeMirror
      setTimeout(() => this.app.editor.refresh(), 100);
    }
  }
  syncBlocksToCode() {
    if (this.isUpdating) return;

    try {
      this.isUpdating = true;

      // Generate C4A-Script from blocks using our custom generator
      if (typeof c4aGenerator !== 'undefined') {
        const code = c4aGenerator.workspaceToCode(this.workspace);

        // Process the code to maintain proper formatting
        const lines = code.split('\n');
        const formattedLines = [];
        let lastWasComment = false;

        for (let i = 0; i < lines.length; i++) {
          const line = lines[i].trim();
          if (!line) continue;

          const isComment = line.startsWith('#');

          // Add blank line when transitioning between comments and commands
          if (formattedLines.length > 0 && lastWasComment !== isComment) {
            formattedLines.push('');
          }

          formattedLines.push(line);
          lastWasComment = isComment;
        }

        const cleanCode = formattedLines.join('\n');

        // Update CodeMirror
        this.app.editor.setValue(cleanCode);
      }

    } catch (error) {
      console.error('Error syncing blocks to code:', error);
    } finally {
      this.isUpdating = false;
    }
  }
  syncCodeToBlocks() {
    if (this.isUpdating) return;

    try {
      this.isUpdating = true;

      // Clear workspace
      this.workspace.clear();

      // Parse C4A-Script and generate blocks
      const code = this.app.editor.getValue();
      const blocks = this.parseC4AToBlocks(code);

      if (blocks) {
        Blockly.Xml.domToWorkspace(blocks, this.workspace);
      }

    } catch (error) {
      console.error('Error syncing code to blocks:', error);
      // Show error in console
      this.app.addConsoleMessage(`Blockly sync error: ${error.message}`, 'warning');
    } finally {
      this.isUpdating = false;
    }
  }
  parseC4AToBlocks(code) {
    const lines = code.split('\n');
    const xml = document.createElement('xml');
    let yPos = 20;
    let previousBlock = null;
    let rootBlock = null;

    for (let i = 0; i < lines.length; i++) {
      const line = lines[i].trim();

      // Skip empty lines
      if (!line) continue;

      // Handle comments
      if (line.startsWith('#')) {
        const commentBlock = this.parseLineToBlock(line, i, lines);
        if (commentBlock) {
          if (previousBlock) {
            // Connect to previous block
            const next = document.createElement('next');
            next.appendChild(commentBlock);
            previousBlock.appendChild(next);
          } else {
            // First block - set position
            commentBlock.setAttribute('x', 20);
            commentBlock.setAttribute('y', yPos);
            xml.appendChild(commentBlock);
            rootBlock = commentBlock;
            yPos += 60;
          }
          previousBlock = commentBlock;
        }
        continue;
      }

      const block = this.parseLineToBlock(line, i, lines);

      if (block) {
        if (previousBlock) {
          // Connect to previous block using <next>
          const next = document.createElement('next');
          next.appendChild(block);
          previousBlock.appendChild(next);
        } else {
          // First block - set position
          block.setAttribute('x', 20);
          block.setAttribute('y', yPos);
          xml.appendChild(block);
          rootBlock = block;
          yPos += 60;
        }
        previousBlock = block;
      }
    }

    return xml;
  }
  parseLineToBlock(line, index, allLines) {
    // Navigation commands
    if (line.startsWith('GO ')) {
      const url = line.substring(3).trim();
      return this.createBlock('c4a_go', { 'URL': url });
    }
    if (line === 'RELOAD') {
      return this.createBlock('c4a_reload');
    }
    if (line === 'BACK') {
      return this.createBlock('c4a_back');
    }
    if (line === 'FORWARD') {
      return this.createBlock('c4a_forward');
    }

    // Wait commands
    if (line.startsWith('WAIT ')) {
      const parts = line.substring(5).trim();

      // Check if it's just a number (wait time)
      if (/^\d+(\.\d+)?$/.test(parts)) {
        return this.createBlock('c4a_wait_time', { 'SECONDS': parts });
      }

      // Check for selector wait
      const selectorMatch = parts.match(/^`([^`]+)`\s+(\d+)$/);
      if (selectorMatch) {
        return this.createBlock('c4a_wait_selector', {
          'SELECTOR': selectorMatch[1],
          'TIMEOUT': selectorMatch[2]
        });
      }

      // Check for text wait
      const textMatch = parts.match(/^"([^"]+)"\s+(\d+)$/);
      if (textMatch) {
        return this.createBlock('c4a_wait_text', {
          'TEXT': textMatch[1],
          'TIMEOUT': textMatch[2]
        });
      }
    }

    // Click commands
    if (line.startsWith('CLICK ')) {
      const target = line.substring(6).trim();

      // Check for coordinates
      const coordMatch = target.match(/^(\d+)\s+(\d+)$/);
      if (coordMatch) {
        return this.createBlock('c4a_click_xy', {
          'X': coordMatch[1],
          'Y': coordMatch[2]
        });
      }

      // Selector click
      const selectorMatch = target.match(/^`([^`]+)`$/);
      if (selectorMatch) {
        return this.createBlock('c4a_click', {
          'SELECTOR': selectorMatch[1]
        });
      }
    }

    // Other mouse actions
    if (line.startsWith('DOUBLE_CLICK ')) {
      const selector = line.substring(13).trim().match(/^`([^`]+)`$/);
      if (selector) {
        return this.createBlock('c4a_double_click', {
          'SELECTOR': selector[1]
        });
      }
    }

    if (line.startsWith('RIGHT_CLICK ')) {
      const selector = line.substring(12).trim().match(/^`([^`]+)`$/);
      if (selector) {
        return this.createBlock('c4a_right_click', {
          'SELECTOR': selector[1]
        });
      }
    }

    // Scroll
    if (line.startsWith('SCROLL ')) {
      const match = line.match(/^SCROLL\s+(UP|DOWN|LEFT|RIGHT)(?:\s+(\d+))?$/);
      if (match) {
        return this.createBlock('c4a_scroll', {
          'DIRECTION': match[1],
          'AMOUNT': match[2] || '500'
        });
      }
    }

    // Type commands
    if (line.startsWith('TYPE ')) {
      const content = line.substring(5).trim();

      // Variable type
      if (content.startsWith('$')) {
        return this.createBlock('c4a_type_var', {
          'VAR': content.substring(1)
        });
      }

      // Text type
      const textMatch = content.match(/^"([^"]*)"$/);
      if (textMatch) {
        return this.createBlock('c4a_type', {
          'TEXT': textMatch[1]
        });
      }
    }

    // SET command
    if (line.startsWith('SET ')) {
      const match = line.match(/^SET\s+`([^`]+)`\s+"([^"]*)"$/);
      if (match) {
        return this.createBlock('c4a_set', {
          'SELECTOR': match[1],
          'VALUE': match[2]
        });
      }
    }

    // CLEAR command
    if (line.startsWith('CLEAR ')) {
      const match = line.match(/^CLEAR\s+`([^`]+)`$/);
      if (match) {
        return this.createBlock('c4a_clear', {
          'SELECTOR': match[1]
        });
      }
    }

    // SETVAR command
    if (line.startsWith('SETVAR ')) {
      const match = line.match(/^SETVAR\s+(\w+)\s*=\s*"([^"]*)"$/);
      if (match) {
        return this.createBlock('c4a_setvar', {
          'NAME': match[1],
          'VALUE': match[2]
        });
      }
    }

    // IF commands (simplified - only single line)
    if (line.startsWith('IF ')) {
      // IF EXISTS
      const existsMatch = line.match(/^IF\s+\(EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+?)(?:\s+ELSE\s+(.+))?$/);
      if (existsMatch) {
        if (existsMatch[3]) {
          // Has ELSE
          const block = this.createBlock('c4a_if_exists_else', {
            'SELECTOR': existsMatch[1]
          });
          // Parse then and else commands - simplified for now
          return block;
        } else {
          // No ELSE
          const block = this.createBlock('c4a_if_exists', {
            'SELECTOR': existsMatch[1]
          });
          return block;
        }
      }

      // IF NOT EXISTS
      const notExistsMatch = line.match(/^IF\s+\(NOT\s+EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+)$/);
      if (notExistsMatch) {
        const block = this.createBlock('c4a_if_not_exists', {
          'SELECTOR': notExistsMatch[1]
        });
        return block;
      }
    }

    // Comments
    if (line.startsWith('#')) {
      return this.createBlock('c4a_comment', {
        'TEXT': line.substring(1).trim()
      });
    }

    // If we can't parse it, return null
    return null;
  }
  createBlock(type, fields = {}) {
    const block = document.createElement('block');
    block.setAttribute('type', type);

    // Add fields
    for (const [name, value] of Object.entries(fields)) {
      const field = document.createElement('field');
      field.setAttribute('name', name);
      field.textContent = value;
      block.appendChild(field);
    }

    return block;
  }
}
238
docs/md_v2/apps/c4a-script/assets/blockly-theme.css
Normal file
@@ -0,0 +1,238 @@
/* Blockly Theme CSS for C4A-Script */

/* Blockly workspace container */
.blockly-workspace {
    position: relative;
    width: 100%;
    height: 100%;
    background: var(--bg-primary);
}

/* Blockly button active state */
#blockly-btn.active {
    background: var(--primary-color);
    color: var(--bg-primary);
    border-color: var(--primary-color);
}

#blockly-btn.active:hover {
    background: var(--primary-hover);
    border-color: var(--primary-hover);
}

/* Override Blockly's default styles for dark theme */
.blocklyToolboxDiv {
    background-color: var(--bg-tertiary) !important;
    border-right: 1px solid var(--border-color) !important;
}

.blocklyFlyout {
    background-color: var(--bg-secondary) !important;
}

.blocklyFlyoutBackground {
    fill: var(--bg-secondary) !important;
}

.blocklyMainBackground {
    stroke: none !important;
}

.blocklyTreeRow {
    color: var(--text-primary) !important;
    font-family: 'Dank Mono', monospace !important;
    padding: 4px 16px !important;
    margin: 2px 0 !important;
}

.blocklyTreeRow:hover {
    background-color: var(--bg-secondary) !important;
}

.blocklyTreeSelected {
    background-color: var(--primary-dim) !important;
}

.blocklyTreeLabel {
    cursor: pointer;
}

/* Blockly scrollbars */
.blocklyScrollbarHorizontal,
.blocklyScrollbarVertical {
    background-color: transparent !important;
}

.blocklyScrollbarHandle {
    fill: var(--border-color) !important;
    opacity: 0.5 !important;
}

.blocklyScrollbarHandle:hover {
    fill: var(--border-hover) !important;
    opacity: 0.8 !important;
}

/* Blockly zoom controls */
.blocklyZoom > image {
    opacity: 0.6;
}

.blocklyZoom > image:hover {
    opacity: 1;
}

/* Blockly trash can */
.blocklyTrash {
    opacity: 0.6;
}

.blocklyTrash:hover {
    opacity: 1;
}

/* Blockly context menus */
.blocklyContextMenu {
    background-color: var(--bg-tertiary) !important;
    border: 1px solid var(--border-color) !important;
    box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}

.blocklyMenuItem {
    color: var(--text-primary) !important;
    font-family: 'Dank Mono', monospace !important;
}

.blocklyMenuItemDisabled {
    color: var(--text-muted) !important;
}

.blocklyMenuItem:hover {
    background-color: var(--bg-secondary) !important;
}

/* Blockly text inputs */
.blocklyHtmlInput {
    background-color: var(--bg-tertiary) !important;
    color: var(--text-primary) !important;
    border: 1px solid var(--border-color) !important;
    font-family: 'Dank Mono', monospace !important;
    font-size: 13px !important;
    padding: 4px 8px !important;
}

.blocklyHtmlInput:focus {
    border-color: var(--primary-color) !important;
    outline: none !important;
}

/* Blockly dropdowns */
.blocklyDropDownDiv {
    background-color: var(--bg-tertiary) !important;
    border: 1px solid var(--border-color) !important;
    box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}

.blocklyDropDownContent {
    color: var(--text-primary) !important;
}

.blocklyDropDownDiv .goog-menuitem {
    color: var(--text-primary) !important;
    font-family: 'Dank Mono', monospace !important;
    padding: 4px 16px !important;
}

.blocklyDropDownDiv .goog-menuitem-highlight,
.blocklyDropDownDiv .goog-menuitem-hover {
    background-color: var(--bg-secondary) !important;
}

/* Custom block colors are defined in the block definitions */

/* Block text styling */
.blocklyText {
    fill: #ffffff !important;
    font-family: 'Dank Mono', monospace !important;
    font-size: 13px !important;
}

.blocklyEditableText > .blocklyText {
    fill: #ffffff !important;
}

.blocklyEditableText:hover > rect {
    stroke: var(--primary-color) !important;
    stroke-width: 2px !important;
}

/* Improve visibility of connection highlights */
.blocklyHighlightedConnectionPath {
    stroke: var(--primary-color) !important;
    stroke-width: 4px !important;
}

.blocklyInsertionMarker > .blocklyPath {
    fill-opacity: 0.3 !important;
    stroke-opacity: 0.6 !important;
}

/* Workspace grid pattern */
.blocklyWorkspace > .blocklyBlockCanvas > .blocklyGridCanvas {
    opacity: 0.1;
}

/* Smooth transitions */
.blocklyDraggable {
    transition: transform 0.1s ease;
}

/* Field labels */
.blocklyFieldLabel {
    font-weight: normal !important;
}

/* Comment blocks styling */
.blocklyCommentText {
    font-style: italic !important;
}

/* Make comment blocks slightly transparent */
g[data-category="Comments"] .blocklyPath {
    fill-opacity: 0.8 !important;
}

/* Better visibility for disabled blocks */
.blocklyDisabled > .blocklyPath {
    fill-opacity: 0.3 !important;
}

.blocklyDisabled > .blocklyText {
    fill-opacity: 0.5 !important;
}

/* Warning and error text */
.blocklyWarningText,
.blocklyErrorText {
    font-family: 'Dank Mono', monospace !important;
    font-size: 12px !important;
}

/* Workspace scrollbar improvement for dark theme */
::-webkit-scrollbar {
    width: 10px;
    height: 10px;
}

::-webkit-scrollbar-track {
    background: var(--bg-secondary);
}

::-webkit-scrollbar-thumb {
    background: var(--border-color);
    border-radius: 5px;
}

::-webkit-scrollbar-thumb:hover {
    background: var(--border-hover);
}
549
docs/md_v2/apps/c4a-script/assets/c4a-blocks.js
Normal file
@@ -0,0 +1,549 @@
// C4A-Script Blockly Block Definitions
// This file defines all custom blocks for C4A-Script commands

// Color scheme for different block categories
const BlockColors = {
  NAVIGATION: '#1E88E5',   // Blue
  ACTIONS: '#43A047',      // Green
  CONTROL: '#FB8C00',      // Orange
  VARIABLES: '#8E24AA',    // Purple
  WAIT: '#E53935',         // Red
  KEYBOARD: '#00ACC1',     // Cyan
  PROCEDURES: '#6A1B9A'    // Deep Purple
};

// Helper to create selector input with backticks
Blockly.Blocks['c4a_selector_input'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
      .appendField("`");
    this.setOutput(true, "Selector");
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("CSS selector for element");
  }
};
// ============================================
// NAVIGATION BLOCKS
// ============================================

Blockly.Blocks['c4a_go'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("GO")
      .appendField(new Blockly.FieldTextInput("https://example.com"), "URL");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.NAVIGATION);
    this.setTooltip("Navigate to URL");
  }
};

Blockly.Blocks['c4a_reload'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("RELOAD");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.NAVIGATION);
    this.setTooltip("Reload current page");
  }
};

Blockly.Blocks['c4a_back'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("BACK");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.NAVIGATION);
    this.setTooltip("Go back in browser history");
  }
};

Blockly.Blocks['c4a_forward'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("FORWARD");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.NAVIGATION);
    this.setTooltip("Go forward in browser history");
  }
};

// ============================================
// WAIT BLOCKS
// ============================================
Blockly.Blocks['c4a_wait_time'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("WAIT")
      .appendField(new Blockly.FieldNumber(1, 0), "SECONDS")
      .appendField("seconds");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.WAIT);
    this.setTooltip("Wait for specified seconds");
  }
};

Blockly.Blocks['c4a_wait_selector'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("WAIT for")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
      .appendField("`")
      .appendField("max")
      .appendField(new Blockly.FieldNumber(10, 1), "TIMEOUT")
      .appendField("sec");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.WAIT);
    this.setTooltip("Wait for element to appear");
  }
};

Blockly.Blocks['c4a_wait_text'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("WAIT for text")
      .appendField(new Blockly.FieldTextInput("Loading complete"), "TEXT")
      .appendField("max")
      .appendField(new Blockly.FieldNumber(5, 1), "TIMEOUT")
      .appendField("sec");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.WAIT);
    this.setTooltip("Wait for text to appear on page");
  }
};

// ============================================
// MOUSE ACTION BLOCKS
// ============================================
Blockly.Blocks['c4a_click'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("CLICK")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("button"), "SELECTOR")
      .appendField("`");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Click on element");
  }
};

Blockly.Blocks['c4a_click_xy'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("CLICK at")
      .appendField("X:")
      .appendField(new Blockly.FieldNumber(100, 0), "X")
      .appendField("Y:")
      .appendField(new Blockly.FieldNumber(100, 0), "Y");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Click at coordinates");
  }
};

Blockly.Blocks['c4a_double_click'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("DOUBLE_CLICK")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput(".item"), "SELECTOR")
      .appendField("`");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Double click on element");
  }
};

Blockly.Blocks['c4a_right_click'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("RIGHT_CLICK")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("#menu"), "SELECTOR")
      .appendField("`");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Right click on element");
  }
};

Blockly.Blocks['c4a_move'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("MOVE to")
      .appendField("X:")
      .appendField(new Blockly.FieldNumber(500, 0), "X")
      .appendField("Y:")
      .appendField(new Blockly.FieldNumber(300, 0), "Y");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Move mouse to position");
  }
};

Blockly.Blocks['c4a_drag'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("DRAG from")
      .appendField("X:")
      .appendField(new Blockly.FieldNumber(100, 0), "X1")
      .appendField("Y:")
      .appendField(new Blockly.FieldNumber(100, 0), "Y1");
    this.appendDummyInput()
      .appendField("to")
      .appendField("X:")
      .appendField(new Blockly.FieldNumber(500, 0), "X2")
      .appendField("Y:")
      .appendField(new Blockly.FieldNumber(300, 0), "Y2");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Drag from one point to another");
  }
};

Blockly.Blocks['c4a_scroll'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("SCROLL")
      .appendField(new Blockly.FieldDropdown([
        ["DOWN", "DOWN"],
        ["UP", "UP"],
        ["LEFT", "LEFT"],
        ["RIGHT", "RIGHT"]
      ]), "DIRECTION")
      .appendField(new Blockly.FieldNumber(500, 0), "AMOUNT")
      .appendField("pixels");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.ACTIONS);
    this.setTooltip("Scroll in direction");
  }
};

// ============================================
// KEYBOARD BLOCKS
// ============================================
Blockly.Blocks['c4a_type'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("TYPE")
      .appendField(new Blockly.FieldTextInput("text to type"), "TEXT");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Type text");
  }
};

Blockly.Blocks['c4a_type_var'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("TYPE")
      .appendField("$")
      .appendField(new Blockly.FieldTextInput("variable"), "VAR");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Type variable value");
  }
};

Blockly.Blocks['c4a_clear'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("CLEAR")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("input"), "SELECTOR")
      .appendField("`");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Clear input field");
  }
};

Blockly.Blocks['c4a_set'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("SET")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("#input"), "SELECTOR")
      .appendField("`")
      .appendField("to")
      .appendField(new Blockly.FieldTextInput("value"), "VALUE");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Set input field value");
  }
};

Blockly.Blocks['c4a_press'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("PRESS")
      .appendField(new Blockly.FieldDropdown([
        ["Tab", "Tab"],
        ["Enter", "Enter"],
        ["Escape", "Escape"],
        ["Space", "Space"],
        ["ArrowUp", "ArrowUp"],
        ["ArrowDown", "ArrowDown"],
        ["ArrowLeft", "ArrowLeft"],
        ["ArrowRight", "ArrowRight"],
        ["Delete", "Delete"],
        ["Backspace", "Backspace"]
      ]), "KEY");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Press and release key");
  }
};

Blockly.Blocks['c4a_key_down'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("KEY_DOWN")
      .appendField(new Blockly.FieldDropdown([
        ["Shift", "Shift"],
        ["Control", "Control"],
        ["Alt", "Alt"],
        ["Meta", "Meta"]
      ]), "KEY");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Hold key down");
  }
};

Blockly.Blocks['c4a_key_up'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("KEY_UP")
      .appendField(new Blockly.FieldDropdown([
        ["Shift", "Shift"],
        ["Control", "Control"],
        ["Alt", "Alt"],
        ["Meta", "Meta"]
      ]), "KEY");
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.KEYBOARD);
    this.setTooltip("Release key");
  }
};

// ============================================
// CONTROL FLOW BLOCKS
// ============================================
Blockly.Blocks['c4a_if_exists'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("IF EXISTS")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
      .appendField("`")
      .appendField("THEN");
    this.appendStatementInput("THEN")
      .setCheck(null);
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.CONTROL);
    this.setTooltip("If element exists, then do something");
  }
};

Blockly.Blocks['c4a_if_exists_else'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("IF EXISTS")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
      .appendField("`")
      .appendField("THEN");
    this.appendStatementInput("THEN")
      .setCheck(null);
    this.appendDummyInput()
      .appendField("ELSE");
    this.appendStatementInput("ELSE")
      .setCheck(null);
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.CONTROL);
    this.setTooltip("If element exists, then do something, else do something else");
  }
};

Blockly.Blocks['c4a_if_not_exists'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("IF NOT EXISTS")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
      .appendField("`")
      .appendField("THEN");
    this.appendStatementInput("THEN")
      .setCheck(null);
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.CONTROL);
    this.setTooltip("If element does not exist, then do something");
  }
};

Blockly.Blocks['c4a_if_js'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("IF")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("window.innerWidth < 768"), "CONDITION")
      .appendField("`")
      .appendField("THEN");
    this.appendStatementInput("THEN")
      .setCheck(null);
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.CONTROL);
    this.setTooltip("If JavaScript condition is true");
  }
};

Blockly.Blocks['c4a_repeat_times'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("REPEAT")
      .appendField(new Blockly.FieldNumber(5, 1), "TIMES")
      .appendField("times");
    this.appendStatementInput("DO")
      .setCheck(null);
    this.setPreviousStatement(true, null);
    this.setNextStatement(true, null);
    this.setColour(BlockColors.CONTROL);
    this.setTooltip("Repeat commands N times");
  }
};

Blockly.Blocks['c4a_repeat_while'] = {
  init: function() {
    this.appendDummyInput()
      .appendField("REPEAT WHILE")
      .appendField("`")
      .appendField(new Blockly.FieldTextInput("document.querySelector('.load-more')"), "CONDITION")
      .appendField("`");
    this.appendStatementInput("DO")
      .setCheck(null);
    this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour(BlockColors.CONTROL);
|
||||
this.setTooltip("Repeat while condition is true");
|
||||
}
|
||||
};
|
||||
|
||||
// ============================================
|
||||
// VARIABLE BLOCKS
|
||||
// ============================================
|
||||
|
||||
Blockly.Blocks['c4a_setvar'] = {
|
||||
init: function() {
|
||||
this.appendDummyInput()
|
||||
.appendField("SETVAR")
|
||||
.appendField(new Blockly.FieldTextInput("username"), "NAME")
|
||||
.appendField("=")
|
||||
.appendField(new Blockly.FieldTextInput("value"), "VALUE");
|
||||
this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour(BlockColors.VARIABLES);
|
||||
this.setTooltip("Set variable value");
|
||||
}
|
||||
};
|
||||
|
||||
// ============================================
|
||||
// ADVANCED BLOCKS
|
||||
// ============================================
|
||||
|
||||
Blockly.Blocks['c4a_eval'] = {
|
||||
init: function() {
|
||||
this.appendDummyInput()
|
||||
.appendField("EVAL")
|
||||
.appendField("`")
|
||||
.appendField(new Blockly.FieldTextInput("console.log('Hello')"), "CODE")
|
||||
.appendField("`");
|
||||
this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour(BlockColors.VARIABLES);
|
||||
this.setTooltip("Execute JavaScript code");
|
||||
}
|
||||
};
|
||||
|
||||
Blockly.Blocks['c4a_comment'] = {
|
||||
init: function() {
|
||||
this.appendDummyInput()
|
||||
.appendField("#")
|
||||
.appendField(new Blockly.FieldTextInput("Comment", null, {
|
||||
spellcheck: false,
|
||||
class: 'blocklyCommentText'
|
||||
}), "TEXT");
|
||||
this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour("#616161");
|
||||
this.setTooltip("Add a comment");
|
||||
this.setStyle('comment_blocks');
|
||||
}
|
||||
};
|
||||
|
||||
// ============================================
|
||||
// PROCEDURE BLOCKS
|
||||
// ============================================
|
||||
|
||||
Blockly.Blocks['c4a_proc_def'] = {
|
||||
init: function() {
|
||||
this.appendDummyInput()
|
||||
.appendField("PROC")
|
||||
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
|
||||
this.appendStatementInput("BODY")
|
||||
.setCheck(null);
|
||||
this.appendDummyInput()
|
||||
.appendField("ENDPROC");
|
||||
this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour(BlockColors.PROCEDURES);
|
||||
this.setTooltip("Define a procedure");
|
||||
}
|
||||
};
|
||||
|
||||
Blockly.Blocks['c4a_proc_call'] = {
|
||||
init: function() {
|
||||
this.appendDummyInput()
|
||||
.appendField("Call")
|
||||
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
|
||||
this.setPreviousStatement(true, null);
|
||||
this.setNextStatement(true, null);
|
||||
this.setColour(BlockColors.PROCEDURES);
|
||||
this.setTooltip("Call a procedure");
|
||||
}
|
||||
};
|
||||
|
||||
// Code generators have been moved to c4a-generator.js
|
||||
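Every block definition above follows the same `init()` contract: chainable `appendDummyInput()`/`appendField()` calls wire up the fields, then the statement connections, colour, and tooltip are set. As a rough illustration of that contract, a definition's `init` can be exercised against a minimal stub (`FakeBlock` below is a hypothetical stand-in, not the real Blockly API):

```javascript
// Minimal stand-in for a Blockly block, to illustrate the init() contract
// used by the definitions above. FakeBlock is hypothetical, not Blockly API.
class FakeBlock {
    constructor() { this.fields = {}; }
    appendDummyInput() { return this; }          // chainable, like Blockly
    appendField(field, name) {
        if (name !== undefined) this.fields[name] = field;
        return this;
    }
    setPreviousStatement() {}
    setNextStatement() {}
    setColour(c) { this.colour = c; }
    setTooltip(t) { this.tooltip = t; }
}

// A definition in the same shape as the Blockly.Blocks entries above,
// with a plain value standing in for a FieldDropdown.
const pressDef = {
    init: function() {
        this.appendDummyInput()
            .appendField("PRESS")
            .appendField("Tab", "KEY");
        this.setPreviousStatement(true, null);
        this.setNextStatement(true, null);
        this.setColour("#5b80a5");
        this.setTooltip("Press a key");
    }
};

const block = new FakeBlock();
pressDef.init.call(block);
console.log(block.fields.KEY, block.tooltip); // → Tab Press a key
```

The same field names ("KEY", "SELECTOR", …) are what the code generators later read back with `getFieldValue`.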
261
docs/md_v2/apps/c4a-script/assets/c4a-generator.js
Normal file
@@ -0,0 +1,261 @@
// C4A-Script Code Generator for Blockly
// Compatible with latest Blockly API

// Create a custom code generator for C4A-Script
const c4aGenerator = new Blockly.Generator('C4A');

// Helper to get field value with proper escaping
c4aGenerator.getFieldValue = function(block, fieldName) {
    return block.getFieldValue(fieldName);
};

// Navigation generators
c4aGenerator.forBlock['c4a_go'] = function(block, generator) {
    const url = generator.getFieldValue(block, 'URL');
    return `GO ${url}\n`;
};

c4aGenerator.forBlock['c4a_reload'] = function(block, generator) {
    return 'RELOAD\n';
};

c4aGenerator.forBlock['c4a_back'] = function(block, generator) {
    return 'BACK\n';
};

c4aGenerator.forBlock['c4a_forward'] = function(block, generator) {
    return 'FORWARD\n';
};

// Wait generators
c4aGenerator.forBlock['c4a_wait_time'] = function(block, generator) {
    const seconds = generator.getFieldValue(block, 'SECONDS');
    return `WAIT ${seconds}\n`;
};

c4aGenerator.forBlock['c4a_wait_selector'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    const timeout = generator.getFieldValue(block, 'TIMEOUT');
    return `WAIT \`${selector}\` ${timeout}\n`;
};

c4aGenerator.forBlock['c4a_wait_text'] = function(block, generator) {
    const text = generator.getFieldValue(block, 'TEXT');
    const timeout = generator.getFieldValue(block, 'TIMEOUT');
    return `WAIT "${text}" ${timeout}\n`;
};

// Mouse action generators
c4aGenerator.forBlock['c4a_click'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    return `CLICK \`${selector}\`\n`;
};

c4aGenerator.forBlock['c4a_click_xy'] = function(block, generator) {
    const x = generator.getFieldValue(block, 'X');
    const y = generator.getFieldValue(block, 'Y');
    return `CLICK ${x} ${y}\n`;
};

c4aGenerator.forBlock['c4a_double_click'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    return `DOUBLE_CLICK \`${selector}\`\n`;
};

c4aGenerator.forBlock['c4a_right_click'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    return `RIGHT_CLICK \`${selector}\`\n`;
};

c4aGenerator.forBlock['c4a_move'] = function(block, generator) {
    const x = generator.getFieldValue(block, 'X');
    const y = generator.getFieldValue(block, 'Y');
    return `MOVE ${x} ${y}\n`;
};

c4aGenerator.forBlock['c4a_drag'] = function(block, generator) {
    const x1 = generator.getFieldValue(block, 'X1');
    const y1 = generator.getFieldValue(block, 'Y1');
    const x2 = generator.getFieldValue(block, 'X2');
    const y2 = generator.getFieldValue(block, 'Y2');
    return `DRAG ${x1} ${y1} ${x2} ${y2}\n`;
};

c4aGenerator.forBlock['c4a_scroll'] = function(block, generator) {
    const direction = generator.getFieldValue(block, 'DIRECTION');
    const amount = generator.getFieldValue(block, 'AMOUNT');
    return `SCROLL ${direction} ${amount}\n`;
};

// Keyboard generators
c4aGenerator.forBlock['c4a_type'] = function(block, generator) {
    const text = generator.getFieldValue(block, 'TEXT');
    return `TYPE "${text}"\n`;
};

c4aGenerator.forBlock['c4a_type_var'] = function(block, generator) {
    const varName = generator.getFieldValue(block, 'VAR');
    return `TYPE $${varName}\n`;
};

c4aGenerator.forBlock['c4a_clear'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    return `CLEAR \`${selector}\`\n`;
};

c4aGenerator.forBlock['c4a_set'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    const value = generator.getFieldValue(block, 'VALUE');
    return `SET \`${selector}\` "${value}"\n`;
};

c4aGenerator.forBlock['c4a_press'] = function(block, generator) {
    const key = generator.getFieldValue(block, 'KEY');
    return `PRESS ${key}\n`;
};

c4aGenerator.forBlock['c4a_key_down'] = function(block, generator) {
    const key = generator.getFieldValue(block, 'KEY');
    return `KEY_DOWN ${key}\n`;
};

c4aGenerator.forBlock['c4a_key_up'] = function(block, generator) {
    const key = generator.getFieldValue(block, 'KEY');
    return `KEY_UP ${key}\n`;
};

// Control flow generators
c4aGenerator.forBlock['c4a_if_exists'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    const thenCode = generator.statementToCode(block, 'THEN').trim();

    if (thenCode.includes('\n')) {
        // Multi-line then block
        const lines = thenCode.split('\n').filter(line => line.trim());
        return lines.map(line => `IF (EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
    } else if (thenCode) {
        // Single line
        return `IF (EXISTS \`${selector}\`) THEN ${thenCode}\n`;
    }
    return '';
};

c4aGenerator.forBlock['c4a_if_exists_else'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    const thenCode = generator.statementToCode(block, 'THEN').trim();
    const elseCode = generator.statementToCode(block, 'ELSE').trim();

    // For simplicity, only handle single-line then/else
    const thenLine = thenCode.split('\n')[0];
    const elseLine = elseCode.split('\n')[0];

    if (thenLine && elseLine) {
        return `IF (EXISTS \`${selector}\`) THEN ${thenLine} ELSE ${elseLine}\n`;
    } else if (thenLine) {
        return `IF (EXISTS \`${selector}\`) THEN ${thenLine}\n`;
    }
    return '';
};

c4aGenerator.forBlock['c4a_if_not_exists'] = function(block, generator) {
    const selector = generator.getFieldValue(block, 'SELECTOR');
    const thenCode = generator.statementToCode(block, 'THEN').trim();

    if (thenCode.includes('\n')) {
        const lines = thenCode.split('\n').filter(line => line.trim());
        return lines.map(line => `IF (NOT EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
    } else if (thenCode) {
        return `IF (NOT EXISTS \`${selector}\`) THEN ${thenCode}\n`;
    }
    return '';
};

c4aGenerator.forBlock['c4a_if_js'] = function(block, generator) {
    const condition = generator.getFieldValue(block, 'CONDITION');
    const thenCode = generator.statementToCode(block, 'THEN').trim();

    if (thenCode.includes('\n')) {
        const lines = thenCode.split('\n').filter(line => line.trim());
        return lines.map(line => `IF (\`${condition}\`) THEN ${line}`).join('\n') + '\n';
    } else if (thenCode) {
        return `IF (\`${condition}\`) THEN ${thenCode}\n`;
    }
    return '';
};

c4aGenerator.forBlock['c4a_repeat_times'] = function(block, generator) {
    const times = generator.getFieldValue(block, 'TIMES');
    const doCode = generator.statementToCode(block, 'DO').trim();

    if (doCode) {
        // Get first command for repeat
        const firstLine = doCode.split('\n')[0];
        return `REPEAT (${firstLine}, ${times})\n`;
    }
    return '';
};

c4aGenerator.forBlock['c4a_repeat_while'] = function(block, generator) {
    const condition = generator.getFieldValue(block, 'CONDITION');
    const doCode = generator.statementToCode(block, 'DO').trim();

    if (doCode) {
        // Get first command for repeat
        const firstLine = doCode.split('\n')[0];
        return `REPEAT (${firstLine}, \`${condition}\`)\n`;
    }
    return '';
};

// Variable generators
c4aGenerator.forBlock['c4a_setvar'] = function(block, generator) {
    const name = generator.getFieldValue(block, 'NAME');
    const value = generator.getFieldValue(block, 'VALUE');
    return `SETVAR ${name} = "${value}"\n`;
};

// Advanced generators
c4aGenerator.forBlock['c4a_eval'] = function(block, generator) {
    const code = generator.getFieldValue(block, 'CODE');
    return `EVAL \`${code}\`\n`;
};

c4aGenerator.forBlock['c4a_comment'] = function(block, generator) {
    const text = generator.getFieldValue(block, 'TEXT');
    return `# ${text}\n`;
};

// Procedure generators
c4aGenerator.forBlock['c4a_proc_def'] = function(block, generator) {
    const name = generator.getFieldValue(block, 'NAME');
    const body = generator.statementToCode(block, 'BODY');
    return `PROC ${name}\n${body}ENDPROC\n`;
};

c4aGenerator.forBlock['c4a_proc_call'] = function(block, generator) {
    const name = generator.getFieldValue(block, 'NAME');
    return `${name}\n`;
};

// Override scrub_ to handle our custom format
c4aGenerator.scrub_ = function(block, code, opt_thisOnly) {
    const nextBlock = block.nextConnection && block.nextConnection.targetBlock();
    let nextCode = '';

    if (nextBlock) {
        if (!opt_thisOnly) {
            nextCode = c4aGenerator.blockToCode(nextBlock);

            // Add blank line between comment and non-comment blocks
            const currentIsComment = block.type === 'c4a_comment';
            const nextIsComment = nextBlock.type === 'c4a_comment';

            // Add blank line when transitioning from command to comment or vice versa
            if (currentIsComment !== nextIsComment && code.trim() && nextCode.trim()) {
                nextCode = '\n' + nextCode;
            }
        }
    }

    return code + nextCode;
};
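The generators above are all thin templates over `getFieldValue`. Stripped of the Blockly plumbing, the text format they emit can be sketched with plain functions (illustrative only — the `emit` helpers below are not part of c4a-generator.js):

```javascript
// Sketch of the plain-text C4A output the generators produce, without the
// Blockly dependency. These helpers mirror the template strings in
// c4a-generator.js; they are illustrative, not a real API.
const emit = {
    go:    (url)      => `GO ${url}\n`,
    click: (selector) => `CLICK \`${selector}\`\n`,
    type:  (text)     => `TYPE "${text}"\n`,
    ifExists: (selector, thenLine) =>
        `IF (EXISTS \`${selector}\`) THEN ${thenLine}\n`,
};

// Compose a tiny script the same way scrub_ chains consecutive blocks:
// each block's code is simply concatenated with the next block's code.
const script =
    emit.go("https://example.com/login") +
    emit.click("#email") +
    emit.type("demo@example.com") +
    emit.ifExists(".user-avatar", "GO https://example.com/dashboard");

console.log(script);
```

Note the quoting conventions the real generators also follow: selectors and JS snippets are wrapped in backticks, literal text in double quotes.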
531
docs/md_v2/apps/c4a-script/assets/styles.css
Normal file
@@ -0,0 +1,531 @@
/* DankMono Font Faces */
@font-face {
    font-family: 'DankMono';
    src: url('DankMono-Regular.woff2') format('woff2');
    font-weight: 400;
    font-style: normal;
}

@font-face {
    font-family: 'DankMono';
    src: url('DankMono-Bold.woff2') format('woff2');
    font-weight: 700;
    font-style: normal;
}

@font-face {
    font-family: 'DankMono';
    src: url('DankMono-Italic.woff2') format('woff2');
    font-weight: 400;
    font-style: italic;
}

/* Root Variables - Matching docs theme */
:root {
    --global-font-size: 14px;
    --global-code-font-size: 13px;
    --global-line-height: 1.5em;
    --global-space: 10px;
    --font-stack: DankMono, Monaco, Courier New, monospace;
    --mono-font-stack: DankMono, Monaco, Courier New, monospace;

    --background-color: #070708;
    --font-color: #e8e9ed;
    --invert-font-color: #222225;
    --secondary-color: #d5cec0;
    --tertiary-color: #a3abba;
    --primary-color: #0fbbaa;
    --error-color: #ff3c74;
    --progress-bar-background: #3f3f44;
    --progress-bar-fill: #09b5a5;
    --code-bg-color: #3f3f44;
    --block-background-color: #202020;

    --header-height: 55px;
}

/* Base Styles */
* {
    box-sizing: border-box;
}

body {
    margin: 0;
    padding: 0;
    font-family: var(--font-stack);
    font-size: var(--global-font-size);
    line-height: var(--global-line-height);
    color: var(--font-color);
    background-color: var(--background-color);
}

/* Terminal Framework */
.terminal {
    min-height: 100vh;
}

.container {
    width: 100%;
    margin: 0 auto;
}

/* Header */
.header-container {
    position: fixed;
    top: 0;
    left: 0;
    right: 0;
    height: var(--header-height);
    background-color: var(--background-color);
    border-bottom: 1px solid var(--progress-bar-background);
    z-index: 1000;
    padding: 0 calc(var(--global-space) * 2);
}

.terminal-nav {
    display: flex;
    align-items: center;
    justify-content: space-between;
    height: 100%;
}

.terminal-logo h1 {
    margin: 0;
    font-size: 1.2em;
    color: var(--primary-color);
    font-weight: 400;
}

.terminal-menu ul {
    list-style: none;
    margin: 0;
    padding: 0;
    display: flex;
    gap: 2em;
}

.terminal-menu a {
    color: var(--secondary-color);
    text-decoration: none;
    transition: color 0.2s;
}

.terminal-menu a:hover,
.terminal-menu a.active {
    color: var(--primary-color);
}

/* Main Container */
.main-container {
    padding-top: calc(var(--header-height) + 2em);
    padding-left: 2em;
    padding-right: 2em;
    max-width: 1400px;
    margin: 0 auto;
}

/* Tutorial Grid */
.tutorial-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 2em;
    align-items: start;
}

/* Terminal Cards */
.terminal-card {
    background-color: var(--block-background-color);
    border: 1px solid var(--progress-bar-background);
    margin-bottom: 1.5em;
}

.terminal-card header {
    background-color: var(--progress-bar-background);
    padding: 0.8em 1em;
    font-weight: 700;
    color: var(--font-color);
    display: flex;
    justify-content: space-between;
    align-items: center;
}

.terminal-card > div {
    padding: 1.5em;
}

/* Editor Section */
.editor-controls {
    display: flex;
    gap: 0.5em;
}

.editor-container {
    height: 300px;
    overflow: hidden;
}

#c4a-editor {
    width: 100%;
    height: 100%;
    font-family: var(--mono-font-stack);
    font-size: var(--global-code-font-size);
    background-color: var(--code-bg-color);
    color: var(--font-color);
    border: none;
    padding: 1em;
    resize: none;
}

/* JS Output */
.js-output-container {
    max-height: 300px;
    overflow-y: auto;
}

.js-output-container pre {
    margin: 0;
    padding: 1em;
    background-color: var(--code-bg-color);
}

.js-output-container code {
    font-family: var(--mono-font-stack);
    font-size: var(--global-code-font-size);
    color: var(--font-color);
    white-space: pre-wrap;
}

/* Console Output */
.console-output {
    font-family: var(--mono-font-stack);
    font-size: var(--global-code-font-size);
    max-height: 200px;
    overflow-y: auto;
    padding: 1em;
}

.console-line {
    margin-bottom: 0.5em;
}

.console-prompt {
    color: var(--primary-color);
    margin-right: 0.5em;
}

.console-text {
    color: var(--font-color);
}

.console-error {
    color: var(--error-color);
}

.console-success {
    color: var(--primary-color);
}

/* Playground */
.playground-container {
    height: 600px;
    background-color: #fff;
    border: 1px solid var(--progress-bar-background);
}

#playground-frame {
    width: 100%;
    height: 100%;
    border: none;
}

/* Execution Progress */
.execution-progress {
    padding: 1em;
}

.progress-item {
    display: flex;
    align-items: center;
    gap: 0.8em;
    margin-bottom: 0.8em;
    color: var(--secondary-color);
}

.progress-item.active {
    color: var(--primary-color);
}

.progress-item.completed {
    color: var(--tertiary-color);
}

.progress-item.error {
    color: var(--error-color);
}

.progress-icon {
    font-size: 1.2em;
}

/* Buttons */
.btn {
    background-color: var(--primary-color);
    color: var(--background-color);
    border: none;
    padding: 0.5em 1em;
    font-family: var(--font-stack);
    font-size: 0.9em;
    cursor: pointer;
    transition: all 0.2s;
}

.btn:hover {
    background-color: var(--progress-bar-fill);
}

.btn-sm {
    padding: 0.3em 0.8em;
    font-size: 0.85em;
}

.btn-ghost {
    background-color: transparent;
    color: var(--secondary-color);
    border: 1px solid var(--progress-bar-background);
}

.btn-ghost:hover {
    background-color: var(--progress-bar-background);
    color: var(--font-color);
}

/* Scrollbars */
::-webkit-scrollbar {
    width: 8px;
    height: 8px;
}

::-webkit-scrollbar-track {
    background: var(--block-background-color);
}

::-webkit-scrollbar-thumb {
    background: var(--progress-bar-background);
}

::-webkit-scrollbar-thumb:hover {
    background: var(--secondary-color);
}

/* CodeMirror Theme Override */
.CodeMirror {
    font-family: var(--mono-font-stack) !important;
    font-size: var(--global-code-font-size) !important;
    background-color: var(--code-bg-color) !important;
    color: var(--font-color) !important;
    height: 100% !important;
}

.CodeMirror-gutters {
    background-color: var(--progress-bar-background) !important;
    border-right: 1px solid var(--progress-bar-background) !important;
}

/* Responsive */
@media (max-width: 1200px) {
    .tutorial-grid {
        grid-template-columns: 1fr;
    }

    .playground-section {
        order: -1;
    }
}

/* Links */
a {
    color: var(--primary-color);
    text-decoration: none;
}

a:hover {
    text-decoration: underline;
}

/* Lists */
ul, ol {
    padding-left: 2em;
}

li {
    margin-bottom: 0.5em;
}

/* Code */
code {
    background-color: var(--code-bg-color);
    padding: 0.2em 0.4em;
    font-family: var(--mono-font-stack);
    font-size: 0.9em;
}

/* Headings */
h1, h2, h3, h4, h5, h6 {
    font-weight: 700;
    margin-top: 1.5em;
    margin-bottom: 0.8em;
}

h3 {
    color: var(--primary-color);
    font-size: 1.1em;
}

/* Tutorial Panel */
.tutorial-panel {
    position: absolute;
    top: 60px;
    right: 20px;
    width: 380px;
    background: #1a1a1b;
    border: 1px solid #2a2a2c;
    border-radius: 8px;
    box-shadow: 0 8px 32px rgba(0, 0, 0, 0.5);
    z-index: 1000;
    transition: all 0.3s ease;
}

.tutorial-panel.hidden {
    display: none;
}

.tutorial-header {
    display: flex;
    justify-content: space-between;
    align-items: center;
    padding: 16px 20px;
    border-bottom: 1px solid #2a2a2c;
}

.tutorial-header h3 {
    margin: 0;
    color: #0fbbaa;
    font-size: 18px;
}

.close-btn {
    background: none;
    border: none;
    color: #8b8b8d;
    font-size: 24px;
    cursor: pointer;
    padding: 0;
    width: 30px;
    height: 30px;
    display: flex;
    align-items: center;
    justify-content: center;
    border-radius: 4px;
    transition: all 0.2s;
}

.close-btn:hover {
    background: #2a2a2c;
    color: #e0e0e0;
}

.tutorial-content {
    padding: 20px;
}

.tutorial-content p {
    margin: 0 0 16px 0;
    color: #e0e0e0;
    line-height: 1.6;
}

.tutorial-progress {
    margin-top: 16px;
}

.tutorial-progress span {
    display: block;
    margin-bottom: 8px;
    color: #8b8b8d;
    font-size: 12px;
    text-transform: uppercase;
}

.progress-bar {
    height: 4px;
    background: #2a2a2c;
    border-radius: 2px;
    overflow: hidden;
}

.progress-fill {
    height: 100%;
    background: #0fbbaa;
    transition: width 0.3s ease;
}

.tutorial-actions {
    display: flex;
    gap: 12px;
    padding: 0 20px 20px;
}

.tutorial-btn {
    flex: 1;
    padding: 10px 16px;
    background: #2a2a2c;
    color: #e0e0e0;
    border: 1px solid #3a3a3c;
    border-radius: 6px;
    font-size: 14px;
    cursor: pointer;
    transition: all 0.2s;
}

.tutorial-btn:hover:not(:disabled) {
    background: #3a3a3c;
    transform: translateY(-1px);
}

.tutorial-btn:disabled {
    opacity: 0.5;
    cursor: not-allowed;
}

.tutorial-btn.primary {
    background: #0fbbaa;
    color: #070708;
    border-color: #0fbbaa;
}

.tutorial-btn.primary:hover {
    background: #0da89a;
    border-color: #0da89a;
}

/* Tutorial Highlights */
.tutorial-highlight {
    position: relative;
    animation: pulse 2s infinite;
}

@keyframes pulse {
    0% {
        box-shadow: 0 0 0 0 rgba(15, 187, 170, 0.4);
    }
    50% {
        box-shadow: 0 0 0 10px rgba(15, 187, 170, 0);
    }
    100% {
        box-shadow: 0 0 0 0 rgba(15, 187, 170, 0);
    }
}

.editor-card {
    position: relative;
}
21
docs/md_v2/apps/c4a-script/blockly-demo.c4a
Normal file
@@ -0,0 +1,21 @@
# Demo: Login Flow with Blockly
# This script can be created visually using Blockly blocks

GO https://example.com/login
WAIT `#login-form` 5

# Check if already logged in
IF (EXISTS `.user-avatar`) THEN GO https://example.com/dashboard

# Fill login form
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "password123"

# Submit form
CLICK `button[type="submit"]`
WAIT `.dashboard` 10

# Success message
EVAL `console.log('Login successful!')`
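For intuition, a command like ``CLICK `#email` `` in the demo script ultimately drives a browser action. A hedged sketch of how such a line could lower to browser-side JavaScript (illustrative only — the real C4A compiler is part of the Crawl4AI toolchain and handles far more than these three commands):

```javascript
// Hypothetical lowering of a single C4A line to a browser-side JS statement.
// Covers only GO / CLICK / TYPE; everything else is left as a comment.
function lowerLine(line) {
    let m;
    if ((m = line.match(/^GO (.+)$/)))
        return `window.location.href = ${JSON.stringify(m[1])};`;
    if ((m = line.match(/^CLICK `(.+)`$/)))
        return `document.querySelector(${JSON.stringify(m[1])}).click();`;
    if ((m = line.match(/^TYPE "(.+)"$/)))
        return `document.activeElement.value += ${JSON.stringify(m[1])};`;
    return `// unsupported: ${line}`;
}

console.log(lowerLine("CLICK `#email`"));
// → document.querySelector("#email").click();
```

This is also why selectors are backtick-quoted in C4A: the backticks delimit the selector unambiguously, so it can contain quotes and attribute brackets like `button[type="submit"]`.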
205
docs/md_v2/apps/c4a-script/index.html
Normal file
@@ -0,0 +1,205 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>C4A-Script Interactive Tutorial | Crawl4AI</title>
    <link rel="stylesheet" href="assets/app.css">
    <link rel="stylesheet" href="assets/blockly-theme.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/codemirror.min.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/theme/material-darker.min.css">
</head>
<body>
    <!-- Tutorial Intro Modal -->
    <div id="tutorial-intro" class="tutorial-intro-modal">
        <div class="intro-content">
            <h2>Welcome to C4A-Script Tutorial!</h2>
            <p>C4A-Script is a simple language for web automation. This interactive tutorial will teach you:</p>
            <ul>
                <li>How to handle popups and banners</li>
                <li>Form filling and navigation</li>
                <li>Advanced automation techniques</li>
            </ul>
            <div class="intro-actions">
                <button id="start-tutorial-btn" class="intro-btn primary">Start Tutorial</button>
                <button id="skip-tutorial-btn" class="intro-btn">Skip</button>
            </div>
        </div>
    </div>

    <!-- Event Editor Modal -->
    <div id="event-editor-overlay" class="modal-overlay hidden"></div>
    <div id="event-editor-modal" class="event-editor-modal hidden">
        <h4>Edit Event</h4>
        <div class="editor-field">
            <label>Command Type</label>
            <select id="edit-command-type" disabled>
                <option value="CLICK">CLICK</option>
                <option value="DOUBLE_CLICK">DOUBLE_CLICK</option>
                <option value="RIGHT_CLICK">RIGHT_CLICK</option>
                <option value="TYPE">TYPE</option>
                <option value="SET">SET</option>
                <option value="SCROLL">SCROLL</option>
                <option value="WAIT">WAIT</option>
            </select>
        </div>
        <div id="edit-selector-field" class="editor-field">
            <label>Selector</label>
            <input type="text" id="edit-selector" placeholder=".class or #id">
        </div>
        <div id="edit-value-field" class="editor-field">
            <label>Value</label>
            <input type="text" id="edit-value" placeholder="Text or number">
        </div>
        <div id="edit-direction-field" class="editor-field hidden">
            <label>Direction</label>
            <select id="edit-direction">
                <option value="UP">UP</option>
                <option value="DOWN">DOWN</option>
                <option value="LEFT">LEFT</option>
                <option value="RIGHT">RIGHT</option>
            </select>
        </div>
        <div class="editor-actions">
            <button id="edit-cancel" class="mini-btn">Cancel</button>
            <button id="edit-save" class="mini-btn primary">Save</button>
        </div>
    </div>

    <!-- Main App Layout -->
    <div class="app-container">
        <!-- Left Panel: Editor -->
        <div class="editor-panel">
            <div class="panel-header">
                <h2>C4A-Script Editor</h2>
                <div class="header-actions">
                    <button id="tutorial-btn" class="action-btn" title="Tutorial">
                        <span class="icon">📚</span>
                    </button>
                    <button id="examples-btn" class="action-btn" title="Examples">
                        <span class="icon">📋</span>
                    </button>
                    <button id="clear-btn" class="action-btn" title="Clear">
                        <span class="icon">🗑</span>
                    </button>
                    <button id="run-btn" class="action-btn primary">
                        <span class="icon">▶</span>Run
                    </button>
                    <button id="record-btn" class="action-btn record">
                        <span class="icon">⏺</span>Record
                    </button>
                    <button id="timeline-btn" class="action-btn timeline hidden" title="View Timeline">
                        <span class="icon">📊</span>
                    </button>
                </div>
            </div>

            <div class="editor-container">
                <div id="editor-view" class="editor-wrapper">
                    <textarea id="c4a-editor" placeholder="# Write your C4A script here..."></textarea>
                </div>

                <!-- Recording Timeline -->
                <div id="timeline-view" class="recording-timeline hidden">
                    <div class="timeline-header">
                        <h3>Recording Timeline</h3>
                        <div class="timeline-actions">
                            <button id="back-to-editor" class="mini-btn">← Back</button>
                            <button id="select-all-events" class="mini-btn">Select All</button>
                            <button id="clear-events" class="mini-btn">Clear</button>
                            <button id="generate-script" class="mini-btn primary">Generate Script</button>
                        </div>
                    </div>
                    <div id="timeline-events" class="timeline-events">
                        <!-- Events will be added here dynamically -->
                    </div>
                </div>
            </div>

            <!-- Bottom: Output Tabs -->
            <div class="output-section">
                <div class="tabs">
                    <button class="tab active" data-tab="console">Console</button>
                    <button class="tab" data-tab="javascript">Generated JS</button>
                </div>
                <div class="tab-content">
                    <div id="console-tab" class="tab-pane active">
                        <div id="console-output" class="console">
                            <div class="console-line">
                                <span class="console-prompt">$</span>
                                <span class="console-text">Ready to run C4A scripts...</span>
                            </div>
                        </div>
                    </div>
                    <div id="javascript-tab" class="tab-pane">
                        <div class="js-output-header">
                            <div class="js-actions">
                                <button id="copy-js-btn" class="mini-btn" title="Copy">
|
||||
<span>📋</span>
|
||||
</button>
|
||||
<button id="edit-js-btn" class="mini-btn" title="Edit">
|
||||
<span>✏️</span>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
<pre id="js-output" class="js-output">// JavaScript will appear here...</pre>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Right Panel: Playground -->
|
||||
<div class="playground-panel">
|
||||
<div class="panel-header">
|
||||
<h2>Playground</h2>
|
||||
<div class="header-actions">
|
||||
<button id="reset-playground" class="action-btn" title="Reset">
|
||||
<span class="icon">🔄</span>
|
||||
</button>
|
||||
<button id="fullscreen-btn" class="action-btn" title="Fullscreen">
|
||||
<span class="icon">⛶</span>
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
<div class="playground-wrapper">
|
||||
<iframe id="playground-frame" src="playground/" title="Playground"></iframe>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Tutorial Navigation Bar -->
|
||||
<div id="tutorial-nav" class="tutorial-nav hidden">
|
||||
<div class="tutorial-nav-content">
|
||||
<div class="tutorial-left">
|
||||
<div class="tutorial-step-title">
|
||||
<span id="tutorial-step-info">Step 1 of 9</span>
|
||||
<span id="tutorial-title">Welcome</span>
|
||||
</div>
|
||||
<p id="tutorial-description" class="tutorial-description">Let's start by waiting for the page to load.</p>
|
||||
</div>
|
||||
<div class="tutorial-right">
|
||||
<div class="tutorial-controls">
|
||||
<button id="tutorial-prev" class="nav-btn" disabled>← Previous</button>
|
||||
<button id="tutorial-next" class="nav-btn primary">Next →</button>
|
||||
</div>
|
||||
<button id="tutorial-exit" class="exit-btn" title="Exit Tutorial">×</button>
|
||||
</div>
|
||||
</div>
|
||||
<div class="tutorial-progress-bar">
|
||||
<div id="tutorial-progress-fill" class="progress-fill"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Scripts -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.7/codemirror.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.65.7/mode/javascript/javascript.min.js"></script>

<!-- Blockly -->
<script src="https://unpkg.com/blockly/blockly.min.js"></script>
<script src="assets/c4a-blocks.js"></script>
<script src="assets/c4a-generator.js"></script>
<script src="assets/blockly-manager.js"></script>

<script src="assets/app.js"></script>
</body>
</html>
604
docs/md_v2/apps/c4a-script/playground/app.js
Normal file
@@ -0,0 +1,604 @@
// Playground App JavaScript
class PlaygroundApp {
constructor() {
this.isLoggedIn = false;
this.currentSection = 'home';
this.productsLoaded = 0;
this.maxProducts = 100;
this.tableRowsLoaded = 10;
this.inspectorMode = false;
this.tooltip = null;

this.init();
}

init() {
this.setupCookieBanner();
this.setupNewsletterPopup();
this.setupNavigation();
this.setupAuth();
this.setupProductCatalog();
this.setupForms();
this.setupTabs();
this.setupDataTable();
this.setupInspector();
this.loadInitialData();
}

// Cookie Banner
setupCookieBanner() {
const banner = document.getElementById('cookie-banner');
const acceptBtn = banner.querySelector('.accept');
const declineBtn = banner.querySelector('.decline');

acceptBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('✅ Cookies accepted');
});

declineBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('❌ Cookies declined');
});
}

// Newsletter Popup
setupNewsletterPopup() {
const popup = document.getElementById('newsletter-popup');
const closeBtn = popup.querySelector('.close');
const subscribeBtn = popup.querySelector('.subscribe');

// Show popup after 3 seconds
setTimeout(() => {
popup.style.display = 'flex';
}, 3000);

closeBtn.addEventListener('click', () => {
popup.style.display = 'none';
});

subscribeBtn.addEventListener('click', () => {
const email = popup.querySelector('input').value;
if (email) {
console.log(`📧 Subscribed: ${email}`);
popup.style.display = 'none';
}
});

// Close on outside click
popup.addEventListener('click', (e) => {
if (e.target === popup) {
popup.style.display = 'none';
}
});
}

// Navigation
setupNavigation() {
const navLinks = document.querySelectorAll('.nav-link');
const sections = document.querySelectorAll('.section');

navLinks.forEach(link => {
link.addEventListener('click', (e) => {
e.preventDefault();
const targetId = link.getAttribute('href').substring(1);

// Update active states
navLinks.forEach(l => l.classList.remove('active'));
link.classList.add('active');

// Show target section
sections.forEach(s => s.classList.remove('active'));
const targetSection = document.getElementById(targetId);
if (targetSection) {
targetSection.classList.add('active');
this.currentSection = targetId;

// Load content for specific sections
this.loadSectionContent(targetId);
}
});
});

// Start tutorial button
const startBtn = document.getElementById('start-tutorial');
if (startBtn) {
startBtn.addEventListener('click', () => {
console.log('🚀 Tutorial started!');
alert('Tutorial started! Check the console for progress.');
});
}
}

// Authentication
setupAuth() {
const loginBtn = document.getElementById('login-btn');
const logoutBtn = document.getElementById('logout-btn');
const loginModal = document.getElementById('login-modal');
const loginForm = document.getElementById('login-form');
const closeBtn = loginModal.querySelector('.close');

loginBtn.addEventListener('click', () => {
loginModal.style.display = 'flex';
});

closeBtn.addEventListener('click', () => {
loginModal.style.display = 'none';
});

loginForm.addEventListener('submit', (e) => {
e.preventDefault();
const email = document.getElementById('email').value;
const password = document.getElementById('password').value;
const rememberMe = document.getElementById('remember-me').checked;
const messageEl = document.getElementById('login-message');

// Simple validation
if (email === 'demo@example.com' && password === 'demo123') {
this.isLoggedIn = true;
messageEl.textContent = '✅ Login successful!';
messageEl.className = 'form-message success';

setTimeout(() => {
loginModal.style.display = 'none';
document.getElementById('login-btn').style.display = 'none';
document.getElementById('user-info').style.display = 'flex';
document.getElementById('username-display').textContent = 'Demo User';
console.log(`✅ Logged in${rememberMe ? ' (remembered)' : ''}`);
}, 1000);
} else {
messageEl.textContent = '❌ Invalid credentials. Try demo@example.com / demo123';
messageEl.className = 'form-message error';
}
});

logoutBtn.addEventListener('click', () => {
this.isLoggedIn = false;
document.getElementById('login-btn').style.display = 'block';
document.getElementById('user-info').style.display = 'none';
console.log('👋 Logged out');
});

// Close modal on outside click
loginModal.addEventListener('click', (e) => {
if (e.target === loginModal) {
loginModal.style.display = 'none';
}
});
}

// Product Catalog
setupProductCatalog() {
// View toggle
const infiniteBtn = document.getElementById('infinite-scroll-btn');
const paginationBtn = document.getElementById('pagination-btn');
const infiniteView = document.getElementById('infinite-scroll-view');
const paginationView = document.getElementById('pagination-view');

infiniteBtn.addEventListener('click', () => {
infiniteBtn.classList.add('active');
paginationBtn.classList.remove('active');
infiniteView.style.display = 'block';
paginationView.style.display = 'none';
this.setupInfiniteScroll();
});

paginationBtn.addEventListener('click', () => {
paginationBtn.classList.add('active');
infiniteBtn.classList.remove('active');
paginationView.style.display = 'block';
infiniteView.style.display = 'none';
});

// Load more button
const loadMoreBtn = paginationView.querySelector('.load-more');
loadMoreBtn.addEventListener('click', () => {
this.loadMoreProducts();
});

// Collapsible filters
const collapsibles = document.querySelectorAll('.collapsible');
collapsibles.forEach(header => {
header.addEventListener('click', () => {
const content = header.nextElementSibling;
const toggle = header.querySelector('.toggle');
content.style.display = content.style.display === 'none' ? 'block' : 'none';
toggle.textContent = content.style.display === 'none' ? '▶' : '▼';
});
});
}

setupInfiniteScroll() {
const container = document.querySelector('.products-container');
const loadingIndicator = document.getElementById('loading-indicator');

// Bind the scroll handler only once; toggling back to this view must not stack duplicate listeners
if (this.infiniteScrollBound) return;
this.infiniteScrollBound = true;

container.addEventListener('scroll', () => {
if (container.scrollTop + container.clientHeight >= container.scrollHeight - 100) {
if (this.productsLoaded < this.maxProducts) {
loadingIndicator.style.display = 'block';
setTimeout(() => {
this.loadMoreProducts();
loadingIndicator.style.display = 'none';
}, 1000);
}
}
});
}

loadMoreProducts() {
const grid = document.getElementById('product-grid');
const batch = 10;
let added = 0;

for (let i = 0; i < batch && this.productsLoaded < this.maxProducts; i++) {
const product = this.createProductCard(this.productsLoaded + 1);
grid.appendChild(product);
this.productsLoaded++;
added++;
}

// Report how many were actually added (may be fewer than the batch size at the cap)
console.log(`📦 Loaded ${added} more products. Total: ${this.productsLoaded}`);
}

createProductCard(id) {
const card = document.createElement('div');
card.className = 'product-card';
card.innerHTML = `
<div class="product-image">📦</div>
<div class="product-name">Product ${id}</div>
<div class="product-price">$${(Math.random() * 100 + 10).toFixed(2)}</div>
<button class="btn btn-sm">Quick View</button>
`;

// Quick view functionality
const quickViewBtn = card.querySelector('button');
quickViewBtn.addEventListener('click', () => {
alert(`Quick view for Product ${id}`);
});

return card;
}

// Forms
setupForms() {
// Contact Form
const contactForm = document.getElementById('contact-form');
const subjectSelect = document.getElementById('contact-subject');
const departmentGroup = document.getElementById('department-group');
const departmentSelect = document.getElementById('department');

subjectSelect.addEventListener('change', () => {
if (subjectSelect.value === 'support') {
departmentGroup.style.display = 'block';
departmentSelect.innerHTML = `
<option value="">Select department</option>
<option value="technical">Technical Support</option>
<option value="billing">Billing Support</option>
<option value="general">General Support</option>
`;
} else {
departmentGroup.style.display = 'none';
}
});

contactForm.addEventListener('submit', (e) => {
e.preventDefault();
const messageDisplay = document.getElementById('contact-message-display');
messageDisplay.textContent = '✅ Message sent successfully!';
messageDisplay.className = 'form-message success';
console.log('📧 Contact form submitted');
});

// Multi-step Form
const surveyForm = document.getElementById('survey-form');
const steps = surveyForm.querySelectorAll('.form-step');
const progressFill = document.getElementById('progress-fill');
let currentStep = 1;

surveyForm.addEventListener('click', (e) => {
if (e.target.classList.contains('next-step')) {
if (currentStep < 3) {
steps[currentStep - 1].style.display = 'none';
currentStep++;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
} else if (e.target.classList.contains('prev-step')) {
if (currentStep > 1) {
steps[currentStep - 1].style.display = 'none';
currentStep--;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
}
});

surveyForm.addEventListener('submit', (e) => {
e.preventDefault();
document.getElementById('survey-success').style.display = 'block';
console.log('📋 Survey submitted successfully!');
});
}

// Tabs
setupTabs() {
const tabBtns = document.querySelectorAll('.tab-btn');
const tabPanes = document.querySelectorAll('.tab-pane');

tabBtns.forEach(btn => {
btn.addEventListener('click', () => {
const targetTab = btn.getAttribute('data-tab');

// Update active states
tabBtns.forEach(b => b.classList.remove('active'));
btn.classList.add('active');

// Show target pane
tabPanes.forEach(pane => {
pane.style.display = pane.id === targetTab ? 'block' : 'none';
});
});
});

// Show more functionality
const showMoreBtn = document.querySelector('.show-more');
const hiddenText = document.querySelector('.hidden-text');

if (showMoreBtn) {
showMoreBtn.addEventListener('click', () => {
if (hiddenText.style.display === 'none') {
hiddenText.style.display = 'block';
showMoreBtn.textContent = 'Show Less';
} else {
hiddenText.style.display = 'none';
showMoreBtn.textContent = 'Show More';
}
});
}

// Load comments
const loadCommentsBtn = document.querySelector('.load-comments');
const commentsSection = document.querySelector('.comments-section');

if (loadCommentsBtn) {
loadCommentsBtn.addEventListener('click', () => {
commentsSection.style.display = 'block';
commentsSection.innerHTML = `
<div class="comment">
<div class="comment-author">John Doe</div>
<div class="comment-text">Great product! Highly recommended.</div>
</div>
<div class="comment">
<div class="comment-author">Jane Smith</div>
<div class="comment-text">Excellent quality and fast shipping.</div>
</div>
`;
loadCommentsBtn.style.display = 'none';
console.log('💬 Comments loaded');
});
}
}

// Data Table
setupDataTable() {
const loadMoreBtn = document.querySelector('.load-more-rows');
const searchInput = document.querySelector('.search-input');
const exportBtn = document.getElementById('export-btn');
const sortableHeaders = document.querySelectorAll('.sortable');

// Load more rows
loadMoreBtn.addEventListener('click', () => {
this.loadMoreTableRows();
});

// Search functionality
searchInput.addEventListener('input', (e) => {
const searchTerm = e.target.value.toLowerCase();
const rows = document.querySelectorAll('#table-body tr');

rows.forEach(row => {
const text = row.textContent.toLowerCase();
row.style.display = text.includes(searchTerm) ? '' : 'none';
});
});

// Export functionality
exportBtn.addEventListener('click', () => {
console.log('📊 Exporting table data...');
alert('Table data exported! (Check console)');
});

// Sorting
sortableHeaders.forEach(header => {
header.addEventListener('click', () => {
console.log(`🔄 Sorting by ${header.getAttribute('data-sort')}`);
});
});
}

loadMoreTableRows() {
const tbody = document.getElementById('table-body');
const batch = 10;

for (let i = 0; i < batch; i++) {
const row = document.createElement('tr');
const id = this.tableRowsLoaded + i + 1;
row.innerHTML = `
<td>User ${id}</td>
<td>user${id}@example.com</td>
<td>${new Date().toLocaleDateString()}</td>
<td><button class="btn btn-sm">Edit</button></td>
`;
tbody.appendChild(row);
}

this.tableRowsLoaded += batch;
console.log(`📄 Loaded ${batch} more rows. Total: ${this.tableRowsLoaded}`);
}

// Load initial data
loadInitialData() {
// Load initial products
this.loadMoreProducts();

// Load initial table rows
this.loadMoreTableRows();
}

// Load content when navigating to sections
loadSectionContent(sectionId) {
switch(sectionId) {
case 'catalog':
// Ensure products are loaded in catalog
if (this.productsLoaded === 0) {
this.loadMoreProducts();
}
break;
case 'data-tables':
// Ensure table rows are loaded
if (this.tableRowsLoaded === 0) {
this.loadMoreTableRows();
}
break;
case 'forms':
// Forms are already set up
break;
case 'tabs':
// Tabs content is static
break;
}
}

// Inspector Mode
setupInspector() {
const inspectorBtn = document.getElementById('inspector-btn');

// Create tooltip element
this.tooltip = document.createElement('div');
this.tooltip.className = 'inspector-tooltip';
this.tooltip.style.cssText = `
position: fixed;
background: rgba(0, 0, 0, 0.9);
color: white;
padding: 8px 12px;
border-radius: 4px;
font-size: 12px;
font-family: monospace;
pointer-events: none;
z-index: 10000;
display: none;
max-width: 300px;
`;
document.body.appendChild(this.tooltip);

inspectorBtn.addEventListener('click', () => {
this.toggleInspector();
});

// Add mouse event listeners
document.addEventListener('mousemove', this.handleMouseMove.bind(this));
document.addEventListener('mouseout', this.handleMouseOut.bind(this));
}

toggleInspector() {
this.inspectorMode = !this.inspectorMode;
const inspectorBtn = document.getElementById('inspector-btn');

if (this.inspectorMode) {
inspectorBtn.classList.add('active');
inspectorBtn.style.background = '#0fbbaa';
document.body.style.cursor = 'crosshair';
} else {
inspectorBtn.classList.remove('active');
inspectorBtn.style.background = '';
document.body.style.cursor = '';
this.tooltip.style.display = 'none';
this.removeHighlight();
}
}

handleMouseMove(e) {
if (!this.inspectorMode) return;

const element = e.target;
if (element === this.tooltip) return;

// Highlight element
this.highlightElement(element);

// Show tooltip with element info
const info = this.getElementInfo(element);
this.tooltip.innerHTML = info;
this.tooltip.style.display = 'block';

// Position tooltip
const x = e.clientX + 15;
const y = e.clientY + 15;

// Adjust position if tooltip would go off screen
const rect = this.tooltip.getBoundingClientRect();
const adjustedX = x + rect.width > window.innerWidth ? x - rect.width - 30 : x;
const adjustedY = y + rect.height > window.innerHeight ? y - rect.height - 30 : y;

this.tooltip.style.left = adjustedX + 'px';
this.tooltip.style.top = adjustedY + 'px';
}

handleMouseOut(e) {
if (!this.inspectorMode) return;
if (e.target === document.body) {
this.removeHighlight();
this.tooltip.style.display = 'none';
}
}

highlightElement(element) {
this.removeHighlight();
element.style.outline = '2px solid #0fbbaa';
element.style.outlineOffset = '1px';
element.setAttribute('data-inspector-highlighted', 'true');
}

removeHighlight() {
const highlighted = document.querySelector('[data-inspector-highlighted]');
if (highlighted) {
highlighted.style.outline = '';
highlighted.style.outlineOffset = '';
highlighted.removeAttribute('data-inspector-highlighted');
}
}

getElementInfo(element) {
const tagName = element.tagName.toLowerCase();
const id = element.id ? `#${element.id}` : '';
const classes = element.className ?
`.${element.className.split(' ').filter(c => c).join('.')}` : '';

let selector = tagName;
if (id) {
selector = id;
} else if (classes) {
selector = `${tagName}${classes}`;
}

// Build info HTML
let info = `<strong>${selector}</strong>`;

// Add additional attributes
const attrs = [];
if (element.name) attrs.push(`name="${element.name}"`);
if (element.type) attrs.push(`type="${element.type}"`);
if (element.href) attrs.push(`href="${element.href}"`);
if (element.value && element.tagName === 'INPUT') attrs.push(`value="${element.value}"`);

if (attrs.length > 0) {
info += `<br><span style="color: #888;">${attrs.join(' ')}</span>`;
}

return info;
}
}

// Initialize app when DOM is ready
document.addEventListener('DOMContentLoaded', () => {
window.playgroundApp = new PlaygroundApp();
console.log('🎮 Playground app initialized!');
});
328
docs/md_v2/apps/c4a-script/playground/index.html
Normal file
@@ -0,0 +1,328 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>C4A-Script Playground</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<!-- Cookie Banner -->
<div class="cookie-banner" id="cookie-banner">
<div class="cookie-content">
<p>🍪 We use cookies to enhance your experience. By continuing, you agree to our cookie policy.</p>
<div class="cookie-actions">
<button class="btn accept">Accept All</button>
<button class="btn btn-secondary decline">Decline</button>
</div>
</div>
</div>

<!-- Newsletter Popup (appears after 3 seconds) -->
<div class="modal" id="newsletter-popup" style="display: none;">
<div class="modal-content">
<span class="close">×</span>
<h2>📬 Subscribe to Our Newsletter</h2>
<p>Get the latest updates on web automation!</p>
<input type="email" placeholder="Enter your email" class="input">
<button class="btn subscribe">Subscribe</button>
</div>
</div>

<!-- Header -->
<header class="site-header">
<nav class="nav-menu">
<a href="#home" class="nav-link active">Home</a>
<a href="#catalog" class="nav-link" id="catalog-link">Products</a>
<a href="#forms" class="nav-link">Forms</a>
<a href="#data-tables" class="nav-link">Data Tables</a>
<div class="dropdown">
<a href="#" class="nav-link dropdown-toggle">More ▼</a>
<div class="dropdown-content">
<a href="#tabs">Tabs Demo</a>
<a href="#accordion">FAQ</a>
<a href="#gallery">Gallery</a>
</div>
</div>
</nav>
<div class="auth-section">
<button class="btn btn-sm" id="inspector-btn" title="Toggle Inspector">🔍</button>
<button class="btn btn-sm" id="login-btn">Login</button>
<div class="user-info" id="user-info" style="display: none;">
<span class="user-avatar">👤</span>
<span class="welcome-message">Welcome, <span id="username-display">User</span>!</span>
<button class="btn btn-sm btn-secondary" id="logout-btn">Logout</button>
</div>
</div>
</header>

<!-- Main Content -->
<main class="main-content">
<!-- Home Section -->
<section id="home" class="section active">
<h1>Welcome to C4A-Script Playground</h1>
<p>This is an interactive demo for testing C4A-Script commands. Each section contains different challenges for web automation.</p>

<button class="btn btn-primary" id="start-tutorial">Start Tutorial</button>

<div class="feature-grid">
<div class="feature-card">
<h3>🔐 Authentication</h3>
<p>Test login forms and user sessions</p>
</div>
<div class="feature-card">
<h3>📜 Dynamic Content</h3>
<p>Infinite scroll and pagination</p>
</div>
<div class="feature-card">
<h3>📝 Forms</h3>
<p>Complex form interactions</p>
</div>
<div class="feature-card">
<h3>📊 Data Tables</h3>
<p>Sortable and filterable data</p>
</div>
</div>
</section>

<!-- Login Modal -->
<div class="modal" id="login-modal" style="display: none;">
<div class="modal-content login-form">
<span class="close">×</span>
<h2>Login</h2>
<form id="login-form">
<div class="form-group">
<label>Email</label>
<input type="email" id="email" class="input" placeholder="demo@example.com">
</div>
<div class="form-group">
<label>Password</label>
<input type="password" id="password" class="input" placeholder="demo123">
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="remember-me">
Remember me
</label>
</div>
<button type="submit" class="btn btn-primary">Login</button>
<div class="form-message" id="login-message"></div>
</form>
</div>
</div>

<!-- Product Catalog Section -->
<section id="catalog" class="section">
<h1>Product Catalog</h1>

<div class="view-toggle">
<button class="btn btn-sm active" id="infinite-scroll-btn">Infinite Scroll</button>
<button class="btn btn-sm" id="pagination-btn">Pagination</button>
</div>

<!-- Filters Sidebar -->
<div class="catalog-layout">
<aside class="filters-sidebar">
<h3>Filters</h3>
<div class="filter-group">
<h4 class="collapsible">Category <span class="toggle">▼</span></h4>
<div class="filter-content">
<label><input type="checkbox"> Electronics</label>
<label><input type="checkbox"> Clothing</label>
<label><input type="checkbox"> Books</label>
</div>
</div>
<div class="filter-group">
<h4 class="collapsible">Price Range <span class="toggle">▼</span></h4>
<div class="filter-content">
<input type="range" min="0" max="1000" value="500">
<span>$0 - $500</span>
</div>
</div>
</aside>

<!-- Products Grid -->
<div class="products-container">
<div class="product-grid" id="product-grid">
<!-- Products will be loaded here -->
</div>

<!-- Infinite Scroll View -->
<div id="infinite-scroll-view" class="view-mode">
<div class="loading-indicator" id="loading-indicator" style="display: none;">
<div class="spinner"></div>
<p>Loading more products...</p>
</div>
</div>

<!-- Pagination View -->
<div id="pagination-view" class="view-mode" style="display: none;">
<button class="btn load-more">Load More</button>
<div class="pagination">
<button class="page-btn">1</button>
<button class="page-btn">2</button>
<button class="page-btn">3</button>
</div>
</div>
</div>
</div>
</section>

<!-- Forms Section -->
<section id="forms" class="section">
<h1>Form Examples</h1>

<!-- Contact Form -->
<div class="form-card">
<h2>Contact Form</h2>
<form id="contact-form">
<div class="form-group">
<label>Name</label>
<input type="text" class="input" id="contact-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="contact-email">
</div>
<div class="form-group">
<label>Subject</label>
<select class="input" id="contact-subject">
<option value="">Select a subject</option>
<option value="support">Support</option>
<option value="sales">Sales</option>
<option value="feedback">Feedback</option>
</select>
</div>
<div class="form-group" id="department-group" style="display: none;">
<label>Department</label>
<select class="input" id="department">
<option value="">Select department</option>
</select>
</div>
<div class="form-group">
<label>Message</label>
<textarea class="input" id="contact-message" rows="4"></textarea>
</div>
<button type="submit" class="btn btn-primary">Send Message</button>
<div class="form-message" id="contact-message-display"></div>
</form>
</div>

<!-- Multi-step Form -->
<div class="form-card">
<h2>Multi-step Survey</h2>
<div class="progress-bar">
<div class="progress-fill" id="progress-fill" style="width: 33%"></div>
</div>
<form id="survey-form">
<!-- Step 1 -->
<div class="form-step active" data-step="1">
<h3>Step 1: Basic Information</h3>
<div class="form-group">
<label>Full Name</label>
<input type="text" class="input" id="full-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="survey-email">
</div>
<button type="button" class="btn next-step">Next</button>
</div>

<!-- Step 2 -->
<div class="form-step" data-step="2" style="display: none;">
<h3>Step 2: Preferences</h3>
<div class="form-group">
<label>Interests (select multiple)</label>
<select multiple class="input" id="interests">
<option value="tech">Technology</option>
<option value="sports">Sports</option>
<option value="music">Music</option>
<option value="travel">Travel</option>
</select>
</div>
<button type="button" class="btn prev-step">Previous</button>
|
||||
<button type="button" class="btn next-step">Next</button>
|
||||
</div>
|
||||
|
||||
<!-- Step 3 -->
|
||||
<div class="form-step" data-step="3" style="display: none;">
|
||||
<h3>Step 3: Confirmation</h3>
|
||||
<p>Please review your information and submit.</p>
|
||||
<button type="button" class="btn prev-step">Previous</button>
|
||||
<button type="submit" class="btn btn-primary" id="submit-survey">Submit Survey</button>
|
||||
</div>
|
||||
</form>
|
||||
<div class="form-message success-message" id="survey-success" style="display: none;">
|
||||
✅ Survey submitted successfully!
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Tabs Section -->
|
||||
<section id="tabs" class="section">
|
||||
<h1>Tabs Demo</h1>
|
||||
<div class="tabs-container">
|
||||
<div class="tabs-header">
|
||||
<button class="tab-btn active" data-tab="description">Description</button>
|
||||
<button class="tab-btn" data-tab="reviews">Reviews</button>
|
||||
<button class="tab-btn" data-tab="specs">Specifications</button>
|
||||
</div>
|
||||
<div class="tabs-content">
|
||||
<div class="tab-pane active" id="description">
|
||||
<h3>Product Description</h3>
|
||||
<p>This is a detailed description of the product...</p>
|
||||
<div class="expandable-text">
|
||||
<p class="text-preview">Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
|
||||
<button class="btn btn-sm show-more">Show More</button>
|
||||
<div class="hidden-text" style="display: none;">
|
||||
<p>This is the hidden text that appears when you click "Show More". It contains additional details about the product that weren't visible initially.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="tab-pane" id="reviews" style="display: none;">
|
||||
<h3>Customer Reviews</h3>
|
||||
<button class="btn btn-sm load-comments">Load Comments</button>
|
||||
<div class="comments-section" style="display: none;">
|
||||
<!-- Comments will be loaded here -->
|
||||
</div>
|
||||
</div>
|
||||
<div class="tab-pane" id="specs" style="display: none;">
|
||||
<h3>Technical Specifications</h3>
|
||||
<table class="specs-table">
|
||||
<tr><td>Model</td><td>XYZ-2000</td></tr>
|
||||
<tr><td>Weight</td><td>2.5 kg</td></tr>
|
||||
<tr><td>Dimensions</td><td>30 x 20 x 10 cm</td></tr>
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- Data Tables Section -->
|
||||
<section id="data-tables" class="section">
|
||||
<h1>Data Tables</h1>
|
||||
<div class="table-controls">
|
||||
<input type="text" class="input search-input" placeholder="Search...">
|
||||
<button class="btn btn-sm" id="export-btn">Export</button>
|
||||
</div>
|
||||
<table class="data-table" id="data-table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th class="sortable" data-sort="name">Name ↕</th>
|
||||
<th class="sortable" data-sort="email">Email ↕</th>
|
||||
<th class="sortable" data-sort="date">Date ↕</th>
|
||||
<th>Actions</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody id="table-body">
|
||||
<!-- Table rows will be loaded here -->
|
||||
</tbody>
|
||||
</table>
|
||||
<button class="btn load-more-rows">Load More Rows</button>
|
||||
</section>
|
||||
</main>
|
||||
|
||||
<script src="app.js"></script>
|
||||
</body>
|
||||
</html>
|
||||
627
docs/md_v2/apps/c4a-script/playground/styles.css
Normal file
@@ -0,0 +1,627 @@
/* Playground Styles - Modern Web App Theme */
:root {
    --primary-color: #0fbbaa;
    --secondary-color: #3f3f44;
    --background-color: #ffffff;
    --text-color: #333333;
    --border-color: #e0e0e0;
    --error-color: #ff3c74;
    --success-color: #0fbbaa;
    --warning-color: #ffa500;
}

* {
    box-sizing: border-box;
}

body {
    margin: 0;
    padding: 0;
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    font-size: 16px;
    line-height: 1.6;
    color: var(--text-color);
    background-color: var(--background-color);
}

/* Cookie Banner */
.cookie-banner {
    position: fixed;
    bottom: 0;
    left: 0;
    right: 0;
    background-color: #2c3e50;
    color: white;
    padding: 1rem;
    z-index: 1000;
    box-shadow: 0 -2px 10px rgba(0,0,0,0.1);
}

.cookie-content {
    max-width: 1200px;
    margin: 0 auto;
    display: flex;
    align-items: center;
    justify-content: space-between;
    flex-wrap: wrap;
    gap: 1rem;
}

.cookie-actions {
    display: flex;
    gap: 0.5rem;
}

/* Header */
.site-header {
    background-color: #fff;
    border-bottom: 1px solid var(--border-color);
    padding: 1rem 2rem;
    position: sticky;
    top: 0;
    z-index: 100;
    display: flex;
    justify-content: space-between;
    align-items: center;
}

.nav-menu {
    display: flex;
    gap: 2rem;
    align-items: center;
}

.nav-link {
    text-decoration: none;
    color: var(--text-color);
    font-weight: 500;
    transition: color 0.2s;
}

.nav-link:hover,
.nav-link.active {
    color: var(--primary-color);
}

/* Dropdown */
.dropdown {
    position: relative;
}

.dropdown-content {
    display: none;
    position: absolute;
    background-color: white;
    min-width: 160px;
    box-shadow: 0 8px 16px rgba(0,0,0,0.1);
    z-index: 1;
    border-radius: 4px;
    top: 100%;
    margin-top: 0.5rem;
}

.dropdown:hover .dropdown-content {
    display: block;
}

.dropdown-content a {
    color: var(--text-color);
    padding: 0.75rem 1rem;
    text-decoration: none;
    display: block;
}

.dropdown-content a:hover {
    background-color: #f5f5f5;
}

/* Auth Section */
.auth-section {
    display: flex;
    align-items: center;
    gap: 1rem;
}

.user-info {
    display: flex;
    align-items: center;
    gap: 0.5rem;
}

.user-avatar {
    font-size: 1.5rem;
}

/* Main Content */
.main-content {
    padding: 2rem;
    max-width: 1200px;
    margin: 0 auto;
}

.section {
    display: none;
}

.section.active {
    display: block;
}

/* Buttons */
.btn {
    background-color: var(--primary-color);
    color: white;
    border: none;
    padding: 0.5rem 1rem;
    border-radius: 4px;
    cursor: pointer;
    font-size: 1rem;
    font-weight: 500;
    transition: all 0.2s;
}

.btn:hover {
    background-color: #0aa599;
    transform: translateY(-1px);
}

.btn-sm {
    padding: 0.25rem 0.75rem;
    font-size: 0.875rem;
}

.btn-secondary {
    background-color: var(--secondary-color);
}

.btn-secondary:hover {
    background-color: #333;
}

.btn-primary {
    background-color: var(--primary-color);
}

/* Feature Grid */
.feature-grid {
    display: grid;
    grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
    gap: 1.5rem;
    margin-top: 2rem;
}

.feature-card {
    background-color: #f8f9fa;
    padding: 1.5rem;
    border-radius: 8px;
    text-align: center;
    transition: transform 0.2s;
}

.feature-card:hover {
    transform: translateY(-4px);
    box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}

.feature-card h3 {
    margin-top: 0;
}

/* Modal */
.modal {
    position: fixed;
    z-index: 1000;
    left: 0;
    top: 0;
    width: 100%;
    height: 100%;
    background-color: rgba(0,0,0,0.5);
    display: flex;
    align-items: center;
    justify-content: center;
}

.modal-content {
    background-color: white;
    padding: 2rem;
    border-radius: 8px;
    max-width: 500px;
    width: 90%;
    position: relative;
    animation: modalFadeIn 0.3s;
}

@keyframes modalFadeIn {
    from { opacity: 0; transform: translateY(-20px); }
    to { opacity: 1; transform: translateY(0); }
}

.close {
    position: absolute;
    right: 1rem;
    top: 1rem;
    font-size: 1.5rem;
    cursor: pointer;
    color: #999;
}

.close:hover {
    color: #333;
}

/* Forms */
.form-group {
    margin-bottom: 1rem;
}

.form-group label {
    display: block;
    margin-bottom: 0.5rem;
    font-weight: 500;
}

.input {
    width: 100%;
    padding: 0.5rem;
    border: 1px solid var(--border-color);
    border-radius: 4px;
    font-size: 1rem;
}

.input:focus {
    outline: none;
    border-color: var(--primary-color);
}

.checkbox-label {
    display: flex;
    align-items: center;
    gap: 0.5rem;
}

.form-message {
    margin-top: 1rem;
    padding: 0.75rem;
    border-radius: 4px;
    display: none;
}

.form-message.error {
    background-color: #ffe6e6;
    color: var(--error-color);
    display: block;
}

.form-message.success {
    background-color: #e6fff6;
    color: var(--success-color);
    display: block;
}

/* Product Catalog */
.view-toggle {
    margin-bottom: 1rem;
}

.catalog-layout {
    display: grid;
    grid-template-columns: 250px 1fr;
    gap: 2rem;
}

.filters-sidebar {
    background-color: #f8f9fa;
    padding: 1rem;
    border-radius: 8px;
}

.filter-group {
    margin-bottom: 1.5rem;
}

.collapsible {
    cursor: pointer;
    display: flex;
    justify-content: space-between;
    align-items: center;
}

.filter-content {
    margin-top: 0.5rem;
}

.filter-content label {
    display: block;
    margin-bottom: 0.5rem;
}

/* Product Grid */
.product-grid {
    display: grid;
    grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
    gap: 1.5rem;
}

.product-card {
    background-color: white;
    border: 1px solid var(--border-color);
    border-radius: 8px;
    padding: 1rem;
    text-align: center;
    transition: transform 0.2s;
}

.product-card:hover {
    transform: translateY(-4px);
    box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}

.product-image {
    width: 100%;
    height: 150px;
    background-color: #f0f0f0;
    margin-bottom: 1rem;
    display: flex;
    align-items: center;
    justify-content: center;
    font-size: 3rem;
}

.product-name {
    font-weight: 600;
    margin-bottom: 0.5rem;
}

.product-price {
    color: var(--primary-color);
    font-size: 1.2rem;
    font-weight: 700;
}

/* Loading Indicator */
.loading-indicator {
    text-align: center;
    padding: 2rem;
}

.spinner {
    border: 3px solid #f3f3f3;
    border-top: 3px solid var(--primary-color);
    border-radius: 50%;
    width: 40px;
    height: 40px;
    animation: spin 1s linear infinite;
    margin: 0 auto;
}

@keyframes spin {
    0% { transform: rotate(0deg); }
    100% { transform: rotate(360deg); }
}

/* Pagination */
.pagination {
    display: flex;
    gap: 0.5rem;
    justify-content: center;
    margin-top: 2rem;
}

.page-btn {
    padding: 0.5rem 1rem;
    border: 1px solid var(--border-color);
    background-color: white;
    cursor: pointer;
    border-radius: 4px;
}

.page-btn:hover,
.page-btn.active {
    background-color: var(--primary-color);
    color: white;
}

/* Multi-step Form */
.progress-bar {
    width: 100%;
    height: 8px;
    background-color: #e0e0e0;
    border-radius: 4px;
    margin-bottom: 2rem;
}

.progress-fill {
    height: 100%;
    background-color: var(--primary-color);
    border-radius: 4px;
    transition: width 0.3s;
}

.form-step {
    display: none;
}

.form-step.active {
    display: block;
}

/* Tabs */
.tabs-container {
    margin-top: 2rem;
}

.tabs-header {
    display: flex;
    border-bottom: 2px solid var(--border-color);
}

.tab-btn {
    background: none;
    border: none;
    padding: 1rem 2rem;
    cursor: pointer;
    font-size: 1rem;
    font-weight: 500;
    color: var(--text-color);
    position: relative;
}

.tab-btn:hover {
    color: var(--primary-color);
}

.tab-btn.active {
    color: var(--primary-color);
}

.tab-btn.active::after {
    content: '';
    position: absolute;
    bottom: -2px;
    left: 0;
    right: 0;
    height: 2px;
    background-color: var(--primary-color);
}

.tabs-content {
    padding: 2rem 0;
}

.tab-pane {
    display: none;
}

.tab-pane.active {
    display: block;
}

/* Expandable Text */
.expandable-text {
    margin-top: 1rem;
}

.text-preview {
    margin-bottom: 0.5rem;
}

.show-more {
    margin-top: 0.5rem;
}

/* Comments Section */
.comments-section {
    margin-top: 1rem;
}

.comment {
    background-color: #f8f9fa;
    padding: 1rem;
    border-radius: 4px;
    margin-bottom: 1rem;
}

.comment-author {
    font-weight: 600;
    margin-bottom: 0.5rem;
}

/* Data Table */
.table-controls {
    display: flex;
    gap: 1rem;
    margin-bottom: 1rem;
}

.search-input {
    flex: 1;
    max-width: 300px;
}

.data-table {
    width: 100%;
    border-collapse: collapse;
    background-color: white;
}

.data-table th,
.data-table td {
    padding: 0.75rem;
    text-align: left;
    border-bottom: 1px solid var(--border-color);
}

.data-table th {
    background-color: #f8f9fa;
    font-weight: 600;
}

.sortable {
    cursor: pointer;
}

.sortable:hover {
    color: var(--primary-color);
}

/* Form Cards */
.form-card {
    background-color: white;
    border: 1px solid var(--border-color);
    border-radius: 8px;
    padding: 2rem;
    margin-bottom: 2rem;
}

.form-card h2 {
    margin-top: 0;
}

/* Success Message */
.success-message {
    background-color: #e6fff6;
    color: var(--success-color);
    padding: 1rem;
    border-radius: 4px;
    text-align: center;
    font-weight: 500;
}

/* Load More Button */
.load-more,
.load-more-rows {
    display: block;
    margin: 2rem auto;
}

/* Responsive */
@media (max-width: 768px) {
    .catalog-layout {
        grid-template-columns: 1fr;
    }

    .feature-grid {
        grid-template-columns: 1fr;
    }

    .nav-menu {
        flex-wrap: wrap;
        gap: 1rem;
    }

    .cookie-content {
        flex-direction: column;
        text-align: center;
    }
}

/* Inspector Mode */
#inspector-btn.active {
    background: var(--primary-color) !important;
    color: var(--bg-primary) !important;
}

.inspector-tooltip {
    box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3);
    border: 1px solid rgba(255, 255, 255, 0.1);
}
2
docs/md_v2/apps/c4a-script/requirements.txt
Normal file
@@ -0,0 +1,2 @@
flask>=2.3.0
flask-cors>=4.0.0
18
docs/md_v2/apps/c4a-script/scripts/01-basic-interaction.c4a
Normal file
@@ -0,0 +1,18 @@
# Basic Page Interaction
# This script demonstrates basic C4A commands

# Navigate to the playground
GO http://127.0.0.1:8080/playground/

# Wait for page to load
WAIT `body` 2

# Handle cookie banner if present
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`

# Close newsletter popup if it appears
WAIT 3
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`

# Click the start tutorial button
CLICK `#start-tutorial`
27
docs/md_v2/apps/c4a-script/scripts/02-login-flow.c4a
Normal file
@@ -0,0 +1,27 @@
# Complete Login Flow
# Demonstrates form interaction and authentication

# Click login button
CLICK `#login-btn`

# Wait for login modal
WAIT `.login-form` 3

# Fill in credentials
CLICK `#email`
TYPE "demo@example.com"

CLICK `#password`
TYPE "demo123"

# Check remember me
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`

# Submit form
CLICK `button[type="submit"]`

# Wait for success
WAIT `.welcome-message` 5

# Verify login succeeded
IF (EXISTS `.user-info`) THEN EVAL `console.log('✅ Login successful!')`
32
docs/md_v2/apps/c4a-script/scripts/03-infinite-scroll.c4a
Normal file
@@ -0,0 +1,32 @@
# Infinite Scroll Product Loading
# Load all products using scroll automation

# Navigate to catalog
CLICK `#catalog-link`
WAIT `.product-grid` 3

# Switch to infinite scroll mode
CLICK `#infinite-scroll-btn`

# Define scroll procedure
PROC load_more_products
  # Get current product count
  EVAL `window.initialCount = document.querySelectorAll('.product-card').length`

  # Scroll down
  SCROLL DOWN 1000
  WAIT 2

  # Check if more products loaded
  EVAL `
    const newCount = document.querySelectorAll('.product-card').length;
    console.log('Products loaded: ' + newCount);
    window.moreLoaded = newCount > window.initialCount;
  `
ENDPROC

# Load products until no more
REPEAT (load_more_products, `window.moreLoaded !== false`)

# Final count
EVAL `console.log('✅ Total products: ' + document.querySelectorAll('.product-card').length)`
41
docs/md_v2/apps/c4a-script/scripts/04-multi-step-form.c4a
Normal file
@@ -0,0 +1,41 @@
# Multi-step Form Wizard
# Complete a complex form with multiple steps

# Navigate to forms section
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2

# Step 1: Basic Information
CLICK `#full-name`
TYPE "John Doe"

CLICK `#survey-email`
TYPE "john.doe@example.com"

# Go to next step
CLICK `.next-step`
WAIT 1

# Step 2: Select Interests
# Select multiple options
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `option[value="travel"]`

# Continue to final step
CLICK `.next-step`
WAIT 1

# Step 3: Review and Submit
# Verify we're on the last step
IF (EXISTS `#submit-survey`) THEN EVAL `console.log('📋 On final step')`

# Submit the form
CLICK `#submit-survey`

# Wait for success message
WAIT `.success-message` 5

# Verify submission
IF (EXISTS `.success-message`) THEN EVAL `console.log('✅ Survey submitted successfully!')`
82
docs/md_v2/apps/c4a-script/scripts/05-complex-workflow.c4a
Normal file
@@ -0,0 +1,82 @@
# Complete E-commerce Workflow
# Login, browse products, and interact with various elements

# Define reusable procedures
PROC handle_popups
  IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
  IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`
ENDPROC

PROC login_user
  CLICK `#login-btn`
  WAIT `.login-form` 2
  CLICK `#email`
  TYPE "demo@example.com"
  CLICK `#password`
  TYPE "demo123"
  CLICK `button[type="submit"]`
  WAIT `.welcome-message` 5
ENDPROC

PROC browse_products
  # Go to catalog
  CLICK `#catalog-link`
  WAIT `.product-grid` 3

  # Apply filters
  CLICK `.collapsible`
  WAIT 0.5
  CLICK `input[type="checkbox"]`

  # Load some products
  SCROLL DOWN 500
  WAIT 1
  SCROLL DOWN 500
  WAIT 1
ENDPROC

# Main workflow
GO http://127.0.0.1:8080/playground/
WAIT `body` 2

# Handle initial popups
handle_popups

# Login if not already
IF (NOT EXISTS `.user-info`) THEN login_user

# Browse products
browse_products

# Navigate to tabs demo
CLICK `a[href="#tabs"]`
WAIT `.tabs-container` 2

# Interact with tabs
CLICK `button[data-tab="reviews"]`
WAIT 1

# Load comments
IF (EXISTS `.load-comments`) THEN CLICK `.load-comments`
WAIT `.comments-section` 2

# Check specifications
CLICK `button[data-tab="specs"]`
WAIT 1

# Final navigation to data tables
CLICK `a[href="#data"]`
WAIT `.data-table` 2

# Search in table
CLICK `.search-input`
TYPE "User"

# Load more rows
CLICK `.load-more-rows`
WAIT 1

# Export data
CLICK `#export-btn`

EVAL `console.log('✅ Workflow completed successfully!')`
304
docs/md_v2/apps/c4a-script/server.py
Normal file
@@ -0,0 +1,304 @@
#!/usr/bin/env python3
"""
C4A-Script Tutorial Server
Serves the tutorial app and provides C4A compilation API
"""

import sys
import os
from pathlib import Path
from flask import Flask, render_template_string, request, jsonify, send_from_directory
from flask_cors import CORS

# Add parent directories to path to import crawl4ai
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent.parent))

try:
    from crawl4ai.script import compile as c4a_compile
    C4A_AVAILABLE = True
except ImportError:
    print("⚠️ C4A compiler not available. Using mock compiler.")
    C4A_AVAILABLE = False

app = Flask(__name__)
CORS(app)

# Serve static files
@app.route('/')
def index():
    return send_from_directory('.', 'index.html')

@app.route('/assets/<path:path>')
def serve_assets(path):
    return send_from_directory('assets', path)

@app.route('/playground/')
def playground():
    return send_from_directory('playground', 'index.html')

@app.route('/playground/<path:path>')
def serve_playground(path):
    return send_from_directory('playground', path)

# API endpoint for C4A compilation
@app.route('/api/compile', methods=['POST'])
def compile_endpoint():
    try:
        data = request.get_json()
        script = data.get('script', '')

        if not script:
            return jsonify({
                'success': False,
                'error': {
                    'line': 1,
                    'column': 1,
                    'message': 'No script provided',
                    'suggestion': 'Write some C4A commands'
                }
            })

        if C4A_AVAILABLE:
            # Use real C4A compiler
            result = c4a_compile(script)

            if result.success:
                return jsonify({
                    'success': True,
                    'jsCode': result.js_code,
                    'metadata': {
                        'lineCount': len(result.js_code),
                        'sourceLines': len(script.split('\n'))
                    }
                })
            else:
                error = result.first_error
                return jsonify({
                    'success': False,
                    'error': {
                        'line': error.line,
                        'column': error.column,
                        'message': error.message,
                        'suggestion': error.suggestions[0].message if error.suggestions else None,
                        'code': error.code,
                        'sourceLine': error.source_line
                    }
                })
        else:
            # Use mock compiler for demo
            result = mock_compile(script)
            return jsonify(result)

    except Exception as e:
        return jsonify({
            'success': False,
            'error': {
                'line': 1,
                'column': 1,
                'message': f'Server error: {str(e)}',
                'suggestion': 'Check server logs'
            }
        }), 500

def mock_compile(script):
    """Simple mock compiler for demo when C4A is not available"""
    lines = [line for line in script.split('\n') if line.strip() and not line.strip().startswith('#')]
    js_code = []

    for i, line in enumerate(lines):
        line = line.strip()

        try:
            if line.startswith('GO '):
                url = line[3:].strip()
                # Handle relative URLs
                if not url.startswith(('http://', 'https://')):
                    url = '/' + url.lstrip('/')
                js_code.append(f"await page.goto('{url}');")

            elif line.startswith('WAIT '):
                parts = line[5:].strip().split(' ')
                if parts[0].startswith('`'):
                    selector = parts[0].strip('`')
                    timeout = parts[1] if len(parts) > 1 else '5'
                    js_code.append(f"await page.waitForSelector('{selector}', {{ timeout: {timeout}000 }});")
                else:
                    seconds = parts[0]
                    js_code.append(f"await page.waitForTimeout({seconds}000);")

            elif line.startswith('CLICK '):
                selector = line[6:].strip().strip('`')
                js_code.append(f"await page.click('{selector}');")

            elif line.startswith('TYPE '):
                text = line[5:].strip().strip('"')
                js_code.append(f"await page.keyboard.type('{text}');")

            elif line.startswith('SCROLL '):
                parts = line[7:].strip().split(' ')
                direction = parts[0]
                amount = parts[1] if len(parts) > 1 else '500'
                if direction == 'DOWN':
                    js_code.append(f"await page.evaluate(() => window.scrollBy(0, {amount}));")
                elif direction == 'UP':
                    js_code.append(f"await page.evaluate(() => window.scrollBy(0, -{amount}));")

            elif line.startswith('IF '):
                if 'THEN' not in line:
                    return {
                        'success': False,
                        'error': {
                            'line': i + 1,
                            'column': len(line),
                            'message': "Missing 'THEN' keyword after IF condition",
                            'suggestion': "Add 'THEN' after the condition",
                            'sourceLine': line
                        }
                    }

                condition = line[3:line.index('THEN')].strip()
                action = line[line.index('THEN') + 4:].strip()

                if 'EXISTS' in condition:
                    selector_match = condition.split('`')
                    if len(selector_match) >= 2:
                        selector = selector_match[1]
                        action_selector = action.split('`')[1] if '`' in action else ''
                        js_code.append(
                            f"if (await page.$$('{selector}').length > 0) {{ "
                            f"await page.click('{action_selector}'); }}"
                        )

            elif line.startswith('PRESS '):
                key = line[6:].strip()
                js_code.append(f"await page.keyboard.press('{key}');")

            else:
                # Unknown command
                return {
                    'success': False,
                    'error': {
                        'line': i + 1,
                        'column': 1,
                        'message': f"Unknown command: {line.split()[0]}",
                        'suggestion': "Check command syntax",
                        'sourceLine': line
                    }
                }

        except Exception as e:
            return {
                'success': False,
                'error': {
                    'line': i + 1,
                    'column': 1,
                    'message': f"Failed to parse: {str(e)}",
                    'suggestion': "Check syntax",
                    'sourceLine': line
                }
            }

    return {
        'success': True,
        'jsCode': js_code,
        'metadata': {
            'lineCount': len(js_code),
            'sourceLines': len(lines)
        }
    }

# Example scripts endpoint
@app.route('/api/examples')
def get_examples():
    examples = [
        {
            'id': 'cookie-banner',
            'name': 'Handle Cookie Banner',
            'description': 'Accept cookies and close newsletter popup',
            'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''
        },
        {
            'id': 'login',
            'name': 'Login Flow',
            'description': 'Complete login with credentials',
            'script': '''# Login to the site
CLICK `#login-btn`
WAIT `.login-form` 2
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`
CLICK `button[type="submit"]`
WAIT `.welcome-message` 5'''
        },
        {
            'id': 'infinite-scroll',
            'name': 'Infinite Scroll',
            'description': 'Load products with scrolling',
            'script': '''# Navigate to catalog and scroll
CLICK `#catalog-link`
WAIT `.product-grid` 3

# Scroll multiple times to load products
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000'''
        },
        {
            'id': 'form-wizard',
            'name': 'Multi-step Form',
            'description': 'Complete a multi-step survey',
            'script': '''# Navigate to forms
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2

# Step 1: Basic info
CLICK `#full-name`
TYPE "John Doe"
CLICK `#survey-email`
TYPE "john@example.com"
CLICK `.next-step`
WAIT 1

# Step 2: Preferences
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `.next-step`
WAIT 1

# Step 3: Submit
CLICK `#submit-survey`
WAIT `.success-message` 5'''
        }
    ]

    return jsonify(examples)

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 8080))
    print(f"""
╔══════════════════════════════════════════════════════════╗
║          C4A-Script Interactive Tutorial Server          ║
╠══════════════════════════════════════════════════════════╣
║                                                          ║
║  Server running at: http://localhost:{port:<6}             ║
║                                                          ║
║  Features:                                               ║
║  • C4A-Script compilation API                            ║
║  • Interactive playground                                ║
║  • Real-time execution visualization                     ║
║                                                          ║
║  C4A Compiler: {'✓ Available' if C4A_AVAILABLE else '✗ Using mock compiler':<30}  ║
║                                                          ║
╚══════════════════════════════════════════════════════════╝
""")

    app.run(host='0.0.0.0', port=port, debug=True)
69
docs/md_v2/apps/c4a-script/test_blockly.html
Normal file
@@ -0,0 +1,69 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Blockly Test</title>
    <style>
        body {
            margin: 0;
            padding: 20px;
            background: #0e0e10;
            color: #e0e0e0;
            font-family: monospace;
        }
        #blocklyDiv {
            height: 600px;
            width: 100%;
            border: 1px solid #2a2a2c;
        }
        #output {
            margin-top: 20px;
            padding: 15px;
            background: #1a1a1b;
            border: 1px solid #2a2a2c;
            white-space: pre-wrap;
        }
    </style>
</head>
<body>
    <h1>C4A-Script Blockly Test</h1>
    <div id="blocklyDiv"></div>
    <div id="output">
        <h3>Generated C4A-Script:</h3>
        <pre id="code-output"></pre>
    </div>

    <script src="https://unpkg.com/blockly/blockly.min.js"></script>
    <script src="assets/c4a-blocks.js"></script>
    <script>
        // Simple test
        const workspace = Blockly.inject('blocklyDiv', {
            toolbox: `
                <xml>
                    <category name="Test" colour="#1E88E5">
                        <block type="c4a_go"></block>
                        <block type="c4a_wait_time"></block>
                        <block type="c4a_click"></block>
                    </category>
                </xml>
            `,
            theme: Blockly.Theme.defineTheme('dark', {
                'base': Blockly.Themes.Classic,
                'componentStyles': {
                    'workspaceBackgroundColour': '#0e0e10',
                    'toolboxBackgroundColour': '#1a1a1b',
                    'toolboxForegroundColour': '#e0e0e0',
                    'flyoutBackgroundColour': '#1a1a1b',
                    'flyoutForegroundColour': '#e0e0e0',
                }
            })
        });

        workspace.addChangeListener((event) => {
            const code = Blockly.JavaScript.workspaceToCode(workspace);
            document.getElementById('code-output').textContent = code;
        });
    </script>
</body>
</html>
124
docs/md_v2/apps/crawl4ai-assistant/README.md
Normal file
@@ -0,0 +1,124 @@
# Crawl4AI Chrome Extension

Visual extraction tools for Crawl4AI - Click to extract data and content from any webpage!

## 🚀 Features

- **Click2Crawl**: Click on elements to build data extraction schemas instantly
- **Markdown Extraction**: Select elements and export as clean markdown
- **Script Builder (Alpha)**: Record browser actions to create automation scripts
- **Smart Element Selection**: Container and field selection with visual feedback
- **Code Generation**: Generates complete Python code for Crawl4AI
- **Beautiful Dark UI**: Consistent with Crawl4AI's design language

## 📦 Installation

### Method 1: Load Unpacked Extension (Recommended for Development)

1. Open Chrome and navigate to `chrome://extensions/`
2. Enable "Developer mode" in the top right corner
3. Click "Load unpacked"
4. Select the `crawl4ai-assistant` folder
5. The extension icon (🚀🤖) will appear in your toolbar

### Method 2: Generate Icons First

If you want proper icons:

1. Open `icons/generate_icons.html` in your browser
2. Right-click each canvas and save as:
   - `icon-16.png`
   - `icon-48.png`
   - `icon-128.png`
3. Then follow Method 1 above

## 🎯 How to Use

### Using Click2Crawl

1. **Navigate to any website** you want to extract data from
2. **Click the Crawl4AI extension icon** in your toolbar
3. **Click "Click2Crawl"** to start the capture mode
4. **Select a container element**:
   - Hover over elements (they'll highlight in blue)
   - Click on a repeating container (e.g., product card, article block)
5. **Select fields within the container**:
   - Elements will now highlight in green
   - Click on each piece of data you want to extract
   - Name each field (e.g., "title", "price", "description")
6. **Test and Export**:
   - Click "Test Schema" to see extracted data instantly
   - Export as Python code, JSON schema, or markdown

### Running the Generated Code

The downloaded Python file contains:

```python
# 1. The HTML snippet of your selected container
HTML_SNIPPET = """..."""

# 2. The extraction query based on your selections
EXTRACTION_QUERY = """..."""

# 3. Functions to generate and test the schema
async def generate_schema():
    # Generates the extraction schema using LLM
    ...

async def test_extraction():
    # Tests the schema on the actual website
    ...
```

To use it:

1. Install Crawl4AI: `pip install crawl4ai`
2. Run the script: `python crawl4ai_schema_*.py`
3. The script will generate a `generated_schema.json` file
4. Use this schema in your Crawl4AI projects!
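As a quick illustration of step 4, the generated schema is plain JSON that you can load and sanity-check before wiring it into a crawl. The shape below (a `baseSelector` plus a list of `fields`) mirrors Crawl4AI's JSON-CSS schema format, but the concrete field names are hypothetical stand-ins for whatever you selected:

```python
import json

# Hypothetical contents of generated_schema.json -- the actual names
# and selectors depend on the elements you clicked during capture.
schema_text = '''
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"}
  ]
}
'''

def load_schema(text):
    """Parse a schema file and check the parts a JSON-CSS extraction expects."""
    schema = json.loads(text)
    assert schema.get("baseSelector"), "schema needs a baseSelector"
    assert schema.get("fields"), "schema needs at least one field"
    return schema

schema = load_schema(schema_text)
print(sorted(f["name"] for f in schema["fields"]))  # ['price', 'title']
```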

## 🎨 Visual Feedback

- **Blue dashed outline**: Container selection mode
- **Green dashed outline**: Field selection mode
- **Solid blue outline**: Selected container
- **Solid green outline**: Selected fields
- **Floating toolbar**: Shows current mode and selection status

## ⌨️ Keyboard Shortcuts

- **ESC**: Cancel current capture session

## 🔧 Technical Details

- Built with Manifest V3 for security and performance
- Pure client-side - no data sent to external servers
- Generates code that uses Crawl4AI's LLM integration
- Smart selector generation prioritizes stable attributes
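"Stable attributes" means preferring identifiers that survive page redesigns over volatile ones. The sketch below is illustrative only (not the extension's actual implementation, which lives in the JS content scripts): ids first, then `data-*` attributes, then a class, falling back to the bare tag name:

```python
# Illustrative priority order for building a CSS selector from an
# element's attributes: id > data-* > class > tag name.
def pick_selector(tag, attrs):
    if attrs.get("id"):
        return f"#{attrs['id']}"
    data_attrs = {k: v for k, v in attrs.items() if k.startswith("data-")}
    if data_attrs:
        k, v = sorted(data_attrs.items())[0]
        return f'{tag}[{k}="{v}"]'
    if attrs.get("class"):
        first_class = attrs["class"].split()[0]
        return f"{tag}.{first_class}"
    return tag

print(pick_selector("div", {"id": "main"}))            # '#main'
print(pick_selector("a", {"data-sku": "42"}))          # 'a[data-sku="42"]'
print(pick_selector("li", {"class": "card featured"})) # 'li.card'
```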

## 🐛 Troubleshooting

### Extension doesn't load

- Make sure you're in Developer Mode
- Check the console for any errors
- Ensure all files are in the correct directories

### Can't select elements

- Some websites may block extensions
- Try refreshing the page
- Make sure you clicked "Schema Builder" first

### Generated code doesn't work

- Ensure you have Crawl4AI installed
- Check that you have an LLM API key configured
- Make sure the website structure hasn't changed

## 🤝 Contributing

This extension is part of the Crawl4AI project. Contributions are welcome!

- Report issues: [GitHub Issues](https://github.com/unclecode/crawl4ai/issues)
- Join discussion: [Discord](https://discord.gg/crawl4ai)

## 📄 License

Same as Crawl4AI - see main project for details.
BIN
docs/md_v2/apps/crawl4ai-assistant/assets/DankMono-Bold.woff2
Normal file
BIN
docs/md_v2/apps/crawl4ai-assistant/assets/DankMono-Italic.woff2
Normal file
BIN
docs/md_v2/apps/crawl4ai-assistant/assets/DankMono-Regular.woff2
Normal file
1551
docs/md_v2/apps/crawl4ai-assistant/assistant.css
Normal file
@@ -0,0 +1,39 @@
// Service worker for Crawl4AI Assistant

// Handle messages from content script
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.action === 'downloadCode' || message.action === 'downloadScript') {
    try {
      // Create a data URL for the Python code
      const dataUrl = 'data:text/plain;charset=utf-8,' + encodeURIComponent(message.code);

      // Download the file
      chrome.downloads.download({
        url: dataUrl,
        filename: message.filename || 'crawl4ai_schema.py',
        saveAs: true
      }, (downloadId) => {
        if (chrome.runtime.lastError) {
          console.error('Download failed:', chrome.runtime.lastError);
          sendResponse({ success: false, error: chrome.runtime.lastError.message });
        } else {
          console.log('Download started with ID:', downloadId);
          sendResponse({ success: true, downloadId: downloadId });
        }
      });
    } catch (error) {
      console.error('Error creating download:', error);
      sendResponse({ success: false, error: error.message });
    }

    return true; // Keep the message channel open for async response
  }

  return false;
});

// Clean up on extension install/update
chrome.runtime.onInstalled.addListener(() => {
  // Clear any stored state
  chrome.storage.local.clear();
});
1968
docs/md_v2/apps/crawl4ai-assistant/content/click2crawl.js
Normal file
78
docs/md_v2/apps/crawl4ai-assistant/content/content.js
Normal file
@@ -0,0 +1,78 @@
// Main content script for Crawl4AI Assistant
// Coordinates between Click2Crawl, ScriptBuilder, and MarkdownExtraction

let activeBuilder = null;

// Listen for messages from popup
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'startCapture') {
    if (activeBuilder) {
      console.log('Stopping existing capture session');
      activeBuilder.stop();
      activeBuilder = null;
    }

    if (request.mode === 'schema') {
      console.log('Starting Click2Crawl');
      activeBuilder = new Click2Crawl();
      activeBuilder.start();
    } else if (request.mode === 'script') {
      console.log('Starting Script Builder');
      activeBuilder = new ScriptBuilder();
      activeBuilder.start();
    }

    sendResponse({ success: true });
  } else if (request.action === 'stopCapture') {
    if (activeBuilder) {
      activeBuilder.stop();
      activeBuilder = null;
    }
    sendResponse({ success: true });
  } else if (request.action === 'startSchemaCapture') {
    if (activeBuilder) {
      activeBuilder.deactivate?.();
      activeBuilder = null;
    }
    console.log('Starting Click2Crawl');
    activeBuilder = new Click2Crawl();
    activeBuilder.start();
    sendResponse({ success: true });
  } else if (request.action === 'startScriptCapture') {
    if (activeBuilder) {
      activeBuilder.deactivate?.();
      activeBuilder = null;
    }
    console.log('Starting Script Builder');
    activeBuilder = new ScriptBuilder();
    activeBuilder.start();
    sendResponse({ success: true });
  } else if (request.action === 'startClick2Crawl') {
    if (activeBuilder) {
      activeBuilder.deactivate?.();
      activeBuilder = null;
    }
    console.log('Starting Markdown Extraction');
    activeBuilder = new MarkdownExtraction();
    sendResponse({ success: true });
  } else if (request.action === 'generateCode') {
    if (activeBuilder && activeBuilder.generateCode) {
      activeBuilder.generateCode();
    }
    sendResponse({ success: true });
  }
});

// Cleanup on page unload
window.addEventListener('beforeunload', () => {
  if (activeBuilder) {
    if (activeBuilder.deactivate) {
      activeBuilder.deactivate();
    } else if (activeBuilder.stop) {
      activeBuilder.stop();
    }
    activeBuilder = null;
  }
});

console.log('Crawl4AI Assistant content script loaded');
623
docs/md_v2/apps/crawl4ai-assistant/content/contentAnalyzer.js
Normal file
@@ -0,0 +1,623 @@
|
||||
class ContentAnalyzer {
|
||||
constructor() {
|
||||
this.patterns = {
|
||||
article: ['article', 'main', 'content', 'post', 'entry'],
|
||||
navigation: ['nav', 'menu', 'navigation', 'breadcrumb'],
|
||||
sidebar: ['sidebar', 'aside', 'widget'],
|
||||
header: ['header', 'masthead', 'banner'],
|
||||
footer: ['footer', 'copyright', 'contact'],
|
||||
list: ['list', 'items', 'results', 'products', 'cards'],
|
||||
table: ['table', 'grid', 'data'],
|
||||
media: ['gallery', 'carousel', 'slideshow', 'video', 'media']
|
||||
};
|
||||
}
|
||||
|
||||
async analyze(elements) {
|
||||
const analysis = {
|
||||
structure: await this.analyzeStructure(elements),
|
||||
contentType: this.identifyContentType(elements),
|
||||
hierarchy: this.buildHierarchy(elements),
|
||||
mediaAssets: this.collectMediaAssets(elements),
|
||||
textDensity: this.calculateTextDensity(elements),
|
||||
semanticRegions: this.identifySemanticRegions(elements),
|
||||
relationships: this.analyzeRelationships(elements),
|
||||
metadata: this.extractMetadata(elements)
|
||||
};
|
||||
|
||||
return analysis;
|
||||
}
|
||||
|
||||
analyzeStructure(elements) {
|
||||
const structure = {
|
||||
hasHeadings: false,
|
||||
hasLists: false,
|
||||
hasTables: false,
|
||||
hasMedia: false,
|
||||
hasCode: false,
|
||||
hasLinks: false,
|
||||
layout: 'linear', // linear, grid, mixed
|
||||
depth: 0,
|
||||
elementTypes: new Map()
|
||||
};
|
||||
|
||||
// Analyze each element
|
||||
for (const element of elements) {
|
||||
this.analyzeElementStructure(element, structure);
|
||||
}
|
||||
|
||||
// Determine layout type
|
||||
structure.layout = this.determineLayout(elements);
|
||||
|
||||
// Calculate max depth
|
||||
structure.depth = this.calculateMaxDepth(elements);
|
||||
|
||||
return structure;
|
||||
}
|
||||
|
||||
analyzeElementStructure(element, structure, visited = new Set()) {
|
||||
if (visited.has(element)) return;
|
||||
visited.add(element);
|
||||
|
||||
const tagName = element.tagName;
|
||||
|
||||
// Update element type count
|
||||
structure.elementTypes.set(
|
||||
tagName,
|
||||
(structure.elementTypes.get(tagName) || 0) + 1
|
||||
);
|
||||
|
||||
// Check for specific structures
|
||||
if (/^H[1-6]$/.test(tagName)) {
|
||||
structure.hasHeadings = true;
|
||||
} else if (['UL', 'OL', 'DL'].includes(tagName)) {
|
||||
structure.hasLists = true;
|
||||
} else if (tagName === 'TABLE') {
|
||||
structure.hasTables = true;
|
||||
} else if (['IMG', 'VIDEO', 'IFRAME', 'PICTURE'].includes(tagName)) {
|
||||
structure.hasMedia = true;
|
||||
} else if (['CODE', 'PRE'].includes(tagName)) {
|
||||
structure.hasCode = true;
|
||||
} else if (tagName === 'A') {
|
||||
structure.hasLinks = true;
|
||||
}
|
||||
|
||||
// Analyze children
|
||||
for (const child of element.children) {
|
||||
this.analyzeElementStructure(child, structure, visited);
|
||||
}
|
||||
}
|
||||
|
||||
identifyContentType(elements) {
|
||||
const scores = {
|
||||
article: 0,
|
||||
list: 0,
|
||||
table: 0,
|
||||
form: 0,
|
||||
media: 0,
|
||||
mixed: 0
|
||||
};
|
||||
|
||||
for (const element of elements) {
|
||||
// Score based on element types and classes
|
||||
const tagName = element.tagName;
|
||||
const className = element.className.toLowerCase();
|
||||
const id = element.id.toLowerCase();
|
||||
|
||||
// Check for article patterns
|
||||
if (tagName === 'ARTICLE' ||
|
||||
this.matchesPattern(className + ' ' + id, this.patterns.article)) {
|
||||
scores.article += 10;
|
||||
}
|
||||
|
||||
// Check for list patterns
|
||||
if (['UL', 'OL'].includes(tagName) ||
|
||||
this.matchesPattern(className, this.patterns.list)) {
|
||||
scores.list += 5;
|
||||
}
|
||||
|
||||
// Check for table
|
||||
if (tagName === 'TABLE') {
|
||||
scores.table += 10;
|
||||
}
|
||||
|
||||
// Check for form
|
||||
if (tagName === 'FORM' || element.querySelector('input, select, textarea')) {
|
||||
scores.form += 5;
|
||||
}
|
||||
|
||||
// Check for media gallery
|
||||
if (this.matchesPattern(className, this.patterns.media) ||
|
||||
element.querySelectorAll('img, video').length > 3) {
|
||||
scores.media += 5;
|
||||
}
|
||||
}
|
||||
|
||||
// Determine primary content type
|
||||
const maxScore = Math.max(...Object.values(scores));
|
||||
if (maxScore === 0) return 'unknown';
|
||||
|
||||
for (const [type, score] of Object.entries(scores)) {
|
||||
if (score === maxScore) {
|
||||
return type;
|
||||
}
|
||||
}
|
||||
|
||||
return 'mixed';
|
||||
}
|
||||
|
||||
buildHierarchy(elements) {
|
||||
const hierarchy = {
|
||||
root: null,
|
||||
levels: [],
|
||||
headingStructure: []
|
||||
};
|
||||
|
||||
// Find common ancestor
|
||||
if (elements.length > 0) {
|
||||
hierarchy.root = this.findCommonAncestor(elements);
|
||||
}
|
||||
|
||||
// Build heading hierarchy
|
||||
const headings = [];
|
||||
for (const element of elements) {
|
||||
const foundHeadings = element.querySelectorAll('h1, h2, h3, h4, h5, h6');
|
||||
headings.push(...Array.from(foundHeadings));
|
||||
}
|
||||
|
||||
// Sort headings by document position
|
||||
headings.sort((a, b) => {
|
||||
const position = a.compareDocumentPosition(b);
|
||||
if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
|
||||
return -1;
|
||||
} else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
|
||||
return 1;
|
||||
}
|
||||
return 0;
|
||||
});
|
||||
|
||||
// Build heading structure
|
||||
let currentLevel = 0;
|
||||
const stack = [];
|
||||
|
||||
for (const heading of headings) {
|
||||
const level = parseInt(heading.tagName.substring(1));
|
||||
const item = {
|
||||
level,
|
||||
text: heading.textContent.trim(),
|
||||
element: heading,
|
||||
children: []
|
||||
};
|
||||
|
||||
// Find parent in stack
|
||||
while (stack.length > 0 && stack[stack.length - 1].level >= level) {
|
||||
stack.pop();
|
||||
}
|
||||
|
||||
if (stack.length > 0) {
|
||||
stack[stack.length - 1].children.push(item);
|
||||
} else {
|
||||
hierarchy.headingStructure.push(item);
|
||||
}
|
||||
|
||||
stack.push(item);
|
||||
}
|
||||
|
||||
return hierarchy;
|
||||
}
|
||||
|
||||
collectMediaAssets(elements) {
|
||||
const media = {
|
||||
images: [],
|
||||
videos: [],
|
||||
iframes: [],
|
||||
audio: []
|
||||
};
|
||||
|
||||
for (const element of elements) {
|
||||
// Collect images
|
||||
const images = element.querySelectorAll('img');
|
||||
for (const img of images) {
|
||||
media.images.push({
|
||||
src: img.src,
|
||||
alt: img.alt,
|
||||
title: img.title,
|
||||
width: img.width,
|
||||
height: img.height,
|
||||
element: img
|
||||
});
|
||||
}
|
||||
|
||||
// Collect videos
|
||||
const videos = element.querySelectorAll('video');
|
||||
for (const video of videos) {
|
||||
media.videos.push({
|
||||
src: video.src,
|
||||
poster: video.poster,
|
||||
width: video.width,
|
||||
height: video.height,
|
||||
element: video
|
||||
});
|
||||
}
|
||||
|
||||
// Collect iframes
|
||||
const iframes = element.querySelectorAll('iframe');
|
||||
for (const iframe of iframes) {
|
||||
media.iframes.push({
|
||||
src: iframe.src,
|
||||
width: iframe.width,
|
||||
height: iframe.height,
|
||||
title: iframe.title,
|
||||
element: iframe
|
||||
});
|
||||
}
|
||||
|
||||
// Collect audio
|
||||
const audios = element.querySelectorAll('audio');
|
||||
for (const audio of audios) {
|
||||
media.audio.push({
|
||||
src: audio.src,
|
||||
element: audio
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
return media;
|
||||
}
|
||||
|
||||
calculateTextDensity(elements) {
|
||||
let totalText = 0;
|
||||
let totalElements = 0;
|
||||
let linkText = 0;
|
||||
let codeText = 0;
|
||||
|
||||
for (const element of elements) {
|
||||
const stats = this.getTextStats(element);
|
||||
totalText += stats.textLength;
|
||||
totalElements += stats.elementCount;
|
||||
linkText += stats.linkTextLength;
|
||||
codeText += stats.codeTextLength;
|
||||
}
|
||||
|
||||
return {
|
||||
textLength: totalText,
|
||||
elementCount: totalElements,
|
||||
averageTextPerElement: totalElements > 0 ? totalText / totalElements : 0,
|
||||
linkDensity: totalText > 0 ? linkText / totalText : 0,
|
||||
codeDensity: totalText > 0 ? codeText / totalText : 0
|
||||
};
|
||||
}
|
||||
|
||||
getTextStats(element, visited = new Set()) {
|
||||
if (visited.has(element)) {
|
||||
return { textLength: 0, elementCount: 0, linkTextLength: 0, codeTextLength: 0 };
|
||||
}
|
||||
visited.add(element);
|
||||
|
||||
let stats = {
|
||||
textLength: 0,
|
||||
elementCount: 1,
|
||||
linkTextLength: 0,
|
||||
codeTextLength: 0
|
||||
};
|
||||
|
||||
// Get direct text content
|
||||
for (const node of element.childNodes) {
|
||||
if (node.nodeType === Node.TEXT_NODE) {
|
||||
const text = node.textContent.trim();
|
||||
stats.textLength += text.length;
|
||||
|
||||
// Check if this text is within a link
|
||||
if (element.tagName === 'A') {
|
||||
stats.linkTextLength += text.length;
|
||||
}
|
||||
|
||||
// Check if this text is within code
|
||||
if (['CODE', 'PRE'].includes(element.tagName)) {
|
||||
stats.codeTextLength += text.length;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Process children
|
||||
for (const child of element.children) {
|
||||
const childStats = this.getTextStats(child, visited);
|
||||
stats.textLength += childStats.textLength;
|
||||
stats.elementCount += childStats.elementCount;
|
||||
stats.linkTextLength += childStats.linkTextLength;
|
||||
stats.codeTextLength += childStats.codeTextLength;
|
||||
}
|
||||
|
||||
return stats;
|
||||
}
|
||||
|
||||
identifySemanticRegions(elements) {
|
||||
const regions = {
|
||||
headers: [],
|
||||
navigation: [],
|
||||
main: [],
|
||||
sidebars: [],
|
||||
footers: [],
|
||||
articles: []
|
||||
};
|
||||
|
||||
for (const element of elements) {
|
||||
// Check element and its ancestors for semantic regions
|
||||
let current = element;
|
||||
while (current) {
|
||||
const tagName = current.tagName;
|
||||
const className = current.className.toLowerCase();
|
||||
const role = current.getAttribute('role');
|
||||
|
||||
// Check semantic HTML5 elements
|
||||
if (tagName === 'HEADER' || role === 'banner') {
|
||||
regions.headers.push(current);
|
||||
} else if (tagName === 'NAV' || role === 'navigation') {
|
||||
regions.navigation.push(current);
|
||||
} else if (tagName === 'MAIN' || role === 'main') {
|
||||
regions.main.push(current);
|
||||
} else if (tagName === 'ASIDE' || role === 'complementary') {
|
||||
regions.sidebars.push(current);
|
||||
} else if (tagName === 'FOOTER' || role === 'contentinfo') {
|
||||
regions.footers.push(current);
|
||||
} else if (tagName === 'ARTICLE' || role === 'article') {
|
||||
regions.articles.push(current);
|
||||
}
|
||||
|
||||
// Check class patterns
|
||||
if (this.matchesPattern(className, this.patterns.header)) {
|
||||
regions.headers.push(current);
|
||||
} else if (this.matchesPattern(className, this.patterns.navigation)) {
|
||||
regions.navigation.push(current);
|
||||
} else if (this.matchesPattern(className, this.patterns.sidebar)) {
|
||||
regions.sidebars.push(current);
|
||||
} else if (this.matchesPattern(className, this.patterns.footer)) {
|
||||
regions.footers.push(current);
|
||||
}
|
||||
|
||||
current = current.parentElement;
|
||||
}
|
||||
}
|
||||
|
||||
// Deduplicate
|
||||
for (const key of Object.keys(regions)) {
|
||||
regions[key] = Array.from(new Set(regions[key]));
|
||||
}
|
||||
|
||||
return regions;
|
||||
}
|
||||
|
||||
analyzeRelationships(elements) {
|
||||
const relationships = {
|
||||
siblings: [],
|
||||
parents: [],
|
||||
children: [],
|
||||
relatedByClass: new Map(),
|
||||
relatedByStructure: []
|
||||
};
|
||||
|
||||
// Find sibling relationships
|
||||
for (let i = 0; i < elements.length; i++) {
|
||||
for (let j = i + 1; j < elements.length; j++) {
|
||||
if (elements[i].parentElement === elements[j].parentElement) {
|
||||
relationships.siblings.push([elements[i], elements[j]]);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Find parent-child relationships
|
||||
for (const element of elements) {
|
||||
for (const other of elements) {
|
||||
if (element !== other) {
|
||||
if (element.contains(other)) {
|
||||
relationships.parents.push({ parent: element, child: other });
|
||||
} else if (other.contains(element)) {
|
||||
relationships.children.push({ parent: other, child: element });
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Group by similar classes
|
||||
for (const element of elements) {
|
||||
const classes = Array.from(element.classList);
|
||||
for (const className of classes) {
|
||||
if (!relationships.relatedByClass.has(className)) {
|
||||
relationships.relatedByClass.set(className, []);
|
||||
}
|
||||
relationships.relatedByClass.get(className).push(element);
|
||||
}
|
||||
}
|
||||
|
||||
// Find structurally similar elements
|
||||
for (let i = 0; i < elements.length; i++) {
|
||||
for (let j = i + 1; j < elements.length; j++) {
|
||||
if (this.areStructurallySimilar(elements[i], elements[j])) {
|
||||
relationships.relatedByStructure.push([elements[i], elements[j]]);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return relationships;
|
||||
}
|
||||
|
||||
areStructurallySimilar(element1, element2) {
|
||||
// Same tag name
|
||||
if (element1.tagName !== element2.tagName) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Similar class structure
|
||||
const classes1 = Array.from(element1.classList).sort();
|
||||
const classes2 = Array.from(element2.classList).sort();
|
||||
|
||||
// At least 50% overlap in classes
|
||||
const intersection = classes1.filter(c => classes2.includes(c));
|
||||
const union = Array.from(new Set([...classes1, ...classes2]));
|
||||
|
||||
if (union.length > 0 && intersection.length / union.length >= 0.5) {
|
||||
return true;
|
||||
}
|
||||
|
||||
// Similar child structure
|
||||
if (element1.children.length === element2.children.length) {
|
||||
const childTags1 = Array.from(element1.children).map(c => c.tagName).sort();
|
||||
const childTags2 = Array.from(element2.children).map(c => c.tagName).sort();
|
||||
|
||||
if (JSON.stringify(childTags1) === JSON.stringify(childTags2)) {
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
extractMetadata(elements) {
|
||||
const metadata = {
|
||||
title: null,
|
||||
description: null,
|
||||
author: null,
|
||||
date: null,
|
||||
tags: [],
|
||||
microdata: []
|
||||
};
|
||||
|
||||
for (const element of elements) {
|
||||
// Look for title
|
||||
const h1 = element.querySelector('h1');
|
||||
if (h1 && !metadata.title) {
|
||||
metadata.title = h1.textContent.trim();
|
||||
}
|
||||
|
||||
// Look for meta information
|
||||
const metaElements = element.querySelectorAll('[itemprop], [property], [name]');
|
||||
for (const meta of metaElements) {
|
||||
const prop = meta.getAttribute('itemprop') ||
|
||||
meta.getAttribute('property') ||
|
||||
meta.getAttribute('name');
|
||||
const content = meta.getAttribute('content') || meta.textContent.trim();
|
||||
|
||||
if (prop && content) {
|
||||
if (prop.includes('author')) {
|
||||
metadata.author = content;
|
||||
} else if (prop.includes('date') || prop.includes('time')) {
|
||||
metadata.date = content;
|
||||
} else if (prop.includes('description')) {
|
||||
metadata.description = content;
|
||||
} else if (prop.includes('tag') || prop.includes('keyword')) {
|
||||
metadata.tags.push(content);
|
||||
}
|
||||
|
||||
metadata.microdata.push({ property: prop, value: content });
|
||||
}
|
||||
}
|
||||
|
||||
// Look for time elements
|
||||
const timeElements = element.querySelectorAll('time');
|
||||
for (const time of timeElements) {
|
||||
if (!metadata.date && time.dateTime) {
|
||||
metadata.date = time.dateTime;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return metadata;
|
||||
}
|
||||
|
||||
determineLayout(elements) {
|
||||
// Check if elements form a grid
|
||||
const positions = elements.map(el => {
|
||||
const rect = el.getBoundingClientRect();
|
||||
return { x: rect.left, y: rect.top, width: rect.width, height: rect.height };
|
||||
});
|
||||
|
||||
// Check for grid layout (multiple elements on same row)
|
||||
const rows = new Map();
|
||||
for (const pos of positions) {
|
||||
const row = Math.round(pos.y / 10) * 10; // Round to nearest 10px
|
||||
if (!rows.has(row)) {
|
||||
rows.set(row, []);
|
||||
}
|
||||
rows.get(row).push(pos);
|
||||
}
|
||||
|
||||
// If multiple elements share rows, it's likely a grid
|
||||
const hasGrid = Array.from(rows.values()).some(row => row.length > 1);
|
||||
|
||||
if (hasGrid) {
|
||||
return 'grid';
|
||||
}
|
||||
|
||||
    // Check for mixed layout (significant variation in widths)
    const widths = positions.map(p => p.width);
    const avgWidth = widths.reduce((a, b) => a + b, 0) / widths.length;
    const variance = widths.reduce((sum, w) => sum + Math.pow(w - avgWidth, 2), 0) / widths.length;
    const stdDev = Math.sqrt(variance);

    if (stdDev / avgWidth > 0.3) {
      return 'mixed';
    }

    return 'linear';
  }

  calculateMaxDepth(elements) {
    let maxDepth = 0;

    for (const element of elements) {
      const depth = this.getElementDepth(element);
      maxDepth = Math.max(maxDepth, depth);
    }

    return maxDepth;
  }

  getElementDepth(element, depth = 0) {
    if (element.children.length === 0) {
      return depth;
    }

    let maxChildDepth = depth;
    for (const child of element.children) {
      const childDepth = this.getElementDepth(child, depth + 1);
      maxChildDepth = Math.max(maxChildDepth, childDepth);
    }

    return maxChildDepth;
  }

  findCommonAncestor(elements) {
    if (elements.length === 0) return null;
    if (elements.length === 1) return elements[0].parentElement;

    // Start with the first element's ancestors
    let ancestor = elements[0];
    const ancestors = [];

    while (ancestor) {
      ancestors.push(ancestor);
      ancestor = ancestor.parentElement;
    }

    // Find the deepest common ancestor
    for (const ancestorCandidate of ancestors) {
      let isCommon = true;

      for (const element of elements) {
        if (!ancestorCandidate.contains(element)) {
          isCommon = false;
          break;
        }
      }

      if (isCommon) {
        return ancestorCandidate;
      }
    }

    return document.body;
  }

  matchesPattern(text, patterns) {
    return patterns.some(pattern => text.includes(pattern));
  }
}
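As a standalone illustration (not part of the extension's code), the mixed-layout check above is a coefficient-of-variation test: when element widths vary by more than 30% relative to their mean, the layout is classed as 'mixed'. `classifyWidths` below is a hypothetical helper mirroring that logic on plain numbers:

```javascript
// Hypothetical standalone sketch of detectLayout's width-variance heuristic.
function classifyWidths(widths) {
  const avg = widths.reduce((a, b) => a + b, 0) / widths.length;
  const variance = widths.reduce((s, w) => s + Math.pow(w - avg, 2), 0) / widths.length;
  // Relative spread (stdDev / mean) above 0.3 means the widths are too uneven
  // for a simple linear reading order.
  return Math.sqrt(variance) / avg > 0.3 ? 'mixed' : 'linear';
}

console.log(classifyWidths([100, 100, 102]));  // near-uniform widths → 'linear'
console.log(classifyWidths([40, 300, 90]));    // widely varying widths → 'mixed'
```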
718
docs/md_v2/apps/crawl4ai-assistant/content/markdownConverter.js
Normal file
@@ -0,0 +1,718 @@
class MarkdownConverter {
  constructor() {
    // Conversion handlers for different element types
    this.converters = {
      'H1': async (el, ctx) => await this.convertHeading(el, 1, ctx),
      'H2': async (el, ctx) => await this.convertHeading(el, 2, ctx),
      'H3': async (el, ctx) => await this.convertHeading(el, 3, ctx),
      'H4': async (el, ctx) => await this.convertHeading(el, 4, ctx),
      'H5': async (el, ctx) => await this.convertHeading(el, 5, ctx),
      'H6': async (el, ctx) => await this.convertHeading(el, 6, ctx),
      'P': async (el, ctx) => await this.convertParagraph(el, ctx),
      'A': async (el, ctx) => await this.convertLink(el, ctx),
      'IMG': async (el, ctx) => await this.convertImage(el, ctx),
      'UL': async (el, ctx) => await this.convertList(el, 'ul', ctx),
      'OL': async (el, ctx) => await this.convertList(el, 'ol', ctx),
      'LI': async (el, ctx) => await this.convertListItem(el, ctx),
      'TABLE': async (el, ctx) => await this.convertTable(el, ctx),
      'BLOCKQUOTE': async (el, ctx) => await this.convertBlockquote(el, ctx),
      'PRE': async (el, ctx) => await this.convertPreformatted(el, ctx),
      'CODE': async (el, ctx) => await this.convertCode(el, ctx),
      'HR': async (el, ctx) => '\n---\n',
      'BR': async (el, ctx) => '  \n',
      'STRONG': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
      'B': async (el, ctx) => `**${await this.getTextContent(el, ctx)}**`,
      'EM': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
      'I': async (el, ctx) => `*${await this.getTextContent(el, ctx)}*`,
      'DEL': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
      'S': async (el, ctx) => `~~${await this.getTextContent(el, ctx)}~~`,
      'DIV': async (el, ctx) => await this.convertDiv(el, ctx),
      'SPAN': async (el, ctx) => await this.convertSpan(el, ctx),
      'ARTICLE': async (el, ctx) => await this.convertArticle(el, ctx),
      'SECTION': async (el, ctx) => await this.convertSection(el, ctx),
      'FIGURE': async (el, ctx) => await this.convertFigure(el, ctx),
      'FIGCAPTION': async (el, ctx) => await this.convertFigCaption(el, ctx),
      'VIDEO': async (el, ctx) => await this.convertVideo(el, ctx),
      'IFRAME': async (el, ctx) => await this.convertIframe(el, ctx),
      'DL': async (el, ctx) => await this.convertDefinitionList(el, ctx),
      'DT': async (el, ctx) => await this.convertDefinitionTerm(el, ctx),
      'DD': async (el, ctx) => await this.convertDefinitionDescription(el, ctx),
      'TR': async (el, ctx) => await this.convertTableRow(el, ctx)
    };

    // Maintain context during conversion
    this.conversionContext = {
      listDepth: 0,
      inTable: false,
      inCode: false,
      preserveWhitespace: false,
      references: [],
      imageCount: 0,
      linkCount: 0
    };
  }

  async convert(elements, options = {}) {
    // Reset context
    this.resetContext();

    // Apply options
    this.options = {
      includeImages: true,
      preserveTables: true,
      keepCodeFormatting: true,
      simplifyLayout: false,
      preserveLinks: true,
      ...options
    };

    // Convert elements
    const markdownParts = [];

    for (const element of elements) {
      const markdown = await this.convertElement(element, this.conversionContext);
      if (markdown.trim()) {
        markdownParts.push(markdown);
      }
    }

    // Join parts with appropriate spacing
    let result = markdownParts.join('\n\n');

    // Add references if using reference-style links
    if (this.conversionContext.references.length > 0) {
      result += '\n\n' + this.generateReferences();
    }

    // Post-process to clean up
    result = this.postProcess(result);

    return result;
  }

  resetContext() {
    this.conversionContext = {
      listDepth: 0,
      inTable: false,
      inCode: false,
      preserveWhitespace: false,
      references: [],
      imageCount: 0,
      linkCount: 0
    };
  }

  async convertElement(element, context) {
    // Skip hidden elements
    if (this.isHidden(element)) {
      return '';
    }

    // Skip script and style elements
    if (['SCRIPT', 'STYLE', 'NOSCRIPT'].includes(element.tagName)) {
      return '';
    }

    // Get converter for this element type
    const converter = this.converters[element.tagName];

    if (converter) {
      return await converter(element, context);
    } else {
      // For unknown elements, process children
      return await this.processChildren(element, context);
    }
  }

  async processChildren(element, context) {
    const parts = [];

    for (const child of element.childNodes) {
      if (child.nodeType === Node.TEXT_NODE) {
        const text = this.processTextNode(child, context);
        if (text) {
          parts.push(text);
        }
      } else if (child.nodeType === Node.ELEMENT_NODE) {
        const markdown = await this.convertElement(child, context);
        if (markdown) {
          parts.push(markdown);
        }
      }
    }

    return parts.join('');
  }

  processTextNode(node, context) {
    let text = node.textContent;

    // Preserve whitespace in code blocks
    if (!context.preserveWhitespace && !context.inCode) {
      // Normalize whitespace
      text = text.replace(/\s+/g, ' ');

      // Trim if at block boundaries
      if (this.isBlockBoundary(node.previousSibling)) {
        text = text.trimStart();
      }
      if (this.isBlockBoundary(node.nextSibling)) {
        text = text.trimEnd();
      }
    }

    // Escape markdown characters
    if (!context.inCode) {
      text = this.escapeMarkdown(text);
    }

    return text;
  }

  isBlockBoundary(node) {
    if (!node || node.nodeType !== Node.ELEMENT_NODE) {
      return true;
    }

    const blockElements = [
      'DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6',
      'UL', 'OL', 'LI', 'BLOCKQUOTE', 'PRE', 'TABLE',
      'HR', 'ARTICLE', 'SECTION', 'HEADER', 'FOOTER',
      'NAV', 'ASIDE', 'MAIN'
    ];

    return blockElements.includes(node.tagName);
  }

  escapeMarkdown(text) {
    // In text-only mode, don't escape characters
    if (this.options.textOnly) {
      return text;
    }

    // Escape special markdown characters
    return text
      .replace(/\\/g, '\\\\')
      .replace(/\*/g, '\\*')
      .replace(/_/g, '\\_')
      .replace(/\[/g, '\\[')
      .replace(/\]/g, '\\]')
      .replace(/\(/g, '\\(')
      .replace(/\)/g, '\\)')
      .replace(/\#/g, '\\#')
      .replace(/\+/g, '\\+')
      .replace(/\-/g, '\\-')
      .replace(/\./g, '\\.')
      .replace(/\!/g, '\\!')
      .replace(/\|/g, '\\|');
  }

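A standalone sketch of the escaping rule escapeMarkdown applies (`escapeMd` is a hypothetical free function, not the extension's own): the backslash must be escaped first, otherwise later replacements would double-escape their own output.

```javascript
// Hypothetical sketch of markdown special-character escaping: backslash first,
// then the punctuation that markdown would otherwise interpret.
function escapeMd(text) {
  return text
    .replace(/\\/g, '\\\\')
    .replace(/\*/g, '\\*')
    .replace(/_/g, '\\_')
    .replace(/\[/g, '\\[')
    .replace(/\]/g, '\\]');
}

console.log(escapeMd('a *bold* [link]_x_'));  // a \*bold\* \[link\]\_x\_
```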
  async convertHeading(element, level, context) {
    const text = await this.getTextContent(element, context);
    return '#'.repeat(level) + ' ' + text + '\n';
  }

  async convertParagraph(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertLink(element, context) {
    if (!this.options.preserveLinks || this.options.textOnly) {
      return await this.getTextContent(element, context);
    }

    const text = await this.getTextContent(element, context);
    const href = element.getAttribute('href');
    const title = element.getAttribute('title');

    if (!href) {
      return text;
    }

    // Convert relative URLs to absolute
    const absoluteUrl = this.makeAbsoluteUrl(href);

    if (text && absoluteUrl) {
      if (title) {
        return `[${text}](${absoluteUrl} "${title}")`;
      } else {
        return `[${text}](${absoluteUrl})`;
      }
    }

    return text;
  }

  async convertImage(element, context) {
    if (!this.options.includeImages || this.options.textOnly) {
      // In text-only mode, return alt text if available
      if (this.options.textOnly) {
        const alt = element.getAttribute('alt');
        return alt ? `[Image: ${alt}]` : '';
      }
      return '';
    }

    const src = element.getAttribute('src');
    const alt = element.getAttribute('alt') || '';
    const title = element.getAttribute('title');

    if (!src) {
      return '';
    }

    // Convert relative URLs to absolute
    const absoluteUrl = this.makeAbsoluteUrl(src);

    if (title) {
      return `![${alt}](${absoluteUrl} "${title}")`;
    } else {
      return `![${alt}](${absoluteUrl})`;
    }
  }

  async convertList(element, type, context) {
    const oldDepth = context.listDepth;
    context.listDepth++;

    const items = [];
    for (const child of element.children) {
      if (child.tagName === 'LI') {
        const markdown = await this.convertListItem(child, { ...context, listType: type });
        if (markdown) {
          items.push(markdown);
        }
      }
    }

    context.listDepth = oldDepth;

    return items.join('\n') + (context.listDepth === 0 ? '\n' : '');
  }

  async convertListItem(element, context) {
    const indent = '  '.repeat(Math.max(0, context.listDepth - 1));
    const bullet = context.listType === 'ol' ? '1.' : '-';
    const content = (await this.processChildren(element, context)).trim();

    return `${indent}${bullet} ${content}`;
  }

  async convertTable(element, context) {
    if (!this.options.preserveTables || this.options.textOnly) {
      // Fallback to simple text representation
      return await this.convertTableToText(element, context);
    }

    const rows = [];
    const headerRows = [];
    let maxCols = 0;

    // Process table rows
    for (const child of element.children) {
      if (child.tagName === 'THEAD') {
        for (const row of child.children) {
          if (row.tagName === 'TR') {
            const cells = await this.processTableRow(row, context);
            headerRows.push(cells);
            maxCols = Math.max(maxCols, cells.length);
          }
        }
      } else if (child.tagName === 'TBODY') {
        for (const row of child.children) {
          if (row.tagName === 'TR') {
            const cells = await this.processTableRow(row, context);
            rows.push(cells);
            maxCols = Math.max(maxCols, cells.length);
          }
        }
      } else if (child.tagName === 'TR') {
        const cells = await this.processTableRow(child, context);
        rows.push(cells);
        maxCols = Math.max(maxCols, cells.length);
      }
    }

    // Build markdown table
    const markdownRows = [];

    // Add headers
    if (headerRows.length > 0) {
      for (const headerRow of headerRows) {
        const paddedRow = this.padTableRow(headerRow, maxCols);
        markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
      }

      // Add separator
      const separator = Array(maxCols).fill('---');
      markdownRows.push('| ' + separator.join(' | ') + ' |');
    }

    // Add body rows
    for (const row of rows) {
      const paddedRow = this.padTableRow(row, maxCols);
      markdownRows.push('| ' + paddedRow.join(' | ') + ' |');
    }

    return markdownRows.join('\n') + '\n';
  }

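A standalone sketch of the nested-list indentation rule convertListItem applies: two spaces per nesting level below the top, '1.' for ordered lists, '-' for unordered (`listItemLine` is a hypothetical free-function form of the same logic):

```javascript
// Hypothetical sketch: list-item rendering with depth-based indentation.
function listItemLine(content, depth, type) {
  const indent = '  '.repeat(Math.max(0, depth - 1));
  const bullet = type === 'ol' ? '1.' : '-';
  return `${indent}${bullet} ${content}`;
}

console.log(listItemLine('top level', 1, 'ul'));  // "- top level"
console.log(listItemLine('nested', 2, 'ol'));     // "  1. nested"
```

Emitting '1.' for every ordered item relies on markdown renderers renumbering sequential items automatically.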
  async processTableRow(row, context) {
    const cells = [];

    for (const cell of row.children) {
      if (cell.tagName === 'TD' || cell.tagName === 'TH') {
        const content = (await this.getTextContent(cell, context)).trim();
        cells.push(content);
      }
    }

    return cells;
  }

  async convertTableRow(element, context) {
    // Convert a single table row to markdown
    if (this.options.textOnly) {
      const cells = await this.processTableRow(element, context);
      return cells.join(' ');
    }

    // For non-text-only mode, create a simple table representation
    const cells = await this.processTableRow(element, context);
    return '| ' + cells.join(' | ') + ' |';
  }

  padTableRow(row, targetLength) {
    const padded = [...row];
    while (padded.length < targetLength) {
      padded.push('');
    }
    return padded;
  }

  async convertTableToText(element, context) {
    // Convert table to clean text representation
    const lines = [];
    const rows = element.querySelectorAll('tr');

    for (const row of rows) {
      const cells = row.querySelectorAll('td, th');
      const cellTexts = [];

      for (const cell of cells) {
        const text = (await this.getTextContent(cell, context)).trim();
        if (text) {
          cellTexts.push(text);
        }
      }

      if (cellTexts.length > 0) {
        // Join cells with space, handling common patterns
        lines.push(cellTexts.join(' '));
      }
    }

    return lines.join('\n');
  }

  async convertBlockquote(element, context) {
    const lines = (await this.processChildren(element, context)).trim().split('\n');
    return lines.map(line => '> ' + line).join('\n') + '\n';
  }

  async convertPreformatted(element, context) {
    const oldInCode = context.inCode;
    const oldPreserveWhitespace = context.preserveWhitespace;

    context.inCode = true;
    context.preserveWhitespace = true;

    let content = '';
    let language = '';

    // Check if this is a code block with language
    const codeElement = element.querySelector('code');
    if (codeElement) {
      // Try to detect language from class
      const className = codeElement.className;
      const langMatch = className.match(/language-(\w+)/);
      if (langMatch) {
        language = langMatch[1];
      }

      content = codeElement.textContent;
    } else {
      content = element.textContent;
    }

    context.inCode = oldInCode;
    context.preserveWhitespace = oldPreserveWhitespace;

    // Use fenced code blocks
    return '```' + language + '\n' + content + '\n```\n';
  }

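The table pipeline above (processTableRow → padTableRow → pipe-joined rows) can be exercised standalone; the re-implemented `padTableRow` and the sample data below are illustrative only:

```javascript
// Hypothetical sketch of the pipe-table assembly: ragged rows are padded to the
// widest row, then each row is joined with ' | ' between pipes.
function padTableRow(row, targetLength) {
  const padded = [...row];
  while (padded.length < targetLength) padded.push('');
  return padded;
}

const header = ['Name', 'Role'];
const rows = [['Ada', 'Engineer'], ['Linus']];  // second row is ragged
const maxCols = Math.max(header.length, ...rows.map(r => r.length));

const lines = [
  '| ' + padTableRow(header, maxCols).join(' | ') + ' |',
  '| ' + Array(maxCols).fill('---').join(' | ') + ' |',
  ...rows.map(r => '| ' + padTableRow(r, maxCols).join(' | ') + ' |')
];
console.log(lines.join('\n'));
```

Padding every row to `maxCols` keeps the separator row and body rows aligned even when source `<tr>` elements have differing cell counts.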
  async convertCode(element, context) {
    if (element.parentElement && element.parentElement.tagName === 'PRE') {
      // Already handled by convertPreformatted
      return element.textContent;
    }

    const content = element.textContent;
    return '`' + content + '`';
  }

  async convertDiv(element, context) {
    // Check for special div types
    if (element.className.includes('code-block') ||
        element.className.includes('highlight')) {
      return await this.convertPreformatted(element, context);
    }

    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertSpan(element, context) {
    // Check for special span types
    if (element.className.includes('code') ||
        element.className.includes('inline-code')) {
      return this.convertCode(element, context);
    }

    return await this.processChildren(element, context);
  }

  async convertArticle(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertSection(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertFigure(element, context) {
    const content = await this.processChildren(element, context);
    return content.trim() ? content + '\n' : '';
  }

  async convertFigCaption(element, context) {
    const caption = await this.getTextContent(element, context);
    return caption ? '\n*' + caption + '*\n' : '';
  }

  async convertVideo(element, context) {
    const title = element.getAttribute('title') || 'Video';

    if (this.options.textOnly) {
      return `[Video: ${title}]`;
    }

    const src = element.getAttribute('src');
    const poster = element.getAttribute('poster');

    if (!src) {
      return '';
    }

    // Convert to markdown with poster image if available
    if (poster) {
      const absolutePoster = this.makeAbsoluteUrl(poster);
      const absoluteSrc = this.makeAbsoluteUrl(src);
      return `[![${title}](${absolutePoster})](${absoluteSrc})`;
    } else {
      const absoluteSrc = this.makeAbsoluteUrl(src);
      return `[${title}](${absoluteSrc})`;
    }
  }

  async convertIframe(element, context) {
    const title = element.getAttribute('title') || 'Embedded content';

    if (this.options.textOnly) {
      const src = element.getAttribute('src') || '';
      if (src.includes('youtube.com') || src.includes('youtu.be')) {
        return `[Video: ${title}]`;
      } else if (src.includes('vimeo.com')) {
        return `[Video: ${title}]`;
      } else {
        return `[Embedded: ${title}]`;
      }
    }

    const src = element.getAttribute('src');
    if (!src) {
      return '';
    }

    // Check for common embeds
    if (src.includes('youtube.com') || src.includes('youtu.be')) {
      return `[▶️ ${title}](${src})`;
    } else if (src.includes('vimeo.com')) {
      return `[▶️ ${title}](${src})`;
    } else {
      return `[${title}](${src})`;
    }
  }

  async convertDefinitionList(element, context) {
    return await this.processChildren(element, context) + '\n';
  }

  async convertDefinitionTerm(element, context) {
    const term = await this.getTextContent(element, context);
    return '**' + term + '**\n';
  }

  async convertDefinitionDescription(element, context) {
    const description = await this.processChildren(element, context);
    return ': ' + description + '\n';
  }

  async getTextContent(element, context) {
    // Special handling for elements that might contain other markdown
    if (context.inCode) {
      return element.textContent;
    }

    return await this.processChildren(element, context);
  }

  makeAbsoluteUrl(url) {
    if (!url) return '';

    try {
      // Check if already absolute
      if (url.startsWith('http://') || url.startsWith('https://')) {
        return url;
      }

      // Handle protocol-relative URLs
      if (url.startsWith('//')) {
        return window.location.protocol + url;
      }

      // Convert relative to absolute
      const base = window.location.origin;
      const path = window.location.pathname;

      if (url.startsWith('/')) {
        return base + url;
      } else {
        // Relative to current path
        const pathDir = path.substring(0, path.lastIndexOf('/') + 1);
        return base + pathDir + url;
      }
    } catch (e) {
      return url;
    }
  }

  isHidden(element) {
    const style = window.getComputedStyle(element);
    return style.display === 'none' ||
           style.visibility === 'hidden' ||
           style.opacity === '0';
  }

  generateReferences() {
    return this.conversionContext.references
      .map((ref, index) => `[${index + 1}]: ${ref.url}`)
      .join('\n');
  }

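The manual relative-to-absolute resolution in makeAbsoluteUrl (origin for root-relative paths, current directory for bare names) matches what the standard `URL` constructor does in one step; `makeAbsolute` below is an illustrative standalone equivalent, not the extension's code:

```javascript
// Hypothetical sketch: URL resolution via the standard URL constructor, with the
// same leave-it-unchanged fallback the original uses on unparsable input.
function makeAbsolute(url, base) {
  try {
    return new URL(url, base).href;
  } catch (e) {
    return url;
  }
}

console.log(makeAbsolute('/img/logo.png', 'https://example.com/docs/page.html'));
// https://example.com/img/logo.png
console.log(makeAbsolute('logo.png', 'https://example.com/docs/page.html'));
// https://example.com/docs/logo.png
```

Unlike the hand-rolled version, `new URL()` also normalizes `../` segments and preserves the base's protocol for `//host/path` inputs.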
  postProcess(markdown) {
    // Apply text-only specific processing
    if (this.options.textOnly) {
      markdown = this.postProcessTextOnly(markdown);
    }

    // Clean up excessive newlines
    markdown = markdown.replace(/\n{3,}/g, '\n\n');

    // Clean up spaces before punctuation
    markdown = markdown.replace(/ +([.,;:!?])/g, '$1');

    // Ensure proper spacing around headers
    markdown = markdown.replace(/\n(#{1,6} )/g, '\n\n$1');
    markdown = markdown.replace(/(#{1,6} .+)\n(?![\n#])/g, '$1\n\n');

    // Clean up list spacing
    markdown = markdown.replace(/\n\n(-|\d+\.) /g, '\n$1 ');

    // Trim final result
    return markdown.trim();
  }

  postProcessTextOnly(markdown) {
    // Smart pattern recognition for common formats
    const lines = markdown.split('\n');
    const processedLines = [];
    let inMetadata = false;
    let currentItem = null;

    for (let i = 0; i < lines.length; i++) {
      const line = lines[i].trim();
      if (!line) {
        processedLines.push('');
        continue;
      }

      // Detect numbered list items (common in HN, Reddit, etc.)
      const numberPattern = /^(\d+)\.\s*(.+)$/;
      const numberMatch = line.match(numberPattern);

      if (numberMatch) {
        // Start of a new numbered item
        inMetadata = false;
        currentItem = numberMatch[1];
        const content = numberMatch[2];

        // Check if content has domain in parentheses
        const domainPattern = /^(.+?)\s*\(([^)]+)\)\s*(.*)$/;
        const domainMatch = content.match(domainPattern);

        if (domainMatch) {
          const [, title, domain, rest] = domainMatch;
          processedLines.push(`${currentItem}. **${title.trim()}** (${domain})`);
          if (rest.trim()) {
            processedLines.push(` ${rest.trim()}`);
            inMetadata = true;
          }
        } else {
          processedLines.push(`${currentItem}. **${content}**`);
        }
      } else if (line.match(/\b(points?|by|ago|hide|comments?)\b/i) && currentItem) {
        // This looks like metadata for the current item
        const cleanedLine = line
          .replace(/\s+/g, ' ')
          .replace(/\s*\|\s*/g, ' | ')
          .trim();
        processedLines.push(` ${cleanedLine}`);
        inMetadata = true;
      } else if (inMetadata && line.length < 100) {
        // Continue metadata if we're in metadata mode and line is short
        processedLines.push(` ${line}`);
      } else {
        // Regular content
        inMetadata = false;
        processedLines.push(line);
      }
    }

    // Clean up the output
    let result = processedLines.join('\n');

    // Remove excessive blank lines
    result = result.replace(/\n{3,}/g, '\n\n');

    // Ensure proper spacing after numbered items
    result = result.replace(/^(\d+\..+)$\n^(?!\s)/gm, '$1\n\n');

    return result;
  }
}
701
docs/md_v2/apps/crawl4ai-assistant/content/markdownExtraction.js
Normal file
@@ -0,0 +1,701 @@
class MarkdownExtraction {
  constructor() {
    this.selectedElements = new Set();
    this.highlightBoxes = new Map();
    this.selectionMode = false;
    this.toolbar = null;
    this.markdownPreviewModal = null;
    this.selectionCounter = 0;
    this.markdownConverter = null;
    this.contentAnalyzer = null;

    this.init();
  }

  async init() {
    // Initialize dependencies
    this.markdownConverter = new MarkdownConverter();
    this.contentAnalyzer = new ContentAnalyzer();

    this.createToolbar();
    this.setupEventListeners();
  }

  createToolbar() {
    // Create floating toolbar
    this.toolbar = document.createElement('div');
    this.toolbar.className = 'c4ai-c2c-toolbar';
    this.toolbar.innerHTML = `
      <div class="c4ai-toolbar-header">
        <div class="c4ai-toolbar-dots">
          <span class="c4ai-dot c4ai-dot-red"></span>
          <span class="c4ai-dot c4ai-dot-yellow"></span>
          <span class="c4ai-dot c4ai-dot-green"></span>
        </div>
        <span class="c4ai-toolbar-title">Markdown Extraction</span>
        <button class="c4ai-close-btn" title="Close">×</button>
      </div>
      <div class="c4ai-toolbar-content">
        <div class="c4ai-selection-info">
          <span class="c4ai-selection-count">0 elements selected</span>
          <button class="c4ai-clear-btn" title="Clear selection" disabled>Clear</button>
        </div>
        <div class="c4ai-toolbar-actions">
          <button class="c4ai-preview-btn" disabled>Preview Markdown</button>
          <button class="c4ai-copy-btn" disabled>Copy to Clipboard</button>
        </div>
        <div class="c4ai-toolbar-instructions">
          <p>💡 <strong>Ctrl/Cmd + Click</strong> to select multiple elements</p>
          <p>📝 Selected elements will be converted to clean markdown</p>
          <p>⌨️ Press <strong>ESC</strong> to exit</p>
        </div>
      </div>
    `;

    document.body.appendChild(this.toolbar);
    makeDraggableByHeader(this.toolbar);

    // Position toolbar
    this.toolbar.style.position = 'fixed';
    this.toolbar.style.top = '20px';
    this.toolbar.style.right = '20px';
    this.toolbar.style.zIndex = '999999';
  }

  setupEventListeners() {
    // Close button
    this.toolbar.querySelector('.c4ai-close-btn').addEventListener('click', () => {
      this.deactivate();
    });

    // Clear selection button
    this.toolbar.querySelector('.c4ai-clear-btn').addEventListener('click', () => {
      this.clearSelection();
    });

    // Preview button
    this.toolbar.querySelector('.c4ai-preview-btn').addEventListener('click', () => {
      this.showPreview();
    });

    // Copy button
    this.toolbar.querySelector('.c4ai-copy-btn').addEventListener('click', () => {
      this.copyToClipboard();
    });

    // Document click handler for element selection
    this.documentClickHandler = (event) => this.handleElementClick(event);
    document.addEventListener('click', this.documentClickHandler, true);

    // Prevent default link behavior during selection mode
    this.linkClickHandler = (event) => {
      if (event.ctrlKey || event.metaKey) {
        event.preventDefault();
        event.stopPropagation();
      }
    };
    document.addEventListener('click', this.linkClickHandler, true);

    // Hover effect
    this.documentHoverHandler = (event) => this.handleElementHover(event);
    document.addEventListener('mouseover', this.documentHoverHandler, true);

    // Remove hover on mouseout
    this.documentMouseOutHandler = (event) => this.handleElementMouseOut(event);
    document.addEventListener('mouseout', this.documentMouseOutHandler, true);

    // Keyboard shortcuts
    this.keyboardHandler = (event) => this.handleKeyboard(event);
    document.addEventListener('keydown', this.keyboardHandler);
  }

  handleElementClick(event) {
    // Check if Ctrl/Cmd is pressed
    if (!event.ctrlKey && !event.metaKey) return;

    // Prevent default behavior
    event.preventDefault();
    event.stopPropagation();

    const element = event.target;

    // Don't select our own UI elements
    if (element.closest('.c4ai-c2c-toolbar') ||
        element.closest('.c4ai-c2c-preview') ||
        element.closest('.c4ai-highlight-box')) {
      return;
    }

    // Toggle element selection
    if (this.selectedElements.has(element)) {
      this.deselectElement(element);
    } else {
      this.selectElement(element);
    }

    this.updateUI();
  }

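The toggle-on-ctrl-click selection model above reduces to Set membership; `toggleSelection` below is a hypothetical DOM-free sketch of the same idea:

```javascript
// Hypothetical sketch: clicking an already-selected element removes it from the
// selection Set, clicking an unselected one adds it (same shape as the handler).
function toggleSelection(selected, el) {
  if (selected.has(el)) {
    selected.delete(el);
  } else {
    selected.add(el);
  }
  return selected;
}

const sel = new Set();
toggleSelection(sel, 'p#1');
toggleSelection(sel, 'p#2');
toggleSelection(sel, 'p#1');   // second click on p#1 deselects it
console.log([...sel]);         // [ 'p#2' ]
```

Using a `Set` keyed by the element reference itself (rather than a selector string, as in this sketch) is what lets the extension toggle in O(1) without tagging the DOM.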
  handleElementHover(event) {
    const element = event.target;

    // Don't hover our own UI elements
    if (element.closest('.c4ai-c2c-toolbar') ||
        element.closest('.c4ai-c2c-preview') ||
        element.closest('.c4ai-highlight-box') ||
        element.hasAttribute('data-c4ai-badge')) {
      return;
    }

    // Add hover class
    element.classList.add('c4ai-hover-candidate');
  }

  handleElementMouseOut(event) {
    const element = event.target;
    element.classList.remove('c4ai-hover-candidate');
  }

  handleKeyboard(event) {
    // ESC to deactivate
    if (event.key === 'Escape') {
      this.deactivate();
    }
    // Ctrl/Cmd + A to select all visible elements
    else if ((event.ctrlKey || event.metaKey) && event.key === 'a') {
      event.preventDefault();
      // Select all visible text-containing elements
      const elements = document.querySelectorAll('p, h1, h2, h3, h4, h5, h6, li, td, th, div, span, article, section');
      elements.forEach(el => {
        if (el.textContent.trim() && this.isVisible(el) && !this.selectedElements.has(el)) {
          this.selectElement(el);
        }
      });
      this.updateUI();
    }
  }

  isVisible(element) {
    const rect = element.getBoundingClientRect();
    const style = window.getComputedStyle(element);
    return rect.width > 0 &&
           rect.height > 0 &&
           style.display !== 'none' &&
           style.visibility !== 'hidden' &&
           style.opacity !== '0';
  }

  selectElement(element) {
    this.selectedElements.add(element);

    // Create highlight box
    const box = this.createHighlightBox(element);
    this.highlightBoxes.set(element, box);

    // Add selected class
    element.classList.add('c4ai-selected');

    this.selectionCounter++;
  }

  deselectElement(element) {
    this.selectedElements.delete(element);

    // Remove highlight box (badge)
    const badge = this.highlightBoxes.get(element);
    if (badge) {
      // Remove scroll/resize listeners
      if (badge._updatePosition) {
        window.removeEventListener('scroll', badge._updatePosition, true);
        window.removeEventListener('resize', badge._updatePosition);
      }
      badge.remove();
      this.highlightBoxes.delete(element);
    }

    // Remove outline
    element.style.outline = '';
    element.style.outlineOffset = '';

    // Remove attributes
    element.removeAttribute('data-c4ai-selection-order');
    element.classList.remove('c4ai-selected');

    this.selectionCounter--;
  }

  createHighlightBox(element) {
    // Add a data attribute to track selection order
    element.setAttribute('data-c4ai-selection-order', this.selectionCounter + 1);

    // Add selection outline directly to the element
    element.style.outline = '2px solid #0fbbaa';
    element.style.outlineOffset = '2px';

    // Create badge with fixed positioning
    const badge = document.createElement('div');
    badge.className = 'c4ai-selection-badge-fixed';
    badge.textContent = this.selectionCounter + 1;
    badge.setAttribute('data-c4ai-badge', 'true');
    badge.title = 'Click to deselect';

    // Get element position and set badge position
    const rect = element.getBoundingClientRect();
    badge.style.cssText = `
      position: fixed !important;
      top: ${rect.top - 12}px !important;
      left: ${rect.left - 12}px !important;
      width: 24px !important;
      height: 24px !important;
      background: #0fbbaa !important;
      color: #070708 !important;
      border-radius: 50% !important;
      display: flex !important;
      align-items: center !important;
      justify-content: center !important;
      font-size: 12px !important;
      font-weight: bold !important;
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif !important;
      box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3) !important;
      z-index: 999998 !important;
      cursor: pointer !important;
      transition: all 0.2s ease !important;
      pointer-events: auto !important;
      border: none !important;
      padding: 0 !important;
      margin: 0 !important;
      line-height: 1 !important;
      text-align: center !important;
      text-decoration: none !important;
      box-sizing: border-box !important;
    `;

    // Add hover styles dynamically
    badge.addEventListener('mouseenter', () => {
      badge.style.setProperty('background', '#ff3c74', 'important');
      badge.style.setProperty('transform', 'scale(1.1)', 'important');
    });

    badge.addEventListener('mouseleave', () => {
      badge.style.setProperty('background', '#0fbbaa', 'important');
      badge.style.setProperty('transform', 'scale(1)', 'important');
    });

    // Add click handler to badge for deselection
    badge.addEventListener('click', (e) => {
      e.stopPropagation();
      e.preventDefault();
      this.deselectElement(element);
      this.updateUI();
    });

    // Add scroll listener to update position
    const updatePosition = () => {
      const newRect = element.getBoundingClientRect();
      badge.style.top = `${newRect.top - 12}px`;
      badge.style.left = `${newRect.left - 12}px`;
    };

    // Store the update function so we can remove it later
    badge._updatePosition = updatePosition;
    window.addEventListener('scroll', updatePosition, true);
    window.addEventListener('resize', updatePosition);

    document.body.appendChild(badge);
|
||||
|
||||
return badge;
|
||||
}
|
||||
|
||||
clearSelection() {
|
||||
// Clear all selections
|
||||
this.selectedElements.forEach(element => {
|
||||
// Remove badge
|
||||
const badge = this.highlightBoxes.get(element);
|
||||
if (badge) {
|
||||
// Remove scroll/resize listeners
|
||||
if (badge._updatePosition) {
|
||||
window.removeEventListener('scroll', badge._updatePosition, true);
|
||||
window.removeEventListener('resize', badge._updatePosition);
|
||||
}
|
||||
badge.remove();
|
||||
}
|
||||
|
||||
// Remove outline
|
||||
element.style.outline = '';
|
||||
element.style.outlineOffset = '';
|
||||
|
||||
// Remove attributes
|
||||
element.removeAttribute('data-c4ai-selection-order');
|
||||
element.classList.remove('c4ai-selected');
|
||||
});
|
||||
|
||||
this.selectedElements.clear();
|
||||
this.highlightBoxes.clear();
|
||||
this.selectionCounter = 0;
|
||||
|
||||
this.updateUI();
|
||||
}
|
||||
|
||||
updateUI() {
|
||||
const count = this.selectedElements.size;
|
||||
|
||||
// Update selection count
|
||||
this.toolbar.querySelector('.c4ai-selection-count').textContent =
|
||||
`${count} element${count !== 1 ? 's' : ''} selected`;
|
||||
|
||||
// Enable/disable buttons
|
||||
const hasSelection = count > 0;
|
||||
this.toolbar.querySelector('.c4ai-preview-btn').disabled = !hasSelection;
|
||||
this.toolbar.querySelector('.c4ai-copy-btn').disabled = !hasSelection;
|
||||
this.toolbar.querySelector('.c4ai-clear-btn').disabled = !hasSelection;
|
||||
}
|
||||
|
||||
async showPreview() {
|
||||
// Initialize markdown preview modal if not already done
|
||||
if (!this.markdownPreviewModal) {
|
||||
this.markdownPreviewModal = new MarkdownPreviewModal();
|
||||
}
|
||||
|
||||
// Show modal with callback to generate markdown
|
||||
this.markdownPreviewModal.show(async (options) => {
|
||||
return await this.generateMarkdown(options);
|
||||
});
|
||||
}
|
||||

  async generateMarkdown(options) {
    // Get selected elements as array
    const elements = Array.from(this.selectedElements);

    // Sort elements by their selection order
    const sortedElements = elements.sort((a, b) => {
      const orderA = parseInt(a.getAttribute('data-c4ai-selection-order') || '0');
      const orderB = parseInt(b.getAttribute('data-c4ai-selection-order') || '0');
      return orderA - orderB;
    });

    // Convert each element separately
    const markdownParts = [];

    for (let i = 0; i < sortedElements.length; i++) {
      const element = sortedElements[i];

      // Add XPath header if enabled
      if (options.includeXPath) {
        const xpath = this.getXPath(element);
        markdownParts.push(`### Element ${i + 1} - XPath: \`${xpath}\`\n`);
      }

      // Check if element is part of a table structure that should be processed specially
      let elementsToConvert = [element];

      // If text-only mode and element is a TR, process the entire table for better context
      if (options.textOnly && element.tagName === 'TR') {
        const table = element.closest('table');
        if (table && !sortedElements.includes(table)) {
          // Only include this table row, not the whole table
          elementsToConvert = [element];
        }
      }

      // Analyze and convert individual element
      const analysis = await this.contentAnalyzer.analyze(elementsToConvert);
      const markdown = await this.markdownConverter.convert(elementsToConvert, {
        ...options,
        analysis
      });

      // Trim the markdown before adding
      const trimmedMarkdown = markdown.trim();
      markdownParts.push(trimmedMarkdown);

      // Add separator if enabled and not last element
      if (options.addSeparators && i < sortedElements.length - 1) {
        markdownParts.push('\n---\n');
      }
    }

    return markdownParts.join('\n');
  }

  getXPath(element) {
    if (element.id) {
      return `//*[@id="${element.id}"]`;
    }

    const parts = [];
    let current = element;

    while (current && current.nodeType === Node.ELEMENT_NODE) {
      let index = 0;
      let sibling = current.previousSibling;

      while (sibling) {
        if (sibling.nodeType === Node.ELEMENT_NODE && sibling.nodeName === current.nodeName) {
          index++;
        }
        sibling = sibling.previousSibling;
      }

      const tagName = current.nodeName.toLowerCase();
      const part = index > 0 ? `${tagName}[${index + 1}]` : tagName;
      parts.unshift(part);

      current = current.parentNode;
    }

    return '/' + parts.join('/');
  }

  sortElementsByPosition(elements) {
    return elements.sort((a, b) => {
      const position = a.compareDocumentPosition(b);
      if (position & Node.DOCUMENT_POSITION_FOLLOWING) {
        return -1;
      } else if (position & Node.DOCUMENT_POSITION_PRECEDING) {
        return 1;
      }
      return 0;
    });
  }

  async copyToClipboard() {
    if (this.markdownPreviewModal) {
      await this.markdownPreviewModal.copyToClipboard();
    }
  }

  async downloadMarkdown() {
    if (this.markdownPreviewModal) {
      await this.markdownPreviewModal.downloadMarkdown();
    }
  }

  showNotification(message, type = 'success') {
    const notification = document.createElement('div');
    notification.className = `c4ai-notification c4ai-notification-${type}`;
    notification.textContent = message;

    document.body.appendChild(notification);

    // Animate in
    setTimeout(() => notification.classList.add('show'), 10);

    // Remove after 3 seconds
    setTimeout(() => {
      notification.classList.remove('show');
      setTimeout(() => notification.remove(), 300);
    }, 3000);
  }

  deactivate() {
    // Remove event listeners
    document.removeEventListener('click', this.documentClickHandler, true);
    document.removeEventListener('click', this.linkClickHandler, true);
    document.removeEventListener('mouseover', this.documentHoverHandler, true);
    document.removeEventListener('mouseout', this.documentMouseOutHandler, true);
    document.removeEventListener('keydown', this.keyboardHandler);

    // Clear selections
    this.clearSelection();

    // Remove UI elements
    if (this.toolbar) {
      this.toolbar.remove();
      this.toolbar = null;
    }

    if (this.markdownPreviewModal) {
      this.markdownPreviewModal.destroy();
      this.markdownPreviewModal = null;
    }

    // Remove hover styles
    document.querySelectorAll('.c4ai-hover-candidate').forEach(el => {
      el.classList.remove('c4ai-hover-candidate');
    });

    // Notify background script (with error handling)
    try {
      if (chrome.runtime && chrome.runtime.sendMessage) {
        chrome.runtime.sendMessage({
          action: 'c2cDeactivated'
        });
      }
    } catch (error) {
      // Extension context might be invalidated, ignore the error
      console.log('Markdown Extraction deactivated (extension context unavailable)');
    }
  }
}
@@ -0,0 +1,300 @@
// Shared Markdown Preview Modal Component for Crawl4AI Assistant
// Used by both SchemaBuilder and Click2CrawlBuilder

class MarkdownPreviewModal {
  constructor(options = {}) {
    this.modal = null;
    this.markdownOptions = {
      includeImages: true,
      preserveTables: true,
      keepCodeFormatting: true,
      simplifyLayout: false,
      preserveLinks: true,
      addSeparators: true,
      includeXPath: false,
      textOnly: false,
      ...options
    };
    this.onGenerateMarkdown = null;
    this.currentMarkdown = '';
  }

  show(generateMarkdownCallback) {
    this.onGenerateMarkdown = generateMarkdownCallback;

    if (!this.modal) {
      this.createModal();
    }

    // Generate initial markdown
    this.updateContent();
    this.modal.style.display = 'block';
  }

  hide() {
    if (this.modal) {
      this.modal.style.display = 'none';
    }
  }

  createModal() {
    this.modal = document.createElement('div');
    this.modal.className = 'c4ai-c2c-preview';
    this.modal.innerHTML = `
      <div class="c4ai-preview-header">
        <div class="c4ai-toolbar-dots">
          <span class="c4ai-dot c4ai-dot-red"></span>
          <span class="c4ai-dot c4ai-dot-yellow"></span>
          <span class="c4ai-dot c4ai-dot-green"></span>
        </div>
        <span class="c4ai-preview-title">Markdown Preview</span>
        <button class="c4ai-preview-close">×</button>
      </div>
      <div class="c4ai-preview-options">
        <label><input type="checkbox" name="textOnly"> 👁️ Visual Text Mode (As You See)</label>
        <label><input type="checkbox" name="includeImages" checked> Include Images</label>
        <label><input type="checkbox" name="preserveTables" checked> Preserve Tables</label>
        <label><input type="checkbox" name="preserveLinks" checked> Preserve Links</label>
        <label><input type="checkbox" name="keepCodeFormatting" checked> Keep Code Formatting</label>
        <label><input type="checkbox" name="simplifyLayout"> Simplify Layout</label>
        <label><input type="checkbox" name="addSeparators" checked> Add Separators</label>
        <label><input type="checkbox" name="includeXPath"> Include XPath Headers</label>
      </div>
      <div class="c4ai-preview-content">
        <div class="c4ai-preview-tabs">
          <button class="c4ai-tab active" data-tab="preview">Preview</button>
          <button class="c4ai-tab" data-tab="markdown">Markdown</button>
          <button class="c4ai-wrap-toggle" title="Toggle word wrap">↔️ Wrap</button>
        </div>
        <div class="c4ai-preview-pane active" data-pane="preview"></div>
        <div class="c4ai-preview-pane" data-pane="markdown"></div>
      </div>
      <div class="c4ai-preview-actions">
        <button class="c4ai-download-btn">Download .md</button>
        <button class="c4ai-copy-markdown-btn">Copy Markdown</button>
        <button class="c4ai-cloud-btn" disabled>Send to Cloud (Coming Soon)</button>
      </div>
    `;

    document.body.appendChild(this.modal);

    // Make modal draggable
    if (window.C4AI_Utils && window.C4AI_Utils.makeDraggable) {
      window.C4AI_Utils.makeDraggable(this.modal);
    }

    // Position preview modal
    this.modal.style.position = 'fixed';
    this.modal.style.top = '50%';
    this.modal.style.left = '50%';
    this.modal.style.transform = 'translate(-50%, -50%)';
    this.modal.style.zIndex = '999999';

    this.setupEventListeners();
  }

  setupEventListeners() {
    // Close button
    this.modal.querySelector('.c4ai-preview-close').addEventListener('click', () => {
      this.hide();
    });

    // Tab switching
    this.modal.querySelectorAll('.c4ai-tab').forEach(tab => {
      tab.addEventListener('click', (e) => {
        const tabName = e.target.dataset.tab;
        this.switchTab(tabName);
      });
    });

    // Wrap toggle
    const wrapToggle = this.modal.querySelector('.c4ai-wrap-toggle');
    wrapToggle.addEventListener('click', () => {
      const panes = this.modal.querySelectorAll('.c4ai-preview-pane');
      panes.forEach(pane => {
        pane.classList.toggle('wrap');
      });
      wrapToggle.classList.toggle('active');
    });

    // Options change
    this.modal.querySelectorAll('input[type="checkbox"]').forEach(checkbox => {
      checkbox.addEventListener('change', async (e) => {
        this.markdownOptions[e.target.name] = e.target.checked;

        // Handle text-only mode dependencies
        if (e.target.name === 'textOnly' && e.target.checked) {
          const preserveLinksCheckbox = this.modal.querySelector('input[name="preserveLinks"]');
          if (preserveLinksCheckbox) {
            preserveLinksCheckbox.checked = false;
            preserveLinksCheckbox.disabled = true;
            this.markdownOptions.preserveLinks = false;
          }

          const includeImagesCheckbox = this.modal.querySelector('input[name="includeImages"]');
          if (includeImagesCheckbox) {
            includeImagesCheckbox.disabled = true;
          }
        } else if (e.target.name === 'textOnly' && !e.target.checked) {
          // Re-enable options when text-only is disabled
          const preserveLinksCheckbox = this.modal.querySelector('input[name="preserveLinks"]');
          if (preserveLinksCheckbox) {
            preserveLinksCheckbox.disabled = false;
          }

          const includeImagesCheckbox = this.modal.querySelector('input[name="includeImages"]');
          if (includeImagesCheckbox) {
            includeImagesCheckbox.disabled = false;
          }
        }

        // Update markdown content
        await this.updateContent();
      });
    });

    // Action buttons
    this.modal.querySelector('.c4ai-copy-markdown-btn').addEventListener('click', () => {
      this.copyToClipboard();
    });

    this.modal.querySelector('.c4ai-download-btn').addEventListener('click', () => {
      this.downloadMarkdown();
    });
  }

  switchTab(tabName) {
    // Update active tab
    this.modal.querySelectorAll('.c4ai-tab').forEach(tab => {
      tab.classList.toggle('active', tab.dataset.tab === tabName);
    });

    // Update active pane
    this.modal.querySelectorAll('.c4ai-preview-pane').forEach(pane => {
      pane.classList.toggle('active', pane.dataset.pane === tabName);
    });
  }

  async updateContent() {
    if (!this.onGenerateMarkdown) return;

    try {
      // Generate markdown with current options
      this.currentMarkdown = await this.onGenerateMarkdown(this.markdownOptions);

      // Update markdown pane
      const markdownPane = this.modal.querySelector('[data-pane="markdown"]');
      markdownPane.innerHTML = `<pre><code>${this.escapeHtml(this.currentMarkdown)}</code></pre>`;

      // Update preview pane
      const previewPane = this.modal.querySelector('[data-pane="preview"]');

      // Use marked.js if available
      if (window.marked) {
        marked.setOptions({
          gfm: true,
          breaks: true,
          tables: true,
          headerIds: false,
          mangle: false
        });

        const html = marked.parse(this.currentMarkdown);
        previewPane.innerHTML = `<div class="c4ai-markdown-preview">${html}</div>`;
      } else {
        // Fallback
        previewPane.innerHTML = `<div class="c4ai-markdown-preview"><pre>${this.escapeHtml(this.currentMarkdown)}</pre></div>`;
      }
    } catch (error) {
      console.error('Error generating markdown:', error);
      this.showNotification('Error generating markdown', 'error');
    }
  }

  async copyToClipboard() {
    try {
      await navigator.clipboard.writeText(this.currentMarkdown);
      this.showNotification('Markdown copied to clipboard!');
    } catch (err) {
      console.error('Failed to copy:', err);
      this.showNotification('Failed to copy. Please try again.', 'error');
    }
  }

  async downloadMarkdown() {
    const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
    const filename = `crawl4ai-export-${timestamp}.md`;

    // Create blob and download
    const blob = new Blob([this.currentMarkdown], { type: 'text/markdown' });
    const url = URL.createObjectURL(blob);

    const a = document.createElement('a');
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);

    this.showNotification(`Downloaded ${filename}`);
  }

  showNotification(message, type = 'success') {
    const notification = document.createElement('div');
    notification.className = `c4ai-notification c4ai-notification-${type}`;
    notification.textContent = message;

    document.body.appendChild(notification);

    // Animate in
    setTimeout(() => notification.classList.add('show'), 10);

    // Remove after 3 seconds
    setTimeout(() => {
      notification.classList.remove('show');
      setTimeout(() => notification.remove(), 300);
    }, 3000);
  }

  escapeHtml(unsafe) {
    return unsafe
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;")
      .replace(/"/g, "&quot;")
      .replace(/'/g, "&#039;");
  }

  // Get current options
  getOptions() {
    return { ...this.markdownOptions };
  }

  // Update options programmatically
  setOptions(options) {
    this.markdownOptions = { ...this.markdownOptions, ...options };

    // Update checkboxes to reflect new options
    Object.entries(options).forEach(([key, value]) => {
      const checkbox = this.modal?.querySelector(`input[name="${key}"]`);
      if (checkbox && typeof value === 'boolean') {
        checkbox.checked = value;
      }
    });
  }

  // Cleanup
  destroy() {
    if (this.modal) {
      this.modal.remove();
      this.modal = null;
    }
    this.onGenerateMarkdown = null;
  }
}

// Export for use in other scripts
if (typeof window !== 'undefined') {
  window.MarkdownPreviewModal = MarkdownPreviewModal;
}
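The comment above notes that both SchemaBuilder and Click2CrawlBuilder drive this modal through `show(callback)`. A minimal sketch of that flow, using a hypothetical stub in place of the real DOM-backed class so only the callback wiring is illustrated (the stub's name and `demo` function are assumptions, not part of the extension):

```javascript
// Hypothetical stub mirroring the option merge and show(callback) contract of
// MarkdownPreviewModal, without any DOM work (createModal/updateContent omitted).
class MarkdownPreviewModalStub {
  constructor(options = {}) {
    this.markdownOptions = { includeImages: true, preserveLinks: true, textOnly: false, ...options };
    this.onGenerateMarkdown = null;
  }

  async show(generateMarkdownCallback) {
    // The real class stores the callback, builds the modal, then calls
    // updateContent(), which invokes the callback with the current options.
    this.onGenerateMarkdown = generateMarkdownCallback;
    return this.onGenerateMarkdown(this.markdownOptions);
  }
}

async function demo() {
  const modal = new MarkdownPreviewModalStub({ textOnly: true });
  // A builder passes a callback that turns its selected elements into markdown.
  return modal.show(async (options) =>
    options.textOnly ? '# Plain text export' : '# Full export');
}

demo().then(md => console.log(md));
```

This mirrors how `Click2CrawlBuilder.showPreview()` hands `generateMarkdown(options)` to the modal: the modal owns the options, and the builder only supplies the conversion.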
 2841  docs/md_v2/apps/crawl4ai-assistant/content/overlay.css        Normal file
 2515  docs/md_v2/apps/crawl4ai-assistant/content/scriptBuilder.js   Normal file
  253  docs/md_v2/apps/crawl4ai-assistant/content/shared/utils.js    Normal file
@@ -0,0 +1,253 @@
// Shared utilities for Crawl4AI Chrome Extension

// Make element draggable by its titlebar
function makeDraggable(element) {
  let isDragging = false;
  let startX, startY, initialX, initialY;

  const titlebar = element.querySelector('.c4ai-toolbar-titlebar, .c4ai-titlebar');
  if (!titlebar) return;

  titlebar.addEventListener('mousedown', (e) => {
    // Don't drag if clicking on buttons
    if (e.target.classList.contains('c4ai-dot') || e.target.closest('button')) return;

    isDragging = true;
    startX = e.clientX;
    startY = e.clientY;

    const rect = element.getBoundingClientRect();
    initialX = rect.left;
    initialY = rect.top;

    element.style.transition = 'none';
    titlebar.style.cursor = 'grabbing';
  });

  document.addEventListener('mousemove', (e) => {
    if (!isDragging) return;

    const deltaX = e.clientX - startX;
    const deltaY = e.clientY - startY;

    element.style.left = `${initialX + deltaX}px`;
    element.style.top = `${initialY + deltaY}px`;
    element.style.right = 'auto';
  });

  document.addEventListener('mouseup', () => {
    if (isDragging) {
      isDragging = false;
      element.style.transition = '';
      titlebar.style.cursor = 'grab';
    }
  });
}

// Make element draggable by a specific header element
function makeDraggableByHeader(element) {
  let isDragging = false;
  let startX, startY, initialX, initialY;

  const header = element.querySelector('.c4ai-debugger-header');
  if (!header) return;

  header.addEventListener('mousedown', (e) => {
    // Don't drag if clicking on close button
    if (e.target.id === 'c4ai-close-debugger' || e.target.closest('#c4ai-close-debugger')) return;

    isDragging = true;
    startX = e.clientX;
    startY = e.clientY;

    const rect = element.getBoundingClientRect();
    initialX = rect.left;
    initialY = rect.top;

    element.style.transition = 'none';
    header.style.cursor = 'grabbing';
  });

  document.addEventListener('mousemove', (e) => {
    if (!isDragging) return;

    const deltaX = e.clientX - startX;
    const deltaY = e.clientY - startY;

    element.style.left = `${initialX + deltaX}px`;
    element.style.top = `${initialY + deltaY}px`;
    element.style.right = 'auto';
  });

  document.addEventListener('mouseup', () => {
    if (isDragging) {
      isDragging = false;
      element.style.transition = '';
      header.style.cursor = 'grab';
    }
  });
}

// Escape HTML for safe display
function escapeHtml(text) {
  const div = document.createElement('div');
  div.textContent = text;
  return div.innerHTML;
}

// Apply syntax highlighting to Python code
function applySyntaxHighlighting(codeElement) {
  const code = codeElement.textContent;

  // Split by lines to handle line-by-line
  const lines = code.split('\n');
  const highlightedLines = lines.map(line => {
    let highlightedLine = escapeHtml(line);

    // Skip if line is empty
    if (!highlightedLine.trim()) return highlightedLine;

    // Comments (lines starting with #)
    if (highlightedLine.trim().startsWith('#')) {
      return `<span class="c4ai-comment">${highlightedLine}</span>`;
    }

    // Triple quoted strings
    if (highlightedLine.includes('"""')) {
      highlightedLine = highlightedLine.replace(/(""".*?""")/g, '<span class="c4ai-string">$1</span>');
    }

    // Regular strings - single and double quotes
    highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');

    // Keywords - only highlight if not inside a string
    const keywords = ['import', 'from', 'async', 'def', 'await', 'try', 'except', 'with', 'as', 'for', 'if', 'else', 'elif', 'return', 'print', 'open', 'and', 'or', 'not', 'in', 'is', 'class', 'self', 'None', 'True', 'False', '__name__', '__main__'];

    keywords.forEach(keyword => {
      // Use word boundaries and lookahead/lookbehind to ensure we're not in a string
      const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
      highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
    });

    // Functions (word followed by parenthesis)
    highlightedLine = highlightedLine.replace(/\b([a-zA-Z_]\w*)\s*\(/g, '<span class="c4ai-function">$1</span>(');

    return highlightedLine;
  });

  codeElement.innerHTML = highlightedLines.join('\n');
}

// Apply syntax highlighting to JavaScript code
function applySyntaxHighlightingJS(codeElement) {
  const code = codeElement.textContent;

  // Split by lines to handle line-by-line
  const lines = code.split('\n');
  const highlightedLines = lines.map(line => {
    let highlightedLine = escapeHtml(line);

    // Skip if line is empty
    if (!highlightedLine.trim()) return highlightedLine;

    // Comments
    if (highlightedLine.trim().startsWith('//')) {
      return `<span class="c4ai-comment">${highlightedLine}</span>`;
    }

    // Multi-line comments
    highlightedLine = highlightedLine.replace(/(\/\*.*?\*\/)/g, '<span class="c4ai-comment">$1</span>');

    // Template literals
    highlightedLine = highlightedLine.replace(/(`[^`]*`)/g, '<span class="c4ai-string">$1</span>');

    // Regular strings - single and double quotes
    highlightedLine = highlightedLine.replace(/(["'])([^"']*)\1/g, '<span class="c4ai-string">$1$2$1</span>');

    // Keywords
    const keywords = ['const', 'let', 'var', 'function', 'async', 'await', 'if', 'else', 'for', 'while', 'do', 'switch', 'case', 'break', 'continue', 'return', 'try', 'catch', 'finally', 'throw', 'new', 'this', 'class', 'extends', 'import', 'export', 'default', 'from', 'null', 'undefined', 'true', 'false'];

    keywords.forEach(keyword => {
      const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
      highlightedLine = highlightedLine.replace(regex, '<span class="c4ai-keyword">$1</span>');
    });

    // Functions and methods
    highlightedLine = highlightedLine.replace(/\b([a-zA-Z_$][\w$]*)\s*\(/g, '<span class="c4ai-function">$1</span>(');

    // Numbers
    highlightedLine = highlightedLine.replace(/\b(\d+)\b/g, '<span class="c4ai-number">$1</span>');
|
||||
|
||||
return highlightedLine;
|
||||
});
|
||||
|
||||
codeElement.innerHTML = highlightedLines.join('\n');
|
||||
}
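
// The negative lookahead in the keyword pass above is what keeps keywords
// that already sit inside a highlighted string span untouched. A minimal,
// DOM-free sketch (the highlightKeyword helper is hypothetical, written
// only for illustration):

```javascript
// Reproduces the keyword-highlighting replace in isolation.
// The negative lookahead (?![^<]*</span>) rejects a match that is
// followed by a closing </span> with no '<' in between, i.e. a match
// that already lives inside a highlighted span.
function highlightKeyword(line, keyword) {
    const regex = new RegExp(`\\b(${keyword})\\b(?![^<]*</span>)`, 'g');
    return line.replace(regex, '<span class="c4ai-keyword">$1</span>');
}

console.log(highlightKeyword('return value', 'return'));
// → '<span class="c4ai-keyword">return</span> value'
console.log(highlightKeyword('<span class="c4ai-string">return</span>', 'return'));
// → unchanged: "return" is followed by </span>, so the lookahead rejects it
```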

// Get a CSS selector for an element
function getElementSelector(element) {
    // Priority: ID > unique class > tag path with position
    if (element.id) {
        return `#${element.id}`;
    }

    if (element.className && typeof element.className === 'string') {
        const classes = element.className.split(' ').filter(c => c && !c.startsWith('c4ai-'));
        if (classes.length > 0) {
            const selector = `.${classes[0]}`;
            if (document.querySelectorAll(selector).length === 1) {
                return selector;
            }
        }
    }

    // Build a path selector
    const path = [];
    let current = element;

    while (current && current !== document.body) {
        const tagName = current.tagName.toLowerCase();
        const parent = current.parentElement;

        if (parent) {
            const siblings = Array.from(parent.children);
            const index = siblings.indexOf(current) + 1;

            if (siblings.filter(s => s.tagName === current.tagName).length > 1) {
                path.unshift(`${tagName}:nth-child(${index})`);
            } else {
                path.unshift(tagName);
            }
        } else {
            path.unshift(tagName);
        }

        current = parent;
    }

    return path.join(' > ');
}
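
// The while-loop above assembles the path child-to-root via unshift, adding
// :nth-child only when a tag is ambiguous among its siblings. The joining
// step can be sketched on its own (buildPath and its segment shape are
// invented here for illustration):

```javascript
// Renders an array of path segments (root first) into the selector
// string getElementSelector produces, e.g. "div > ul > li:nth-child(3)".
// A segment gets an :nth-child suffix only when needsIndex is set,
// mirroring the duplicate-tag check in the loop above.
function buildPath(segments) {
    return segments
        .map(s => (s.needsIndex ? `${s.tag}:nth-child(${s.index})` : s.tag))
        .join(' > ');
}

console.log(buildPath([
    { tag: 'div', needsIndex: false },
    { tag: 'ul', needsIndex: false },
    { tag: 'li', needsIndex: true, index: 3 },
]));
// → 'div > ul > li:nth-child(3)'
```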

// Check if an element is part of our extension UI
function isOurElement(element) {
    return element.classList.contains('c4ai-highlight-box') ||
        element.classList.contains('c4ai-toolbar') ||
        element.closest('.c4ai-toolbar') ||
        element.classList.contains('c4ai-script-toolbar') ||
        element.closest('.c4ai-script-toolbar') ||
        element.closest('.c4ai-field-dialog') ||
        element.closest('.c4ai-code-modal') ||
        element.closest('.c4ai-wait-dialog') ||
        element.closest('.c4ai-timeline-modal');
}

// Export utilities
window.C4AI_Utils = {
    makeDraggable,
    makeDraggableByHeader,
    escapeHtml,
    applySyntaxHighlighting,
    applySyntaxHighlightingJS,
    getElementSelector,
    isOurElement
};
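
// escapeHtml above relies on the DOM, so it only works where `document`
// exists. A string-based equivalent (a sketch, not part of the extension)
// escapes the same three characters that setting textContent does:

```javascript
// DOM-free HTML escaping; order matters - '&' must be replaced first
// so the entities introduced for '<' and '>' are not double-escaped.
function escapeHtmlString(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

console.log(escapeHtmlString('<div class="x">&</div>'));
// → '&lt;div class="x"&gt;&amp;&lt;/div&gt;'
```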
BIN
docs/md_v2/apps/crawl4ai-assistant/crawl4ai-assistant-v1.2.1.zip
Normal file
BIN
docs/md_v2/apps/crawl4ai-assistant/crawl4ai-assistant-v1.3.0.zip
Normal file
BIN
docs/md_v2/apps/crawl4ai-assistant/icons/favicon.ico
Normal file
After Width: | Height: | Size: 3.4 KiB
BIN
docs/md_v2/apps/crawl4ai-assistant/icons/icon-128.png
Normal file
After Width: | Height: | Size: 1.6 KiB
BIN
docs/md_v2/apps/crawl4ai-assistant/icons/icon-16.png
Normal file
After Width: | Height: | Size: 1.6 KiB
BIN
docs/md_v2/apps/crawl4ai-assistant/icons/icon-48.png
Normal file
After Width: | Height: | Size: 1.6 KiB
974
docs/md_v2/apps/crawl4ai-assistant/index.html
Normal file
@@ -0,0 +1,974 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Crawl4AI Assistant - Chrome Extension for Visual Web Scraping</title>
    <link rel="stylesheet" href="assistant.css">
</head>
<body>
    <div class="terminal-container">
        <div class="header">
            <div class="header-content">
                <div class="logo-section">
                    <img src="../../img/favicon-32x32.png" alt="Crawl4AI Logo" class="logo">
                    <div>
                        <h1>Crawl4AI Assistant</h1>
                        <p class="tagline">Chrome Extension for Visual Web Scraping</p>
                    </div>
                </div>
                <nav class="nav-links">
                    <a href="../../" class="nav-link">← Back to Docs</a>
                    <a href="../" class="nav-link">All Apps</a>
                    <a href="https://github.com/unclecode/crawl4ai" class="nav-link" target="_blank">GitHub</a>
                </nav>
            </div>
        </div>

        <div class="content">
            <!-- Video Section -->
            <section class="video-section">
                <div class="video-wrapper">
                    <video autoplay loop muted playsinline class="demo-video">
                        <source src="demo.mp4" type="video/mp4">
                        Your browser does not support the video tag.
                    </video>
                </div>
            </section>

            <!-- Cloud Announcement Banner -->
            <section class="cloud-banner-section">
                <div class="cloud-banner">
                    <div class="cloud-banner-content">
                        <div class="cloud-banner-text">
                            <h3>You don't need Puppeteer. You need Crawl4AI Cloud.</h3>
                            <p>One API call. JS-rendered. No browser cluster to maintain.</p>
                        </div>
                        <button class="cloud-banner-btn" id="joinWaitlistBanner">
                            Get API Key →
                        </button>
                    </div>
                </div>
            </section>

            <!-- Introduction -->
            <section class="intro-section">
                <div class="terminal-window">
                    <div class="terminal-header">
                        <span class="terminal-title">About Crawl4AI Assistant</span>
                    </div>
                    <div class="terminal-content">
                        <p>Transform any website into structured data with just a few clicks! The Crawl4AI Assistant Chrome Extension provides three powerful tools for web scraping and data extraction.</p>

                        <div style="background: #0fbbaa; color: #070708; padding: 12px 16px; border-radius: 8px; margin: 16px 0; font-weight: 600;">
                            🎉 NEW: Click2Crawl extracts data INSTANTLY without any LLM! Test your schema and see JSON results immediately in the browser!
                        </div>

                        <div class="features-grid">
                            <div class="feature-card">
                                <span class="feature-icon">🎯</span>
                                <h3>Click2Crawl</h3>
                                <p>Visual data extraction - click elements to build schemas instantly!</p>
                            </div>
                            <div class="feature-card">
                                <span class="feature-icon">🔴</span>
                                <h3>Script Builder <span style="color: #f380f5; font-size: 0.75rem;">(Alpha)</span></h3>
                                <p>Record browser actions to create automation scripts</p>
                            </div>
                            <div class="feature-card">
                                <span class="feature-icon">📝</span>
                                <h3>Markdown Extraction <span style="color: #0fbbaa; font-size: 0.75rem;">(New!)</span></h3>
                                <p>Convert any webpage content to clean markdown with Visual Text Mode</p>
                            </div>
                            <!-- <div class="feature-card">
                                <span class="feature-icon">🐍</span>
                                <h3>Python Code</h3>
                                <p>Get production-ready Crawl4AI code instantly</p>
                            </div> -->
                        </div>
                    </div>
                </div>
            </section>
            <!-- Quick Start -->
            <section class="quickstart-section">
                <h2>Quick Start</h2>
                <div class="terminal-window">
                    <div class="terminal-header">
                        <span class="terminal-title">Installation</span>
                    </div>
                    <div class="terminal-content">
                        <div class="installation-steps">
                            <div class="step">
                                <span class="step-number">1</span>
                                <div class="step-content">
                                    <h4>Download the Extension</h4>
                                    <p>Get the latest release from GitHub or use the button below</p>
                                    <a href="crawl4ai-assistant-v1.3.0.zip" class="download-button" download>
                                        <span class="button-icon">↓</span>
                                        Download Extension (v1.3.0)
                                    </a>
                                </div>
                            </div>
                            <div class="step">
                                <span class="step-number">2</span>
                                <div class="step-content">
                                    <h4>Load in Chrome</h4>
                                    <p>Navigate to <code>chrome://extensions/</code> and enable Developer Mode</p>
                                </div>
                            </div>
                            <div class="step">
                                <span class="step-number">3</span>
                                <div class="step-content">
                                    <h4>Load Unpacked</h4>
                                    <p>Click "Load unpacked" and select the extracted extension folder</p>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </section>
            <!-- Interactive Tools Section -->
            <section class="interactive-tools">
                <h2>Explore Our Tools</h2>

                <div class="tools-container">
                    <!-- Left Panel - Tool Selector -->
                    <div class="tools-panel">
                        <div class="tool-selector active" data-tool="click2crawl">
                            <div class="tool-icon">🎯</div>
                            <div class="tool-info">
                                <h3>Click2Crawl</h3>
                                <p>Visual data extraction</p>
                            </div>
                            <div class="tool-status">Available</div>
                        </div>

                        <div class="tool-selector" data-tool="script-builder">
                            <div class="tool-icon">🔴</div>
                            <div class="tool-info">
                                <h3>Script Builder</h3>
                                <p>Browser automation</p>
                            </div>
                            <div class="tool-status alpha">Alpha</div>
                        </div>

                        <div class="tool-selector" data-tool="markdown-extraction">
                            <div class="tool-icon">📝</div>
                            <div class="tool-info">
                                <h3>Markdown Extraction</h3>
                                <p>Content to markdown</p>
                            </div>
                            <div class="tool-status new">New!</div>
                        </div>
                    </div>

                    <!-- Right Panel - Tool Details -->
                    <div class="tool-details">
                        <!-- Click2Crawl Details -->
                        <div class="tool-content active" id="click2crawl">
                            <div class="tool-header">
                                <h3>🎯 Click2Crawl</h3>
                                <span class="tool-tagline">Click elements to build extraction schemas - no LLM needed!</span>
                            </div>

                            <div class="tool-steps">
                                <div class="step-item">
                                    <div class="step-number">1</div>
                                    <div class="step-content">
                                        <h4>Select Container</h4>
                                        <p>Click on any repeating element, such as a product card or article. Use up/down navigation to fine-tune the selection.</p>
                                        <div class="step-visual">
                                            <span class="highlight-green">■</span> Container highlighted in green
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">2</div>
                                    <div class="step-content">
                                        <h4>Click Fields to Extract</h4>
                                        <p>Click on data fields inside the container - choose text, links, images, or attributes</p>
                                        <div class="step-visual">
                                            <span class="highlight-pink">■</span> Fields highlighted in pink
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">3</div>
                                    <div class="step-content">
                                        <h4>Test & Extract Data Instantly</h4>
                                        <p>🎉 Click "Test Schema" to see the extracted JSON immediately - no LLM or coding required!</p>
                                        <div class="step-visual">
                                            <span class="highlight-accent">⚡</span> See extracted JSON immediately
                                        </div>
                                    </div>
                                </div>
                            </div>

                            <div class="tool-features">
                                <div class="feature-tag">🚀 Zero LLM dependency</div>
                                <div class="feature-tag">📊 Instant JSON extraction</div>
                                <div class="feature-tag">🎯 Visual element selection</div>
                                <div class="feature-tag">🐍 Export Python code</div>
                                <div class="feature-tag">✨ Live preview</div>
                                <div class="feature-tag">📥 Download results</div>
                                <div class="feature-tag">📝 Export to markdown</div>
                            </div>
                        </div>

                        <!-- Script Builder Details -->
                        <div class="tool-content" id="script-builder">
                            <div class="tool-header">
                                <h3>🔴 Script Builder</h3>
                                <span class="tool-tagline">Record actions, generate automation</span>
                            </div>

                            <div class="tool-steps">
                                <div class="step-item">
                                    <div class="step-number">1</div>
                                    <div class="step-content">
                                        <h4>Hit Record</h4>
                                        <p>Start capturing your browser interactions</p>
                                        <div class="step-visual">
                                            <span class="recording-dot">●</span> Recording indicator
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">2</div>
                                    <div class="step-content">
                                        <h4>Interact Naturally</h4>
                                        <p>Click, type, scroll - everything is captured</p>
                                        <div class="step-visual">
                                            <span class="action-icon">🖱️</span> <span class="action-icon">⌨️</span> <span class="action-icon">📜</span>
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">3</div>
                                    <div class="step-content">
                                        <h4>Export Script</h4>
                                        <p>Get JavaScript for Crawl4AI's js_code parameter</p>
                                        <div class="step-visual">
                                            <span class="highlight-accent">📝</span> Automation ready
                                        </div>
                                    </div>
                                </div>
                            </div>

                            <div class="tool-features">
                                <div class="feature-tag">Smart action grouping</div>
                                <div class="feature-tag">Wait detection</div>
                                <div class="feature-tag">Keyboard shortcuts</div>
                                <div class="feature-tag alpha-tag">Alpha version</div>
                            </div>
                        </div>

                        <!-- Markdown Extraction Details -->
                        <div class="tool-content" id="markdown-extraction">
                            <div class="tool-header">
                                <h3>📝 Markdown Extraction</h3>
                                <span class="tool-tagline">Convert webpage content to clean markdown "as you see it"</span>
                            </div>

                            <div class="tool-steps">
                                <div class="step-item">
                                    <div class="step-number">1</div>
                                    <div class="step-content">
                                        <h4>Ctrl/Cmd + Click</h4>
                                        <p>Hold Ctrl/Cmd and click the elements you want to extract</p>
                                        <div class="step-visual">
                                            <span class="highlight-green">🔢</span> Numbered selection badges
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">2</div>
                                    <div class="step-content">
                                        <h4>Enable Visual Text Mode</h4>
                                        <p>Extract content "as you see it" - clean text without complex HTML structures</p>
                                        <div class="step-visual">
                                            <span class="highlight-accent">👁️</span> Visual Text Mode (As You See)
                                        </div>
                                    </div>
                                </div>

                                <div class="step-item">
                                    <div class="step-number">3</div>
                                    <div class="step-content">
                                        <h4>Export Clean Markdown</h4>
                                        <p>Get beautifully formatted markdown ready for documentation or LLMs</p>
                                        <div class="step-visual">
                                            <span class="highlight-pink">📄</span> Clean, readable output
                                        </div>
                                    </div>
                                </div>
                            </div>

                            <div class="tool-features">
                                <div class="feature-tag">Multi-select with Ctrl/Cmd</div>
                                <div class="feature-tag">Visual Text Mode (As You See)</div>
                                <div class="feature-tag">Clean markdown output</div>
                                <div class="feature-tag">Export to Crawl4AI Cloud (soon)</div>
                            </div>
                        </div>
                    </div>
                </div>
            </section>
            <!-- Interactive Code Examples -->
            <section class="code-showcase">
                <h2>See the Generated Code & Extracted Data</h2>

                <div class="code-tabs">
                    <button class="code-tab active" data-example="schema">🎯 Click2Crawl</button>
                    <button class="code-tab" data-example="script">🔴 Script Builder</button>
                    <button class="code-tab" data-example="markdown">📝 Markdown Extraction</button>
                </div>

                <div class="code-examples">
                    <!-- Click2Crawl Code -->
                    <div class="code-example active" id="code-schema">
                        <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 16px;">
                            <!-- Python Code -->
                            <div class="terminal-window">
                                <div class="terminal-header">
                                    <span class="terminal-title">click2crawl_extraction.py</span>
                                    <button class="copy-button" data-code="schema-python">Copy</button>
                                </div>
                                <div class="terminal-content">
                                    <pre><code><span class="comment">#!/usr/bin/env python3</span>
<span class="comment">"""
🎉 NO LLM NEEDED! Direct extraction with CSS selectors
Generated by Crawl4AI Chrome Extension - Click2Crawl
"""</span>

<span class="keyword">import</span> asyncio
<span class="keyword">import</span> json
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
<span class="keyword">from</span> crawl4ai.extraction_strategy <span class="keyword">import</span> JsonCssExtractionStrategy

<span class="comment"># The EXACT schema from Click2Crawl - no guessing!</span>
EXTRACTION_SCHEMA = {
    <span class="string">"name"</span>: <span class="string">"Product Catalog"</span>,
    <span class="string">"baseSelector"</span>: <span class="string">"div.product-card"</span>,  <span class="comment"># The container you selected</span>
    <span class="string">"fields"</span>: [
        {
            <span class="string">"name"</span>: <span class="string">"title"</span>,
            <span class="string">"selector"</span>: <span class="string">"h3.product-title"</span>,
            <span class="string">"type"</span>: <span class="string">"text"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"price"</span>,
            <span class="string">"selector"</span>: <span class="string">"span.price"</span>,
            <span class="string">"type"</span>: <span class="string">"text"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"image"</span>,
            <span class="string">"selector"</span>: <span class="string">"img.product-img"</span>,
            <span class="string">"type"</span>: <span class="string">"attribute"</span>,
            <span class="string">"attribute"</span>: <span class="string">"src"</span>
        },
        {
            <span class="string">"name"</span>: <span class="string">"link"</span>,
            <span class="string">"selector"</span>: <span class="string">"a.product-link"</span>,
            <span class="string">"type"</span>: <span class="string">"attribute"</span>,
            <span class="string">"attribute"</span>: <span class="string">"href"</span>
        }
    ]
}

<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">extract_data</span>(url: str):
    <span class="comment"># Direct extraction - no LLM API calls!</span>
    extraction_strategy = JsonCssExtractionStrategy(schema=EXTRACTION_SCHEMA)

    <span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
        result = <span class="keyword">await</span> crawler.arun(
            url=url,
            config=CrawlerRunConfig(extraction_strategy=extraction_strategy)
        )

        <span class="keyword">if</span> result.success:
            data = json.loads(result.extracted_content)
            <span class="keyword">print</span>(<span class="string">f"✅ Extracted {len(data)} items instantly!"</span>)

            <span class="comment"># Save to file</span>
            <span class="keyword">with</span> open(<span class="string">'products.json'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:
                json.dump(data, f, indent=2)

            <span class="keyword">return</span> data

<span class="comment"># Run the extraction on any similar page!</span>
data = asyncio.run(extract_data(<span class="string">"https://example.com/products"</span>))

<span class="comment"># 🎯 Result: clean JSON data, no LLM costs, instant results!</span></code></pre>
                                </div>
                            </div>

                            <!-- Extracted JSON Data -->
                            <div class="terminal-window">
                                <div class="terminal-header">
                                    <span class="terminal-title">extracted_data.json</span>
                                    <button class="copy-button" data-code="schema-json">Copy</button>
                                </div>
                                <div class="terminal-content">
                                    <pre><code><span class="comment">// 🎉 Instantly extracted from the page - no coding required!</span>
[
    {
        <span class="string">"title"</span>: <span class="string">"Wireless Bluetooth Headphones"</span>,
        <span class="string">"price"</span>: <span class="string">"$79.99"</span>,
        <span class="string">"image"</span>: <span class="string">"https://example.com/images/headphones-bt-01.jpg"</span>,
        <span class="string">"link"</span>: <span class="string">"/products/wireless-bluetooth-headphones"</span>
    },
    {
        <span class="string">"title"</span>: <span class="string">"Smart Watch Pro 2024"</span>,
        <span class="string">"price"</span>: <span class="string">"$299.00"</span>,
        <span class="string">"image"</span>: <span class="string">"https://example.com/images/smartwatch-pro.jpg"</span>,
        <span class="string">"link"</span>: <span class="string">"/products/smart-watch-pro-2024"</span>
    },
    {
        <span class="string">"title"</span>: <span class="string">"4K Webcam for Streaming"</span>,
        <span class="string">"price"</span>: <span class="string">"$149.99"</span>,
        <span class="string">"image"</span>: <span class="string">"https://example.com/images/webcam-4k.jpg"</span>,
        <span class="string">"link"</span>: <span class="string">"/products/4k-webcam-streaming"</span>
    },
    {
        <span class="string">"title"</span>: <span class="string">"Mechanical Gaming Keyboard RGB"</span>,
        <span class="string">"price"</span>: <span class="string">"$129.99"</span>,
        <span class="string">"image"</span>: <span class="string">"https://example.com/images/keyboard-gaming.jpg"</span>,
        <span class="string">"link"</span>: <span class="string">"/products/mechanical-gaming-keyboard"</span>
    },
    {
        <span class="string">"title"</span>: <span class="string">"USB-C Hub 7-in-1"</span>,
        <span class="string">"price"</span>: <span class="string">"$45.99"</span>,
        <span class="string">"image"</span>: <span class="string">"https://example.com/images/usbc-hub.jpg"</span>,
        <span class="string">"link"</span>: <span class="string">"/products/usb-c-hub-7in1"</span>
    }
]</code></pre>
                                </div>
                            </div>
                        </div>
                    </div>

                    <!-- Script Builder Code -->
                    <div class="code-example" id="code-script">
                        <div class="terminal-window">
                            <div class="terminal-header">
                                <span class="terminal-title">automation_script.py</span>
                                <button class="copy-button" data-code="script">Copy</button>
                            </div>
                            <div class="terminal-content">
                                <pre><code><span class="keyword">import</span> asyncio
<span class="keyword">from</span> crawl4ai <span class="keyword">import</span> AsyncWebCrawler, CrawlerRunConfig

<span class="comment"># JavaScript generated from your recorded actions</span>
js_script = <span class="string">"""
// Search for products
document.querySelector('button.search-toggle').click();
await new Promise(r => setTimeout(r, 500));

// Type the search query
const searchInput = document.querySelector('input#search');
searchInput.value = 'wireless headphones';
searchInput.dispatchEvent(new Event('input', {bubbles: true}));

// Submit the search
searchInput.dispatchEvent(new KeyboardEvent('keydown', {
    key: 'Enter', keyCode: 13, bubbles: true
}));

// Wait for results
await new Promise(r => setTimeout(r, 2000));

// Click the first product
document.querySelector('.product-item:first-child').click();

// Wait for the product page
await new Promise(r => setTimeout(r, 1000));

// Add to cart
document.querySelector('button.add-to-cart').click();
"""</span>

<span class="keyword">async</span> <span class="keyword">def</span> <span class="function">automate_shopping</span>():
    config = CrawlerRunConfig(
        js_code=js_script,
        wait_for=<span class="string">"css:.cart-confirmation"</span>,
        screenshot=<span class="keyword">True</span>
    )

    <span class="keyword">async</span> <span class="keyword">with</span> AsyncWebCrawler() <span class="keyword">as</span> crawler:
        result = <span class="keyword">await</span> crawler.arun(
            url=<span class="string">"https://shop.example.com"</span>,
            config=config
        )
        <span class="keyword">print</span>(<span class="string">f"✓ Automation complete: {result.url}"</span>)
        <span class="keyword">return</span> result

asyncio.run(automate_shopping())</code></pre>
                            </div>
                        </div>
                    </div>

                    <!-- Markdown Extraction Output -->
                    <div class="code-example" id="code-markdown">
                        <div class="terminal-window">
                            <div class="terminal-header">
                                <span class="terminal-title">extracted_content.md</span>
                                <button class="copy-button" data-code="markdown">Copy</button>
                            </div>
                            <div class="terminal-content">
                                <pre><code><span class="comment"># Extracted from Hacker News with Visual Text Mode 👁️</span>

<span class="string">1. **Show HN: I built a tool to find and reach out to YouTubers** (hellosimply.io)
   84 points by erickim 2 hours ago | hide | 31 comments

2. **The 24 Hour Restaurant** (logicmag.io)
   124 points by helsinkiandrew 5 hours ago | hide | 52 comments

3. **Building a Better Bloom Filter in Rust** (carlmastrangelo.com)
   89 points by carlmastrangelo 3 hours ago | hide | 27 comments

---

### Article: The 24 Hour Restaurant

In New York City, the 24-hour restaurant is becoming extinct. What we lose when we can no longer eat whenever we want.

When I first moved to New York, I loved that I could get a full meal at 3 AM. Not just pizza or fast food, but a proper sit-down dinner with table service and a menu that ran for pages. The city that never sleeps had restaurants that matched its rhythm.

Today, finding a 24-hour restaurant in Manhattan requires genuine effort. The pandemic accelerated a decline that was already underway, but the roots go deeper: rising rents, changing labor laws, and shifting cultural patterns have all contributed to the death of round-the-clock dining.

---

### Product Review: Framework Laptop 16

**Specifications:**
- Display: 16" 2560×1600 165Hz
- Processor: AMD Ryzen 7 7840HS
- Memory: 32GB DDR5-5600
- Storage: 2TB NVMe Gen4
- Price: Starting at $1,399

**Pros:**
- Fully modular and repairable
- Excellent Linux support
- Great keyboard and trackpad
- Expansion card system

**Cons:**
- Battery life could be better
- Slightly heavier than competitors
- Fan noise under load</span></code></pre>
                            </div>
                        </div>
                    </div>
                </div>
            </section>
            <!-- Crawl4AI Cloud Section -->
            <section class="cloud-section">
                <div class="cloud-announcement">
                    <h2>Crawl4AI Cloud</h2>
                    <p class="cloud-tagline">Your browser cluster without the cluster.</p>

                    <div class="cloud-features-preview">
                        <div class="cloud-feature-item">
                            ⚡ POST /crawl
                        </div>
                        <div class="cloud-feature-item">
                            🌐 JS-rendered pages
                        </div>
                        <div class="cloud-feature-item">
                            📊 Schema extraction built-in
                        </div>
                        <div class="cloud-feature-item">
                            💰 $0.001/page
                        </div>
                    </div>

                    <button class="cloud-cta-button" id="joinWaitlist">
                        Get Early Access →
                    </button>

                    <p class="cloud-hint">See it extract your own data. Right now.</p>
                </div>

                <!-- Hidden Signup Form -->
                <div class="signup-overlay" id="signupOverlay">
                    <div class="signup-container" id="signupContainer">
                        <button class="close-signup" id="closeSignup">×</button>

                        <div class="signup-content" id="signupForm">
                            <h3>🚀 Join the C4AI Cloud Waiting List</h3>
                            <p>Be among the first to experience the future of web scraping</p>

                            <form id="waitlistForm" class="waitlist-form">
                                <div class="form-field">
                                    <label for="userName">Your Name</label>
                                    <input type="text" id="userName" name="name" placeholder="John Doe" required>
                                </div>

                                <div class="form-field">
                                    <label for="userEmail">Email Address</label>
                                    <input type="email" id="userEmail" name="email" placeholder="john@example.com" required>
                                </div>

                                <div class="form-field">
                                    <label for="userCompany">Company (Optional)</label>
                                    <input type="text" id="userCompany" name="company" placeholder="Acme Inc.">
                                </div>

                                <div class="form-field">
                                    <label for="useCase">What will you use Crawl4AI Cloud for?</label>
                                    <select id="useCase" name="useCase">
                                        <option value="">Select use case...</option>
                                        <option value="price-monitoring">Price Monitoring</option>
                                        <option value="news-aggregation">News Aggregation</option>
                                        <option value="market-research">Market Research</option>
                                        <option value="ai-training">AI Training Data</option>
                                        <option value="other">Other</option>
                                    </select>
                                </div>

                                <button type="submit" class="submit-button">
                                    <span>🎯</span> Submit & Watch the Magic
                                </button>
                            </form>
                        </div>

                        <!-- Crawling Animation -->
                        <div class="crawl-animation" id="crawlAnimation" style="display: none;">
                            <div class="terminal-window crawl-terminal">
                                <div class="terminal-header">
                                    <span class="terminal-title">Crawl4AI Cloud Demo</span>
                                </div>
                                <div class="terminal-content">
                                    <pre id="crawlOutput" class="crawl-log"><code>$ crawl4ai cloud extract --url "signup-form" --auto-detect</code></pre>
                                </div>
                            </div>

                            <div class="extracted-preview" id="extractedPreview" style="display: none;">
                                <h4>📊 Extracted Data</h4>
                                <pre class="json-preview"><code id="jsonOutput"></code></pre>
                            </div>

                            <div class="success-message" id="successMessage" style="display: none;">
                                <div class="success-icon">✅</div>
                                <h3>Data Uploaded Successfully!</h3>
                                <p>You're on the Crawl4AI Cloud waiting list!</p>
                                <p>What you just witnessed:</p>
                                <ul>
                                    <li>⚡ Real-time extraction of your form data</li>
                                    <li>🔄 Automatic schema detection</li>
                                    <li>📤 Instant cloud processing</li>
                                    <li>✨ No code required - just like that!</li>
                                </ul>
                                <p class="success-note">We'll notify you at <strong id="userEmailDisplay"></strong> when Crawl4AI Cloud launches!</p>
                                <button class="continue-button" id="continueBtn">Continue Exploring</button>
                            </div>
                        </div>
                    </div>
                </div>
            </section>
<!-- Coming Soon Section -->
<section class="coming-soon-section">
    <h2>More Features Coming Soon</h2>
    <div class="terminal-window">
        <div class="terminal-header">
            <span class="terminal-title">Roadmap</span>
        </div>
        <div class="terminal-content">
            <p class="intro-text">We're continuously expanding C4AI Assistant with powerful new features:</p>

            <div class="coming-features">
                <div class="coming-feature">
                    <div class="feature-header">
                        <span class="feature-badge">Direct</span>
                        <h3>Direct Data Download</h3>
                    </div>
                    <p>Skip the code generation entirely! Download extracted data directly from Click2Crawl as JSON or CSV files.</p>
                    <div class="feature-preview">
                        <code>📊 One-click download • No Python needed • Multiple export formats</code>
                    </div>
                </div>

                <div class="coming-feature">
                    <div class="feature-header">
                        <span class="feature-badge">AI</span>
                        <h3>Smart Field Detection</h3>
                    </div>
                    <p>AI-powered field detection for Click2Crawl that automatically suggests the most likely data fields on any page.</p>
                    <div class="feature-preview">
                        <code>🤖 Auto-detect fields • Smart naming • Pattern recognition</code>
                    </div>
                </div>
            </div>

            <div class="stay-tuned">
                <p>🚀 Stay tuned for updates! Follow our <a href="https://github.com/unclecode/crawl4ai" target="_blank">GitHub</a> for the latest releases.</p>
            </div>
        </div>
    </div>
</section>
<!-- Footer -->
<footer class="footer">
    <div class="footer-content">
        <div class="footer-section">
            <h4>Resources</h4>
            <ul>
                <li><a href="https://github.com/unclecode/crawl4ai">GitHub Repository</a></li>
                <li><a href="../../">Documentation</a></li>
                <li><a href="https://discord.gg/jP8KfhDhyN">Discord Community</a></li>
            </ul>
        </div>
        <div class="footer-section">
            <h4>Connect</h4>
            <ul>
                <li><a href="https://twitter.com/unclecode">Twitter @unclecode</a></li>
                <li><a href="https://github.com/unclecode">GitHub @unclecode</a></li>
            </ul>
        </div>
    </div>
    <div class="footer-bottom">
        <p>Made with 🚀 by the Crawl4AI team</p>
    </div>
</footer>
</div>
</div>
<script>
// Tool Selector Interaction
document.querySelectorAll('.tool-selector').forEach(selector => {
    selector.addEventListener('click', function() {
        // Remove active class from all selectors
        document.querySelectorAll('.tool-selector').forEach(s => s.classList.remove('active'));
        document.querySelectorAll('.tool-content').forEach(c => c.classList.remove('active'));

        // Add active class to clicked selector
        this.classList.add('active');

        // Show corresponding content
        const toolId = this.getAttribute('data-tool');
        const contentElement = document.getElementById(toolId);
        if (contentElement) {
            contentElement.classList.add('active');
        }
    });
});

// Code Tab Interaction
document.querySelectorAll('.code-tab').forEach(tab => {
    tab.addEventListener('click', function() {
        // Remove active class from all tabs
        document.querySelectorAll('.code-tab').forEach(t => t.classList.remove('active'));
        document.querySelectorAll('.code-example').forEach(e => e.classList.remove('active'));

        // Add active class to clicked tab
        this.classList.add('active');

        // Show corresponding code
        const exampleId = this.getAttribute('data-example');
        document.getElementById('code-' + exampleId).classList.add('active');
    });
});

// Copy Button Functionality
document.querySelectorAll('.copy-button').forEach(button => {
    button.addEventListener('click', async function() {
        const codeType = this.getAttribute('data-code');
        let codeText = '';

        // Handle different code types
        if (codeType === 'schema-python') {
            const codeElement = document.querySelector('#code-schema .terminal-window:first-child pre code');
            codeText = codeElement.textContent;
        } else if (codeType === 'schema-json') {
            const codeElement = document.querySelector('#code-schema .terminal-window:last-child pre code');
            codeText = codeElement.textContent;
        } else {
            const codeElement = document.getElementById('code-' + codeType).querySelector('pre code');
            codeText = codeElement.textContent;
        }

        try {
            await navigator.clipboard.writeText(codeText);
            this.textContent = 'Copied!';
            this.classList.add('copied');

            setTimeout(() => {
                this.textContent = 'Copy';
                this.classList.remove('copied');
            }, 2000);
        } catch (err) {
            console.error('Failed to copy code:', err);
        }
    });
});

// Crawl4AI Cloud Interactive Demo
const joinWaitlistBtn = document.getElementById('joinWaitlist');
const signupOverlay = document.getElementById('signupOverlay');
const closeSignupBtn = document.getElementById('closeSignup');
const waitlistForm = document.getElementById('waitlistForm');
const signupForm = document.getElementById('signupForm');
const crawlAnimation = document.getElementById('crawlAnimation');
const crawlOutput = document.getElementById('crawlOutput');
const extractedPreview = document.getElementById('extractedPreview');
const jsonOutput = document.getElementById('jsonOutput');
const successMessage = document.getElementById('successMessage');
const continueBtn = document.getElementById('continueBtn');
const userEmailDisplay = document.getElementById('userEmailDisplay');

// Open signup modal
joinWaitlistBtn.addEventListener('click', () => {
    signupOverlay.classList.add('active');
});

// Banner button
const joinWaitlistBannerBtn = document.getElementById('joinWaitlistBanner');
if (joinWaitlistBannerBtn) {
    joinWaitlistBannerBtn.addEventListener('click', () => {
        signupOverlay.classList.add('active');
    });
}

// Close signup modal
closeSignupBtn.addEventListener('click', () => {
    signupOverlay.classList.remove('active');
});

// Close on overlay click
signupOverlay.addEventListener('click', (e) => {
    if (e.target === signupOverlay) {
        signupOverlay.classList.remove('active');
    }
});

// Continue button
if (continueBtn) {
    continueBtn.addEventListener('click', () => {
        signupOverlay.classList.remove('active');
        // Reset form for next time
        waitlistForm.reset();
        signupForm.style.display = 'block';
        crawlAnimation.style.display = 'none';
        extractedPreview.style.display = 'none';
        successMessage.style.display = 'none';
    });
}

// Form submission with crawling animation
waitlistForm.addEventListener('submit', async (e) => {
    e.preventDefault();

    // Get form data
    const formData = {
        name: document.getElementById('userName').value,
        email: document.getElementById('userEmail').value,
        company: document.getElementById('userCompany').value || 'Not specified',
        useCase: document.getElementById('useCase').value || 'General web scraping',
        timestamp: new Date().toISOString(),
        source: 'Crawl4AI Assistant Landing Page'
    };

    // Update email display
    userEmailDisplay.textContent = formData.email;

    // Hide form and show crawling animation
    signupForm.style.display = 'none';
    crawlAnimation.style.display = 'block';

    // Clear previous output
    const codeElement = crawlOutput.querySelector('code');
    codeElement.innerHTML = '$ crawl4ai cloud extract --url "signup-form" --auto-detect\n\n';

    // Simulate crawling process with proper C4AI log format
    const crawlSteps = [
        {
            log: '<span class="log-init">[INIT]....</span> → Crawl4AI Cloud 1.0.0',
            time: '0.12s'
        },
        {
            log: '<span class="log-fetch">[FETCH]...</span> ↓ https://crawl4ai.com/waitlist-form',
            time: '0.45s'
        },
        {
            log: '<span class="log-scrape">[SCRAPE]..</span> ◆ https://crawl4ai.com/waitlist-form',
            time: '0.28s'
        },
        {
            log: '<span class="log-extract">[EXTRACT].</span> ■ Extracting form data with auto-detect',
            time: '0.55s'
        },
        {
            log: '<span class="log-complete">[COMPLETE]</span> ● https://crawl4ai.com/waitlist-form',
            time: '1.40s'
        }
    ];

    let stepIndex = 0;
    const typeStep = async () => {
        if (stepIndex < crawlSteps.length) {
            const step = crawlSteps[stepIndex];
            codeElement.innerHTML += step.log + ' | <span class="log-success">✓</span> | <span class="log-time">⏱: ' + step.time + '</span>\n';
            stepIndex++;

            // Scroll to bottom
            const terminal = crawlOutput.parentElement;
            terminal.scrollTop = terminal.scrollHeight;

            setTimeout(typeStep, 600);
        } else {
            // Show extracted data
            setTimeout(() => {
                codeElement.innerHTML += '\n<span class="log-success">[UPLOAD]..</span> ↑ Uploading to Crawl4AI Cloud...';

                setTimeout(() => {
                    extractedPreview.style.display = 'block';
                    jsonOutput.textContent = JSON.stringify(formData, null, 2);

                    // Add syntax highlighting
                    jsonOutput.innerHTML = jsonOutput.textContent
                        .replace(/"([^"]+)":/g, '<span class="string">"$1"</span>:')
                        .replace(/: "([^"]+)"/g, ': <span class="string">"$1"</span>');

                    codeElement.innerHTML += ' | <span class="log-success">✓</span> | <span class="log-time">⏱: 0.23s</span>\n';
                    codeElement.innerHTML += '\n<span class="log-success">[SUCCESS]</span> ✨ Data uploaded successfully!';

                    // Show success message after a delay
                    setTimeout(() => {
                        successMessage.style.display = 'block';

                        // Smooth scroll to bottom to show success message
                        setTimeout(() => {
                            const container = document.getElementById('signupContainer');
                            container.scrollTo({
                                top: container.scrollHeight,
                                behavior: 'smooth'
                            });
                        }, 100);

                        // Actually submit to waiting list (you can implement this)
                        console.log('Waitlist submission:', formData);
                    }, 1500);
                }, 800);
            }, 600);
        }
    };

    // Start the animation
    setTimeout(typeStep, 500);
});
</script>
</body>
</html>
69
docs/md_v2/apps/crawl4ai-assistant/libs/marked.min.js
vendored
Normal file
54
docs/md_v2/apps/crawl4ai-assistant/manifest.json
Normal file
@@ -0,0 +1,54 @@
{
    "manifest_version": 3,
    "name": "Crawl4AI Assistant",
    "version": "1.3.0",
    "description": "Visual schema and script builder for Crawl4AI - Build extraction schemas and automation scripts by clicking and recording actions",
    "permissions": [
        "activeTab",
        "storage",
        "downloads"
    ],
    "host_permissions": [
        "<all_urls>"
    ],
    "action": {
        "default_popup": "popup/popup.html",
        "default_icon": {
            "16": "icons/icon-16.png",
            "48": "icons/icon-48.png",
            "128": "icons/icon-128.png"
        }
    },
    "content_scripts": [
        {
            "matches": ["<all_urls>"],
            "js": [
                "libs/marked.min.js",
                "content/shared/utils.js",
                "content/markdownPreviewModal.js",
                "content/click2crawl.js",
                "content/scriptBuilder.js",
                "content/contentAnalyzer.js",
                "content/markdownConverter.js",
                "content/markdownExtraction.js",
                "content/content.js"
            ],
            "css": ["content/overlay.css"],
            "run_at": "document_idle"
        }
    ],
    "background": {
        "service_worker": "background/service-worker.js"
    },
    "icons": {
        "16": "icons/icon-16.png",
        "48": "icons/icon-48.png",
        "128": "icons/icon-128.png"
    },
    "web_accessible_resources": [
        {
            "resources": ["icons/*", "assets/*"],
            "matches": ["<all_urls>"]
        }
    ]
}
BIN
docs/md_v2/apps/crawl4ai-assistant/popup/icons/favicon.ico
Normal file
After Width: | Height: | Size: 3.4 KiB |
BIN
docs/md_v2/apps/crawl4ai-assistant/popup/icons/icon-128.png
Normal file
After Width: | Height: | Size: 1.6 KiB |
BIN
docs/md_v2/apps/crawl4ai-assistant/popup/icons/icon-16.png
Normal file
After Width: | Height: | Size: 1.6 KiB |
BIN
docs/md_v2/apps/crawl4ai-assistant/popup/icons/icon-48.png
Normal file
After Width: | Height: | Size: 1.6 KiB |
330
docs/md_v2/apps/crawl4ai-assistant/popup/popup.css
Normal file
@@ -0,0 +1,330 @@
/* Font Face Definitions */
@font-face {
    font-family: 'Dank Mono';
    src: url('../assets/DankMono-Regular.woff2') format('woff2');
    font-weight: 400;
    font-style: normal;
    font-display: swap;
}

@font-face {
    font-family: 'Dank Mono';
    src: url('../assets/DankMono-Bold.woff2') format('woff2');
    font-weight: 700;
    font-style: normal;
    font-display: swap;
}

@font-face {
    font-family: 'Dank Mono';
    src: url('../assets/DankMono-Italic.woff2') format('woff2');
    font-weight: 400;
    font-style: italic;
    font-display: swap;
}

:root {
    --font-primary: 'Dank Mono', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, monospace;
}

* {
    margin: 0;
    padding: 0;
    box-sizing: border-box;
}

body {
    width: 380px;
    font-family: var(--font-primary);
    background: #0a0a0a;
    color: #e0e0e0;
    border-radius: 16px;
    overflow: hidden;
}

.popup-container {
    padding: 20px;
}

header {
    display: flex;
    align-items: center;
    gap: 16px;
    margin-bottom: 20px;
    padding-bottom: 16px;
    border-bottom: 1px solid #2a2a2a;
}

.logo {
    width: 48px;
    height: 48px;
    flex-shrink: 0;
}

.header-content {
    flex: 1;
}

header h1 {
    font-size: 20px;
    font-weight: 600;
    color: #fff;
    margin: 0 0 4px 0;
}

.header-stats {
    display: flex;
    align-items: center;
    gap: 12px;
}

.github-stars {
    display: flex;
    align-items: center;
    gap: 6px;
    color: #999;
    text-decoration: none;
    font-size: 13px;
    transition: color 0.2s ease;
}

.github-stars:hover {
    color: #4a9eff;
}

.github-icon {
    flex-shrink: 0;
}

.mode-selector {
    display: flex;
    flex-direction: column;
    gap: 12px;
    margin-bottom: 20px;
}

.mode-button {
    display: flex;
    align-items: center;
    gap: 16px;
    padding: 16px;
    background: #1a1a1a;
    border: 2px solid #2a2a2a;
    border-radius: 12px;
    cursor: pointer;
    transition: all 0.2s ease;
    width: 100%;
    text-align: left;
}

.mode-button:hover:not(:disabled) {
    background: #252525;
    border-color: #4a9eff;
    transform: translateY(-2px);
}

.mode-button:disabled {
    opacity: 0.5;
    cursor: not-allowed;
}

.mode-button .icon {
    font-size: 32px;
    width: 48px;
    height: 48px;
    display: flex;
    align-items: center;
    justify-content: center;
    background: #252525;
    border-radius: 8px;
}

.mode-button.schema .icon {
    background: #1e3a5f;
}

.mode-button.script .icon {
    background: #3a1e5f;
}

.mode-button.c2c .icon {
    background: #1e5f3a;
}

.mode-info h3 {
    font-size: 16px;
    color: #fff;
    margin-bottom: 4px;
}

.mode-info p {
    font-size: 13px;
    color: #999;
    line-height: 1.4;
}

.active-session {
    background: #1a1a1a;
    border: 2px solid #4a9eff;
    border-radius: 12px;
    padding: 16px;
    margin-bottom: 20px;
}

.active-session.hidden {
    display: none;
}

.session-header {
    display: flex;
    align-items: center;
    gap: 8px;
    margin-bottom: 12px;
}

.status-dot {
    width: 8px;
    height: 8px;
    background: #4a9eff;
    border-radius: 50%;
    animation: pulse 2s infinite;
}

@keyframes pulse {
    0%, 100% { opacity: 1; }
    50% { opacity: 0.5; }
}

.session-title {
    font-size: 14px;
    font-weight: 600;
    color: #4a9eff;
}

.session-stats {
    display: flex;
    gap: 20px;
    margin-bottom: 16px;
    padding: 12px;
    background: #0a0a0a;
    border-radius: 8px;
}

.stat {
    flex: 1;
}

.stat-label {
    display: block;
    font-size: 11px;
    color: #666;
    text-transform: uppercase;
    margin-bottom: 4px;
}

.stat-value {
    font-size: 14px;
    color: #fff;
    font-weight: 600;
}

.session-actions {
    display: flex;
    gap: 8px;
}

.action-button {
    flex: 1;
    padding: 10px 16px;
    border: none;
    border-radius: 8px;
    font-size: 14px;
    font-weight: 600;
    cursor: pointer;
    transition: all 0.2s ease;
}

.action-button.primary {
    background: #4a9eff;
    color: #000;
}

.action-button.primary:hover:not(:disabled) {
    background: #3a8eef;
    transform: translateY(-1px);
}

.action-button.primary:disabled {
    background: #2a4a7f;
    color: #666;
    cursor: not-allowed;
}

.action-button.secondary {
    background: #2a2a2a;
    color: #fff;
}

.action-button.secondary:hover {
    background: #3a3a3a;
}

.instructions {
    background: #1a1a1a;
    border-radius: 12px;
    padding: 16px;
    margin-bottom: 16px;
}

.instructions h4 {
    font-size: 14px;
    margin-bottom: 12px;
    color: #fff;
}

.instructions ol {
    padding-left: 20px;
}

.instructions li {
    font-size: 13px;
    line-height: 1.6;
    color: #ccc;
    margin-bottom: 6px;
}

footer {
    padding-top: 16px;
    border-top: 1px solid #2a2a2a;
}

.social-links {
    display: flex;
    justify-content: center;
    gap: 16px;
}

.social-link {
    display: flex;
    align-items: center;
    gap: 6px;
    color: #999;
    text-decoration: none;
    font-size: 12px;
    transition: all 0.2s ease;
    padding: 6px 12px;
    border-radius: 6px;
    background: #1a1a1a;
}

.social-link:hover {
    color: #0fbbaa;
    background: #2a2a2a;
    transform: translateY(-1px);
}

.social-link svg {
    width: 16px;
    height: 16px;
    flex-shrink: 0;
}
111
docs/md_v2/apps/crawl4ai-assistant/popup/popup.html
Normal file
@@ -0,0 +1,111 @@
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <link rel="stylesheet" href="popup.css">
</head>
<body>
    <div class="popup-container">
        <header>
            <img src="icons/icon-48.png" class="logo" alt="Crawl4AI">
            <div class="header-content">
                <h1>Crawl4AI Assistant</h1>
                <div class="header-stats">
                    <a href="https://github.com/unclecode/crawl4ai" target="_blank" class="github-stars">
                        <svg class="github-icon" viewBox="0 0 16 16" width="16" height="16">
                            <path fill="currentColor" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 8c0-4.42-3.58-8-8-8z"></path>
                        </svg>
                        <span id="stars-count">Loading...</span>
                    </a>
                </div>
            </div>
        </header>

        <div class="mode-selector">
            <button id="schema-mode" class="mode-button schema">
                <div class="icon">🎯</div>
                <div class="mode-info">
                    <h3>Click2Crawl</h3>
                    <p>Click elements to build extraction schemas</p>
                </div>
            </button>

            <button id="script-mode" class="mode-button script">
                <div class="icon">🔴</div>
                <div class="mode-info">
                    <h3>Script Builder <span style="color: #ff3c74; font-size: 10px;">(Alpha)</span></h3>
                    <p>Record actions to build automation scripts</p>
                </div>
            </button>

            <button id="c2c-mode" class="mode-button c2c">
                <div class="icon">📝</div>
                <div class="mode-info">
                    <h3>Markdown Extraction</h3>
                    <p>Select elements and convert to clean markdown</p>
                </div>
            </button>
        </div>

        <div id="active-session" class="active-session hidden">
            <div class="session-header">
                <span class="status-dot"></span>
                <span class="session-title">Schema Capture Active</span>
            </div>
            <div class="session-stats">
                <div class="stat">
                    <span class="stat-label">Container:</span>
                    <span id="container-status" class="stat-value">Not selected</span>
                </div>
                <div class="stat">
                    <span class="stat-label">Fields:</span>
                    <span id="fields-count" class="stat-value">0</span>
                </div>
            </div>
            <div class="session-actions">
                <button id="generate-code" class="action-button primary" disabled>
                    <span>Generate Code</span>
                </button>
                <button id="stop-capture" class="action-button secondary">
                    <span>Stop Capture</span>
                </button>
            </div>
        </div>

        <div class="instructions" style="display: none;">
            <h4>How to use:</h4>
            <ol>
                <li>Click "Click2Crawl" to start</li>
                <li>Click on a container element (e.g., product card)</li>
                <li>Click individual fields inside and name them</li>
                <li>Generate Python code when done</li>
            </ol>
        </div>

        <footer>
            <div class="social-links">
                <a href="https://docs.crawl4ai.com" target="_blank" class="social-link">
                    <svg viewBox="0 0 24 24" width="16" height="16">
                        <path fill="currentColor" d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 15l-5-5 1.41-1.41L10 14.17l7.59-7.59L19 8l-9 9z"/>
                    </svg>
                    <span>Docs</span>
                </a>
                <a href="https://twitter.com/unclecode" target="_blank" class="social-link">
                    <svg viewBox="0 0 24 24" width="16" height="16">
                        <path fill="currentColor" d="M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-5.214-6.817L4.99 21.75H1.68l7.73-8.835L1.254 2.25H8.08l4.713 6.231zm-1.161 17.52h1.833L7.084 4.126H5.117z"/>
                    </svg>
                    <span>@unclecode</span>
                </a>
                <a href="https://discord.gg/jP8KfhDhyN" target="_blank" class="social-link">
                    <svg viewBox="0 0 24 24" width="16" height="16">
                        <path fill="currentColor" d="M19.27 5.33C17.94 4.71 16.5 4.26 15 4a.09.09 0 00-.07.03c-.18.33-.39.76-.53 1.09a16.09 16.09 0 00-4.8 0c-.14-.34-.35-.76-.54-1.09-.01-.02-.04-.03-.07-.03-1.5.26-2.93.71-4.27 1.33-.01 0-.02.01-.03.02-2.72 4.07-3.47 8.03-3.1 11.95 0 .02.01.04.03.05 1.8 1.32 3.53 2.12 5.24 2.65.03.01.06 0 .07-.02.4-.55.76-1.13 1.07-1.74.02-.04 0-.08-.04-.09-.57-.22-1.11-.48-1.64-.78-.04-.02-.04-.08-.01-.11.11-.08.22-.17.33-.25.02-.02.05-.02.07-.01 3.44 1.57 7.15 1.57 10.55 0 .02-.01.05-.01.07.01.11.09.22.17.33.26.04.03.04.09-.01.11-.52.31-1.07.56-1.64.78-.04.01-.05.06-.04.09.32.61.68 1.19 1.07 1.74.03.01.06.02.09.01 1.72-.53 3.45-1.33 5.25-2.65.02-.01.03-.03.03-.05.44-4.53-.73-8.46-3.1-11.95-.01-.01-.02-.02-.04-.02zM8.52 14.91c-1.03 0-1.89-.95-1.89-2.12s.84-2.12 1.89-2.12c1.06 0 1.9.96 1.89 2.12 0 1.17-.84 2.12-1.89 2.12zm6.97 0c-1.03 0-1.89-.95-1.89-2.12s.84-2.12 1.89-2.12c1.06 0 1.9.96 1.89 2.12 0 1.17-.83 2.12-1.89 2.12z"/>
                    </svg>
                    <span>Discord</span>
                </a>
            </div>
        </footer>
    </div>

    <script src="popup.js"></script>
</body>
</html>
146
docs/md_v2/apps/crawl4ai-assistant/popup/popup.js
Normal file
@@ -0,0 +1,146 @@
// Popup script for Crawl4AI Assistant
let activeMode = null;

document.addEventListener('DOMContentLoaded', () => {
    // Fetch GitHub stars
    fetchGitHubStars();

    // Check current state
    chrome.storage.local.get(['captureMode', 'captureStats'], (data) => {
        if (data.captureMode) {
            activeMode = data.captureMode;
            showActiveSession(data.captureStats || {});
        }
    });

    // Mode buttons
    document.getElementById('schema-mode').addEventListener('click', () => {
        startSchemaCapture();
    });

    document.getElementById('script-mode').addEventListener('click', () => {
        startScriptCapture();
    });

    document.getElementById('c2c-mode').addEventListener('click', () => {
        startClick2Crawl();
    });

    // Session actions
    document.getElementById('generate-code').addEventListener('click', () => {
        generateCode();
    });

    document.getElementById('stop-capture').addEventListener('click', () => {
        stopCapture();
    });
});

async function fetchGitHubStars() {
    try {
        const response = await fetch('https://api.github.com/repos/unclecode/crawl4ai');
        const data = await response.json();
        const stars = data.stargazers_count;

        // Format the number (e.g., 1.2k)
        let formattedStars;
        if (stars >= 1000) {
            formattedStars = (stars / 1000).toFixed(1) + 'k';
        } else {
            formattedStars = stars.toString();
        }

        document.getElementById('stars-count').textContent = `⭐ ${formattedStars}`;
    } catch (error) {
        console.error('Failed to fetch GitHub stars:', error);
        document.getElementById('stars-count').textContent = '⭐ 2k+';
    }
}

function startSchemaCapture() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'startSchemaCapture'
        }, (response) => {
            if (response && response.success) {
                // Close the popup to let user interact with the page
                window.close();
            }
        });
    });
}

function startScriptCapture() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'startScriptCapture'
        }, (response) => {
            if (response && response.success) {
                // Close the popup to let user interact with the page
                window.close();
            }
        });
    });
}

function startClick2Crawl() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'startClick2Crawl'
        }, (response) => {
            if (response && response.success) {
                // Close the popup to let user interact with the page
                window.close();
            }
        });
    });
}

function showActiveSession(stats) {
    document.querySelector('.mode-selector').style.display = 'none';
    document.getElementById('active-session').classList.remove('hidden');

    updateSessionStats(stats);
}

function updateSessionStats(stats) {
    document.getElementById('container-status').textContent =
        stats.container ? 'Selected ✓' : 'Not selected';
    document.getElementById('fields-count').textContent = stats.fields || 0;

    // Enable generate button if we have container and fields
    document.getElementById('generate-code').disabled =
        !stats.container || stats.fields === 0;
}

function generateCode() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'generateCode'
        });
    });
}

function stopCapture() {
    chrome.tabs.query({ active: true, currentWindow: true }, (tabs) => {
        chrome.tabs.sendMessage(tabs[0].id, {
            action: 'stopCapture'
        }, () => {
            // Reset UI
            document.querySelector('.mode-selector').style.display = 'flex';
            document.getElementById('active-session').classList.add('hidden');
            activeMode = null;

            // Clear storage
            chrome.storage.local.remove(['captureMode', 'captureStats']);
        });
    });
}

// Listen for stats updates from content script
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
    if (message.action === 'updateStats') {
        updateSessionStats(message.stats);
        chrome.storage.local.set({ captureStats: message.stats });
    }
});
302
docs/md_v2/apps/index.md
Normal file
@@ -0,0 +1,302 @@
# 🚀 Crawl4AI Interactive Apps

Welcome to the Crawl4AI Apps Hub - your gateway to interactive tools and demos that make web scraping more intuitive and powerful.

<style>
.apps-container {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(350px, 1fr));
  gap: 2rem;
  margin: 2rem 0;
}

.app-card {
  background: #3f3f44;
  border: 1px solid #3f3f44;
  border-radius: 8px;
  padding: 1.5rem;
  transition: all 0.3s ease;
  position: relative;
  overflow: hidden;
}

.app-card:hover {
  transform: translateY(-4px);
  box-shadow: 0 8px 16px rgba(0, 0, 0, 0.3);
  border-color: #50ffff;
}

.app-card h3 {
  margin-top: 0;
  display: flex;
  align-items: center;
  gap: 0.5rem;
  color: #e8e9ed;
}

.app-status {
  display: inline-block;
  padding: 0.25rem 0.75rem;
  border-radius: 20px;
  font-size: 0.7rem;
  font-weight: 600;
  text-transform: uppercase;
  margin-bottom: 1rem;
}

.status-available {
  background: #50ffff;
  color: #070708;
}

.status-beta {
  background: #f59e0b;
  color: #070708;
}

.status-coming-soon {
  background: #2a2a2a;
  color: #888;
}

.app-description {
  margin: 1rem 0;
  line-height: 1.6;
  color: #a3abba;
}

.app-features {
  list-style: none;
  padding: 0;
  margin: 1rem 0;
}

.app-features li {
  padding-left: 1.5rem;
  position: relative;
  margin-bottom: 0.5rem;
  color: #d5cec0;
  font-size: 0.9rem;
}

.app-features li:before {
  content: "▸";
  position: absolute;
  left: 0;
  color: #50ffff;
  font-weight: bold;
}

.app-action {
  margin-top: 1.5rem;
}

.app-btn {
  display: inline-block;
  padding: 0.75rem 1.5rem;
  background: #50ffff;
  color: #070708;
  text-decoration: none;
  border-radius: 6px;
  font-weight: 600;
  transition: all 0.2s ease;
  font-family: dm, Monaco, monospace;
}

.app-btn:hover {
  background: #09b5a5;
  transform: scale(1.05);
  color: #070708;
}

.app-btn.disabled {
  background: #2a2a2a;
  color: #666;
  cursor: not-allowed;
  transform: none;
}

.app-btn.disabled:hover {
  background: #2a2a2a;
  transform: none;
}

.intro-section {
  background: #3f3f44;
  border-radius: 8px;
  padding: 2rem;
  margin-bottom: 3rem;
  border: 1px solid #3f3f44;
}

.intro-section h2 {
  margin-top: 0;
  color: #50ffff;
}

.intro-section p {
  color: #d5cec0;
}
</style>

<div class="intro-section">
  <h2>🛠️ Interactive Tools for Modern Web Scraping</h2>
  <p>
    Our apps are designed to make Crawl4AI more accessible and powerful. Whether you're learning browser automation, designing extraction strategies, or building complex scrapers, these tools provide visual, interactive ways to work with Crawl4AI's features.
  </p>
</div>

## 🎯 Available Apps

<div class="apps-container">

<div class="app-card">
  <span class="app-status status-available">Available</span>
  <h3>🎨 C4A-Script Interactive Editor</h3>
  <p class="app-description">
    A visual, block-based programming environment for creating browser automation scripts. Perfect for beginners and experts alike!
  </p>
  <ul class="app-features">
    <li>Drag-and-drop visual programming</li>
    <li>Real-time JavaScript generation</li>
    <li>Interactive tutorials</li>
    <li>Export to C4A-Script or JavaScript</li>
    <li>Live preview capabilities</li>
  </ul>
  <div class="app-action">
    <a href="c4a-script/" class="app-btn" target="_blank">Launch Editor →</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-available">Available</span>
  <h3>🧠 LLM Context Builder</h3>
  <p class="app-description">
    Generate optimized context files for your favorite LLM when working with Crawl4AI. Get focused, relevant documentation based on your needs.
  </p>
  <ul class="app-features">
    <li>Modular context generation</li>
    <li>Memory, reasoning & examples perspectives</li>
    <li>Component-based selection</li>
    <li>Vibe coding preset</li>
    <li>Download custom contexts</li>
  </ul>
  <div class="app-action">
    <a href="llmtxt/" class="app-btn" target="_blank">Launch Builder →</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-coming-soon">Coming Soon</span>
  <h3>🕸️ Web Scraping Playground</h3>
  <p class="app-description">
    Test your scraping strategies on real websites with instant feedback. See how different configurations affect your results.
  </p>
  <ul class="app-features">
    <li>Live website testing</li>
    <li>Side-by-side result comparison</li>
    <li>Performance metrics</li>
    <li>Export configurations</li>
  </ul>
  <div class="app-action">
    <a href="#" class="app-btn disabled">Coming Soon</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-available">Available</span>
  <h3>🔍 Crawl4AI Assistant (Chrome Extension)</h3>
  <p class="app-description">
    Visual schema builder Chrome extension - click on webpage elements to generate extraction schemas and Python code!
  </p>
  <ul class="app-features">
    <li>Visual element selection</li>
    <li>Container & field selection modes</li>
    <li>Smart selector generation</li>
    <li>Complete Python code generation</li>
    <li>One-click installation</li>
  </ul>
  <div class="app-action">
    <a href="crawl4ai-assistant/" class="app-btn">Install Extension →</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-coming-soon">Coming Soon</span>
  <h3>🧪 Extraction Lab</h3>
  <p class="app-description">
    Experiment with different extraction strategies and see how they perform on your content. Compare LLM vs CSS vs XPath approaches.
  </p>
  <ul class="app-features">
    <li>Strategy comparison tools</li>
    <li>Performance benchmarks</li>
    <li>Cost estimation for LLM strategies</li>
    <li>Best practice recommendations</li>
  </ul>
  <div class="app-action">
    <a href="#" class="app-btn disabled">Coming Soon</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-coming-soon">Coming Soon</span>
  <h3>🤖 AI Prompt Designer</h3>
  <p class="app-description">
    Craft and test prompts for LLM-based extraction. See how different prompts affect extraction quality and costs.
  </p>
  <ul class="app-features">
    <li>Prompt templates library</li>
    <li>A/B testing interface</li>
    <li>Token usage calculator</li>
    <li>Quality metrics</li>
  </ul>
  <div class="app-action">
    <a href="#" class="app-btn disabled">Coming Soon</a>
  </div>
</div>

<div class="app-card">
  <span class="app-status status-coming-soon">Coming Soon</span>
  <h3>📊 Crawl Monitor</h3>
  <p class="app-description">
    Real-time monitoring dashboard for your crawling operations. Track performance, debug issues, and optimize your scrapers.
  </p>
  <ul class="app-features">
    <li>Real-time crawl statistics</li>
    <li>Error tracking and debugging</li>
    <li>Resource usage monitoring</li>
    <li>Historical analytics</li>
  </ul>
  <div class="app-action">
    <a href="#" class="app-btn disabled">Coming Soon</a>
  </div>
</div>

</div>

## 🚀 Why Use These Apps?

### 🎯 **Accelerate Learning**
Visual tools help you understand Crawl4AI's concepts faster than reading documentation alone.

### 💡 **Reduce Development Time**
Generate working code instantly instead of writing everything from scratch.

### 🔍 **Improve Quality**
Test and refine your approach before deploying to production.

### 🤝 **Community Driven**
These tools are built based on user feedback. Have an idea? [Let us know](https://github.com/unclecode/crawl4ai/issues)!

## 📢 Stay Updated

Want to know when new apps are released?

- ⭐ [Star us on GitHub](https://github.com/unclecode/crawl4ai) to get notifications
- 🐦 Follow [@unclecode](https://twitter.com/unclecode) for announcements
- 💬 Join our [Discord community](https://discord.gg/crawl4ai) for early access

---

!!! tip "Developer Resources"
    Building your own tools with Crawl4AI? Check out our [API Reference](../api/async-webcrawler.md) and [Integration Guide](../advanced/advanced-features.md) for comprehensive documentation.
75
docs/md_v2/apps/llmtxt/build.md
Normal file
@@ -0,0 +1,75 @@
**Prompt for AI Coding Assistant: Create an Interactive LLM Context Builder Page**

**Objective:**

Your task is to create an interactive HTML webpage with JavaScript functionality that allows users to select and combine different `crawl4ai` LLM context files into a single downloadable Markdown (`.md`) file. This tool will empower users to craft tailored context for their AI assistants based on their specific needs.

**Core Functionality:**

1. **Display `crawl4ai` Components:** The page will list all available `crawl4ai` documentation components.
2. **Select Context Types:** For each component, users can select which types of context they want to include:
    * Memory (API facts)
    * Reasoning (How-to/why)
    * Examples (Code snippets)

    (All should be selected by default for each initially selected component.)
3. **Special "Aggregate" Contexts:** Include options for special, pre-combined contexts:
    * "Vibe Coding" (a curated mix for general AI prompting)
    * "All Library Context" (a comprehensive aggregation of all memory, reasoning, and examples for the entire library)
4. **Fetch and Concatenate:** When the user clicks a "Download Combined Context" button:
    * The JavaScript will fetch the content of all selected Markdown files from the server (from a predefined folder, e.g., `/llmtxt/`).
    * It will concatenate the content of these files into a single string.
5. **Client-Side Download:** The concatenated content will be offered to the user as a download (e.g., `custom_crawl4ai_context.md`).
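The fetch-and-concatenate flow described in points 4 and 5 can be sketched in plain client-side JavaScript. This is a minimal illustration, not the final implementation; the function names (`combineContexts`, `downloadCombined`) and the source-marker comment format are assumptions:

```javascript
// Concatenate fetched context files into one Markdown string,
// separating each file with a marker so sources stay identifiable.
function combineContexts(files) {
  return files
    .map(({ name, content }) => `<!-- Source: ${name} -->\n\n${content.trim()}`)
    .join('\n\n---\n\n');
}

// Browser-only wiring: fetch each selected file from /llmtxt/,
// combine, and trigger a client-side download via an object URL.
async function downloadCombined(fileNames) {
  const files = await Promise.all(
    fileNames.map(async (name) => {
      const res = await fetch(`/llmtxt/${name}`);
      if (!res.ok) throw new Error(`Failed to fetch ${name}`);
      return { name, content: await res.text() };
    })
  );
  const blob = new Blob([combineContexts(files)], { type: 'text/markdown' });
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = 'custom_crawl4ai_context.md';
  a.click();
  URL.revokeObjectURL(a.href);
}
```

Keeping `combineContexts` pure (no DOM or network access) makes the concatenation step easy to unit-test separately from the fetch/download wiring.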
**Input/Assumptions:**

* **Context Files Location:** All individual context Markdown files are located on the server in a publicly accessible folder named `llmtxt/`.
* **File Naming Convention:** Files follow the pattern: `crawl4ai_{{component_name}}_[memory|reasoning|examples]_content.llm.md`.
    * `{{component_name}}` can contain underscores (e.g., `deep_crawling`, `config_objects`).
    * The special contexts will have names like `crawl4ai_vibe_content.llm.md` and `crawl4ai_all_content.llm.md`.
* **Component List:** You will be provided with a list of `crawl4ai` components. For this implementation, use the following list:
    * `core`
    * `config_objects`
    * `deep_crawling`
    * `deployment` (covers Installation & Docker Deployment)
    * `extraction` (covers Structured Data Extraction)
    * `markdown` (covers Markdown Generation Algorithm)
    * `pdf_processing`
    * *(No separate "Vibe Coding" or "All Library Context" in this list, as they are special top-level selections)*
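Mapping the user's selections onto the naming convention above is a small pure function. A sketch (the `contextFileNames` helper and its input shape are illustrative assumptions, not part of the spec):

```javascript
// Build file names for the selected components and context types,
// following the documented pattern:
//   crawl4ai_{component_name}_[memory|reasoning|examples]_content.llm.md
function contextFileNames(selection) {
  const names = [];
  // selection maps component name -> array of selected context types,
  // e.g. { deep_crawling: ['memory', 'examples'] }
  for (const [component, types] of Object.entries(selection)) {
    for (const type of types) {
      names.push(`crawl4ai_${component}_${type}_content.llm.md`);
    }
  }
  return names;
}
```

For example, `contextFileNames({ core: ['memory'] })` yields `['crawl4ai_core_memory_content.llm.md']`, which can then be fetched from `llmtxt/` and concatenated.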
**Detailed UI/UX Requirements:**

1. **Main Page Structure:**
    * **Header:** "Crawl4AI Interactive LLM Context Builder"
    * **Introduction:** Briefly explain the purpose of the tool (from the `USING_LLM_CONTEXTS.md` content you helped draft: "Supercharging Your AI Assistant...").
    * **Selection Area:**
        * **Special Aggregate Contexts (Radio Buttons or Prominent Checkboxes):**
            * [ ] "Vibe Coding Context" (`crawl4ai_vibe_content.llm.md`)
            * [ ] "All Library Context (Comprehensive)" (`crawl4ai_all_content.llm.md`)
            * *Behavior:* Selecting one of these might disable individual component selections (or vice-versa) to avoid redundancy, or simply add them to the list. Consider user experience here. A simple approach is that if an aggregate is selected, it's the *only* thing downloaded.
        * **Individual Component Selection (Table or List of Checkboxes):**
            * A section titled "Select Individual Components & Context Types:"
            * For each component in the provided list:
                * A master checkbox for the component itself (e.g., `[ ] Core Functionality`). Selected by default.
                * Nested checkboxes (indented or grouped) for context types, enabled only if the parent component is checked:
                    * `[x] Memory (API Facts)`
                    * `[x] Reasoning (How-to/Why)`
                    * `[x] Examples (Code Snippets)`

                (These three sub-checkboxes should be selected by default if the parent component is selected.)
    * **Action Button:**
        * A button: "Generate & Download Combined Context"
    * **Status/Feedback Area:** (Optional, but good UX)
        * Display messages like "Fetching files...", "Combining context...", "Download starting..." or error messages.

**Final Output:**

* A single HTML file (e.g., `interactive_context_builder.html`).
* Associated JavaScript code (can be inline within `<script>` tags or in a separate `.js` file).
* Associated CSS code (can be inline within `<style>` tags or in a separate `.css` file).

This interactive tool will greatly enhance the user experience for `crawl4ai` developers looking to leverage your specialized LLM contexts. Please ensure the JavaScript is robust and provides good user feedback.

---

This prompt should give your AI coding assistant a very clear set of requirements and guidelines for building the interactive context builder. Remember to provide it with the list of components as mentioned in the "Input/Assumptions" section.
135
docs/md_v2/apps/llmtxt/index.html
Normal file
@@ -0,0 +1,135 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Crawl4AI LLM Context Builder</title>
  <link rel="stylesheet" href="llmtxt.css">
</head>
<body>
  <div class="terminal-container">
    <div class="header">
      <div class="header-content">
        <div class="logo-section">
          <img src="../../img/favicon-32x32.png" alt="Crawl4AI Logo" class="logo">
          <div>
            <h1>Crawl4AI LLM Context Builder</h1>
            <p class="tagline">Multi-Dimensional Context for AI Assistants</p>
          </div>
        </div>
        <nav class="nav-links">
          <a href="../../" class="nav-link">← Back to Docs</a>
          <a href="../" class="nav-link">All Apps</a>
          <a href="https://github.com/unclecode/crawl4ai" class="nav-link" target="_blank">GitHub</a>
        </nav>
      </div>
    </div>

    <div class="content">

      <section class="intro">
        <div class="intro-header">
          <h2>🧠 A New Approach to LLM Context</h2>
          <p>
            Traditional <code>llm.txt</code> files often fail with complex libraries like Crawl4AI. They dump massive amounts of API documentation, causing <strong>information overload</strong> and <strong>lost focus</strong>. They provide the "what" but miss the crucial "how" and "why" that makes AI assistants truly helpful.
          </p>
        </div>

        <div class="intro-solution">
          <h3>💡 The Solution: Multi-Dimensional, Modular Contexts</h3>
          <p>
            Inspired by modular libraries like Lodash, I've redesigned how we provide context to AI assistants. Instead of one monolithic file, Crawl4AI's documentation is organized by <strong>components</strong> and <strong>perspectives</strong>.
          </p>

          <div class="dimensions">
            <div class="dimension">
              <span class="badge memory">Memory</span>
              <h4>The "What"</h4>
              <p>Precise API facts, parameters, signatures, and configuration objects. Your unambiguous reference.</p>
            </div>
            <div class="dimension">
              <span class="badge reasoning">Reasoning</span>
              <h4>The "How" &amp; "Why"</h4>
              <p>Design principles, best practices, trade-offs, and workflows. Teaches AI to think like an expert.</p>
            </div>
            <div class="dimension">
              <span class="badge examples">Examples</span>
              <h4>The "Show Me"</h4>
              <p>Runnable code snippets demonstrating patterns in action. Pure practical implementation.</p>
            </div>
          </div>
        </div>

        <div class="intro-benefits">
          <p>
            <strong>Why this matters:</strong> You can now give your AI assistant exactly what it needs - whether that's quick API lookups, help designing solutions, or seeing practical implementations. No more information overload, just focused, relevant context.
          </p>
          <p class="learn-more">
            <a href="/blog/articles/llm-context-revolution" class="learn-more-link" target="_parent">📖 Read the full story behind this approach →</a>
          </p>
        </div>
      </section>

      <section class="builder">
        <div class="component-selector" id="component-selector">
          <h2>Select Components & Context Types</h2>
          <div class="select-all-controls">
            <button class="btn-small" id="select-all">Select All</button>
            <button class="btn-small" id="deselect-all">Deselect All</button>
          </div>

          <div class="component-table-wrapper">
            <table class="component-selection-table">
              <thead>
                <tr>
                  <th width="50"></th>
                  <th>Component</th>
                  <th class="clickable-header" data-type="memory">Memory<br><span class="header-subtitle">Full Content</span></th>
                  <th class="clickable-header" data-type="reasoning">Reasoning<br><span class="header-subtitle">Diagrams</span></th>
                  <th class="clickable-header" data-type="examples">Examples<br><span class="header-subtitle">Code</span></th>
                </tr>
              </thead>
              <tbody id="components-tbody">
                <!-- Components will be dynamically inserted here -->
              </tbody>
            </table>
          </div>
        </div>

        <div class="action-area">
          <div class="token-summary" id="token-summary">
            <span class="token-label">Estimated Tokens:</span>
            <span class="token-count" id="total-tokens">0</span>
          </div>
          <button class="download-btn" id="download-btn">
            <span class="icon">⬇</span> Generate & Download Context
          </button>
          <div class="status" id="status"></div>
        </div>
      </section>

      <section class="reference-table">
        <h2>Available Context Files</h2>
        <div class="table-wrapper">
          <table class="context-table">
            <thead>
              <tr>
                <th>Component</th>
                <th>Memory</th>
                <th>Reasoning</th>
                <th>Examples</th>
                <th>Full</th>
              </tr>
            </thead>
            <tbody id="reference-table-body">
              <!-- Table rows will be dynamically inserted here -->
            </tbody>
          </table>
        </div>
      </section>
    </div>
  </div>

  <script src="llmtxt.js"></script>
</body>
</html>
603
docs/md_v2/apps/llmtxt/llmtxt.css
Normal file
@@ -0,0 +1,603 @@
/* Terminal Theme CSS for LLM Context Builder */

/* Font Face Definitions */
@font-face {
  font-family: 'Dank Mono';
  src: url('../assets/DankMono-Regular.woff2') format('woff2');
  font-weight: 400;
  font-style: normal;
  font-display: swap;
}

@font-face {
  font-family: 'Dank Mono';
  src: url('../assets/DankMono-Bold.woff2') format('woff2');
  font-weight: 700;
  font-style: normal;
  font-display: swap;
}

@font-face {
  font-family: 'Dank Mono';
  src: url('../assets/DankMono-Italic.woff2') format('woff2');
  font-weight: 400;
  font-style: italic;
  font-display: swap;
}

:root {
  --background-color: #070708;
  --font-color: #e8e9ed;
  --primary-color: #50ffff;
  --primary-dimmed: #09b5a5;
  --secondary-color: #d5cec0;
  --tertiary-color: #a3abba;
  --accent-color: rgb(243, 128, 245);
  --error-color: #ff3c74;
  --code-bg-color: #3f3f44;
  --border-color: #3f3f44;
  --hover-bg: #1a1a1c;
  --success-color: #50ff50;
  --bg-secondary: #1a1a1a;
  --primary-green: #0fbbaa;
  --font-primary: 'Dank Mono', dm, Monaco, 'Courier New', monospace;
}

* {
  box-sizing: border-box;
}

body {
  margin: 0;
  padding: 0;
  font-family: var(--font-primary);
  font-size: 14px;
  line-height: 1.5;
  background-color: var(--background-color);
  color: var(--font-color);
}

/* Terminal Container */
.terminal-container {
  min-height: 100vh;
  display: flex;
  flex-direction: column;
}

/* Header - matching assistant layout */
.header {
  background: var(--bg-secondary);
  border-bottom: 1px solid var(--border-color);
  padding: 1.5rem 0;
  position: sticky;
  top: 0;
  z-index: 100;
  backdrop-filter: blur(10px);
  background: rgba(26, 26, 26, 0.95);
}

.header-content {
  max-width: 1200px;
  margin: 0 auto;
  padding: 0 2rem;
  display: flex;
  justify-content: space-between;
  align-items: center;
}

.logo-section {
  display: flex;
  align-items: center;
  gap: 1rem;
}

.logo {
  width: 48px;
  height: 48px;
}

.logo-section h1 {
  font-size: 1.75rem;
  font-weight: 700;
  color: var(--font-color);
  margin: 0;
}

.tagline {
  font-size: 0.875rem;
  color: var(--tertiary-color);
  margin-top: 0.25rem;
}

.nav-links {
  display: flex;
  gap: 2rem;
}

.nav-link {
  color: var(--tertiary-color);
  text-decoration: none;
  font-size: 0.875rem;
  transition: color 0.2s ease;
}

.nav-link:hover {
  color: var(--primary-green);
}

/* Content */
.content {
  flex: 1;
  max-width: 1200px;
  margin: 0 auto;
  padding: 2rem;
  width: 100%;
}

/* Intro Section */
.intro {
  background-color: var(--code-bg-color);
  border: 1px solid var(--border-color);
  padding: 30px;
  margin-bottom: 30px;
}

.intro-header h2 {
  color: var(--primary-color);
  margin: 0 0 15px 0;
  font-size: 20px;
}

.intro-header p {
  line-height: 1.6;
  margin-bottom: 0;
}

.intro-header code {
  background-color: var(--hover-bg);
  padding: 2px 6px;
  color: var(--primary-dimmed);
}

.intro-solution {
  margin-top: 5px;
  padding-top: 25px;
  border-top: 1px dashed var(--border-color);
}

.intro-solution h3 {
  color: var(--secondary-color);
  margin: 0 0 15px 0;
  font-size: 18px;
}

.dimensions {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
  gap: 20px;
  margin: 20px 0;
}

.dimension {
  background-color: var(--hover-bg);
  padding: 20px;
  border: 1px solid var(--border-color);
  transition: all 0.2s ease;
}

.dimension:hover {
  border-color: var(--primary-dimmed);
}

.dimension h4 {
  color: var(--font-color);
  margin: 10px 0 8px 0;
  font-size: 16px;
}

.dimension p {
  font-size: 13px;
  line-height: 1.5;
  color: var(--tertiary-color);
  margin: 0;
}

.intro-benefits {
  margin-top: 0;
  padding-top: 0;
  border-top: 1px dashed var(--border-color);
}

.intro-benefits strong {
  color: var(--primary-color);
}

.learn-more {
  margin-top: 15px;
}

.learn-more-link {
  color: var(--primary-dimmed);
  text-decoration: none;
  font-weight: bold;
  transition: color 0.2s ease;
}

.learn-more-link:hover {
  color: var(--primary-color);
  text-decoration: underline;
}

.badge {
  display: inline-block;
  padding: 2px 8px;
  font-size: 12px;
  text-transform: uppercase;
  margin-right: 8px;
}

.badge.memory {
  background-color: var(--primary-dimmed);
  color: var(--background-color);
}

.badge.reasoning {
  background-color: var(--accent-color);
  color: var(--background-color);
}

.badge.examples {
  background-color: var(--secondary-color);
  color: var(--background-color);
}

/* Builder Section */
.builder {
  margin-bottom: 40px;
}

.builder h2 {
  color: var(--primary-color);
  font-size: 18px;
  margin-bottom: 20px;
  text-transform: uppercase;
}

/* Preset Options */
.preset-options {
  display: flex;
  gap: 20px;
  margin-bottom: 30px;
}

.preset-option {
  flex: 1;
  cursor: pointer;
}

.preset-option input[type="radio"] {
  display: none;
}

.preset-card {
  border: 2px solid var(--border-color);
  padding: 20px;
  transition: all 0.2s ease;
  background-color: var(--code-bg-color);
}

.preset-card h3 {
  margin: 0 0 10px 0;
  color: var(--secondary-color);
  font-size: 16px;
}

.preset-card p {
  margin: 0;
  font-size: 12px;
  color: var(--tertiary-color);
}

.preset-option input:checked + .preset-card {
  border-color: var(--primary-color);
  background-color: var(--hover-bg);
}

.preset-card:hover {
  border-color: var(--primary-dimmed);
}

/* Component Selector */
.component-selector {
  margin-bottom: 30px;
}

.select-all-controls {
  margin-bottom: 20px;
}

.btn-small {
  background-color: var(--code-bg-color);
  color: var(--font-color);
  border: 1px solid var(--border-color);
  padding: 5px 15px;
  margin-right: 10px;
  cursor: pointer;
  font-family: inherit;
  font-size: 12px;
  text-transform: uppercase;
  transition: all 0.2s ease;
}

.btn-small:hover {
  background-color: var(--primary-dimmed);
  color: var(--background-color);
}

/* Component Selection Table */
.component-table-wrapper {
  overflow-x: auto;
  border: 1px solid var(--border-color);
  margin-top: 20px;
}

.component-selection-table {
  width: 100%;
  border-collapse: collapse;
  background-color: var(--code-bg-color);
}

.component-selection-table th,
.component-selection-table td {
  padding: 12px;
  text-align: left;
  border-bottom: 1px solid var(--border-color);
}

.component-selection-table th {
  background-color: var(--hover-bg);
  color: var(--primary-color);
  text-transform: uppercase;
  font-size: 12px;
  letter-spacing: 1px;
  font-weight: bold;
}

.header-subtitle {
  font-size: 10px;
  color: var(--tertiary-color);
  text-transform: none;
  font-weight: normal;
  display: block;
  margin-top: 2px;
}

.component-selection-table th.clickable-header {
  cursor: pointer;
  user-select: none;
  transition: background-color 0.2s ease;
}

.component-selection-table th.clickable-header:hover {
  background-color: var(--primary-dimmed);
  color: var(--background-color);
}

.component-selection-table th.clickable-header[data-type="examples"] {
  cursor: default;
  opacity: 0.5;
}

.component-selection-table th.clickable-header[data-type="examples"]:hover {
  background-color: var(--hover-bg);
  color: var(--primary-color);
}

.component-selection-table th:nth-child(3),
.component-selection-table th:nth-child(4),
.component-selection-table th:nth-child(5) {
  text-align: center;
  width: 120px;
}

.component-selection-table td {
  font-size: 13px;
}

.component-selection-table td:nth-child(3),
.component-selection-table td:nth-child(4),
.component-selection-table td:nth-child(5) {
  text-align: center;
}

.component-selection-table tr:hover td {
  background-color: var(--hover-bg);
}

.component-name {
  color: var(--primary-color);
  font-weight: bold;
}

/* Token display in table cells */
.token-info {
  display: block;
  font-size: 11px;
  color: var(--tertiary-color);
  margin-top: 2px;
}

.component-selection-table input[type="checkbox"] {
  cursor: pointer;
  width: 16px;
  height: 16px;
}

.component-selection-table input[type="checkbox"]:disabled {
  cursor: not-allowed;
  opacity: 0.3;
}

/* Disabled row state */
.component-selection-table tr.disabled td:not(:first-child) {
  opacity: 0.5;
  pointer-events: none;
}

/* Action Area */
.action-area {
  text-align: center;
  margin: 40px 0;
}

/* Token Summary */
.token-summary {
  margin-bottom: 20px;
  font-size: 16px;
}

.token-label {
  color: var(--tertiary-color);
  margin-right: 10px;
}

.token-count {
  color: var(--primary-color);
  font-weight: bold;
  font-size: 20px;
}

.token-count::after {
  content: " est.";
  font-size: 12px;
  color: var(--tertiary-color);
  margin-left: 4px;
}

.download-btn {
  background-color: var(--primary-dimmed);
  color: var(--background-color);
  border: none;
  padding: 15px 40px;
  font-size: 16px;
  font-family: inherit;
  cursor: pointer;
  text-transform: uppercase;
  letter-spacing: 1px;
  transition: all 0.2s ease;
  display: inline-flex;
  align-items: center;
  gap: 10px;
}

.download-btn:hover {
  background-color: var(--primary-color);
  transform: translateY(-2px);
}

.download-btn .icon {
  font-size: 20px;
}

.status {
  margin-top: 20px;
  font-size: 14px;
  min-height: 30px;
}

.status.loading {
  color: var(--primary-color);
}

.status.success {
  color: var(--success-color);
}

.status.error {
  color: var(--error-color);
}

/* Reference Table */
.reference-table {
  margin-top: 60px;
}

.reference-table h2 {
  color: var(--primary-color);
  font-size: 18px;
  margin-bottom: 20px;
  text-transform: uppercase;
}

.table-wrapper {
  overflow-x: auto;
  border: 1px solid var(--border-color);
}

.context-table {
  width: 100%;
  border-collapse: collapse;
  background-color: var(--code-bg-color);
}

.context-table th,
.context-table td {
  padding: 12px;
  text-align: left;
  border-bottom: 1px solid var(--border-color);
}

.context-table th {
  background-color: var(--hover-bg);
  color: var(--primary-color);
  text-transform: uppercase;
  font-size: 12px;
  letter-spacing: 1px;
}

.context-table td {
  font-size: 13px;
}

.context-table tr:hover td {
  background-color: var(--hover-bg);
}

.file-link {
  color: var(--primary-dimmed);
  text-decoration: none;
}

.file-link:hover {
  color: var(--primary-color);
  text-decoration: underline;
}

.file-size {
  color: var(--tertiary-color);
  font-size: 11px;
}

/* Responsive Design */
@media (max-width: 768px) {
  .header-content {
    flex-direction: column;
    gap: 1.5rem;
  }

  .nav-links {
    gap: 1rem;
  }

  .preset-options {
    flex-direction: column;
  }

  .components-grid {
    grid-template-columns: 1fr;
  }

  .content {
    padding: 1rem;
  }
}
580
docs/md_v2/apps/llmtxt/llmtxt.js
Normal file
@@ -0,0 +1,580 @@
// Crawl4AI LLM Context Builder JavaScript

// Component definitions - order matters
const components = [
    {
        id: 'installation',
        name: 'Installation',
        description: 'Setup and installation options'
    },
    {
        id: 'simple_crawling',
        name: 'Simple Crawling',
        description: 'Basic web crawling operations'
    },
    {
        id: 'config_objects',
        name: 'Configuration Objects',
        description: 'Browser and crawler configuration'
    },
    {
        id: 'extraction-llm',
        name: 'Data Extraction Using LLM',
        description: 'Structured data extraction strategies using LLMs'
    },
    {
        id: 'extraction-no-llm',
        name: 'Data Extraction Without LLM',
        description: 'Structured data extraction strategies without LLMs'
    },
    {
        id: 'multi_urls_crawling',
        name: 'Multi URLs Crawling',
        description: 'Crawling multiple URLs efficiently'
    },
    {
        id: 'deep_crawling',
        name: 'Deep Crawling',
        description: 'Multi-page crawling strategies'
    },
    {
        id: 'docker',
        name: 'Docker',
        description: 'Docker deployment and configuration'
    },
    {
        id: 'cli',
        name: 'CLI',
        description: 'Command-line interface usage'
    },
    {
        id: 'http_based_crawler_strategy',
        name: 'HTTP-based Crawler',
        description: 'HTTP crawler strategy implementation'
    },
    {
        id: 'url_seeder',
        name: 'URL Seeder',
        description: 'URL seeding and discovery'
    },
    {
        id: 'deep_crawl_advanced_filters_scorers',
        name: 'Advanced Filters & Scorers',
        description: 'Deep crawl filtering and scoring'
    }
];

// Context types
const contextTypes = ['memory', 'reasoning', 'examples'];

// State management
const state = {
    selectedComponents: new Set(),
    selectedContextTypes: new Map(),
    tokenCounts: new Map() // Store token counts for each file
};

// Initialize the application
document.addEventListener('DOMContentLoaded', () => {
    renderComponents();
    renderReferenceTable();
    setupActionHandlers();
    setupColumnHeaderHandlers();

    // Initialize first component as selected with available context types
    const firstComponent = components[0];
    state.selectedComponents.add(firstComponent.id);
    state.selectedContextTypes.set(firstComponent.id, new Set(['memory', 'reasoning']));
    updateComponentUI();
});

// Helper function to estimate tokens (word count × 2.5)
function estimateTokens(text) {
    if (!text) return 0;
    const words = text.trim().split(/\s+/).length;
    return Math.round(words * 2.5);
}
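The word-count heuristic above is easy to check in isolation. A minimal sketch (the function is duplicated here so the snippet runs outside the page context):

```javascript
// Standalone copy of the token estimator: roughly 2.5 tokens per whitespace-separated word.
function estimateTokens(text) {
    if (!text) return 0;
    const words = text.trim().split(/\s+/).length;
    return Math.round(words * 2.5);
}

console.log(estimateTokens(''));                    // 0: empty input short-circuits
console.log(estimateTokens('one two three four'));  // 10: 4 words × 2.5
```

Note that this is a deliberate over-estimate for plain English (real tokenizers average closer to 1.3 tokens per word), which keeps the displayed budget conservative.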

// Update total token count display
function updateTotalTokenCount() {
    let totalTokens = 0;

    state.selectedComponents.forEach(compId => {
        const types = state.selectedContextTypes.get(compId);
        if (types) {
            types.forEach(type => {
                const key = `${compId}-${type}`;
                totalTokens += state.tokenCounts.get(key) || 0;
            });
        }
    });

    document.getElementById('total-tokens').textContent = totalTokens.toLocaleString();
}
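The aggregation above depends only on the `state` shape, so it can be exercised without a DOM. A sketch (the `sumSelectedTokens` name is mine, not part of the page script):

```javascript
// Same walk as updateTotalTokenCount, but returning the sum instead of writing to the DOM.
function sumSelectedTokens(state) {
    let total = 0;
    state.selectedComponents.forEach(compId => {
        const types = state.selectedContextTypes.get(compId);
        if (types) {
            types.forEach(type => {
                total += state.tokenCounts.get(`${compId}-${type}`) || 0;
            });
        }
    });
    return total;
}

const demoState = {
    selectedComponents: new Set(['cli']),
    selectedContextTypes: new Map([['cli', new Set(['memory', 'reasoning'])]]),
    tokenCounts: new Map([['cli-memory', 1200], ['cli-reasoning', 800], ['cli-examples', 500]])
};
console.log(sumSelectedTokens(demoState)); // 2000: 'examples' is not selected, so its 500 is skipped
```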

// Render component selection table
function renderComponents() {
    const tbody = document.getElementById('components-tbody');
    tbody.innerHTML = '';

    components.forEach(component => {
        const row = createComponentRow(component);
        tbody.appendChild(row);
    });

    // Fetch token counts for all files
    fetchAllTokenCounts();
}

// Create a component table row
function createComponentRow(component) {
    const tr = document.createElement('tr');
    tr.id = `component-${component.id}`;

    // Component checkbox cell
    const checkboxCell = document.createElement('td');
    checkboxCell.innerHTML = `
        <input type="checkbox" id="check-${component.id}"
               data-component="${component.id}">
    `;
    tr.appendChild(checkboxCell);

    // Component name cell
    const nameCell = document.createElement('td');
    nameCell.innerHTML = `<span class="component-name">${component.name}</span>`;
    tr.appendChild(nameCell);

    // Context type cells
    contextTypes.forEach(type => {
        const td = document.createElement('td');
        const key = `${component.id}-${type}`;
        const tokenCount = state.tokenCounts.get(key) || 0;
        const isDisabled = type === 'examples' ? 'disabled' : '';

        td.innerHTML = `
            <input type="checkbox" id="check-${component.id}-${type}"
                   data-component="${component.id}" data-type="${type}"
                   ${isDisabled}>
            <span class="token-info" id="tokens-${component.id}-${type}">
                ${tokenCount > 0 ? `${tokenCount.toLocaleString()} tokens` : ''}
            </span>
        `;
        tr.appendChild(td);
    });

    // Add event listener for the main component checkbox
    const mainCheckbox = tr.querySelector(`#check-${component.id}`);
    mainCheckbox.addEventListener('change', (e) => {
        handleComponentToggle(component.id, e.target.checked);
    });

    // Add event listeners for context type checkboxes
    contextTypes.forEach(type => {
        const typeCheckbox = tr.querySelector(`#check-${component.id}-${type}`);
        if (!typeCheckbox.disabled) {
            typeCheckbox.addEventListener('change', (e) => {
                handleContextTypeToggle(component.id, type, e.target.checked);
            });
        }
    });

    return tr;
}

// Handle component checkbox toggle
function handleComponentToggle(componentId, checked) {
    if (checked) {
        state.selectedComponents.add(componentId);
        // Selecting a component selects all currently available context types
        // ('examples' files are not published yet, so they stay unchecked)
        state.selectedContextTypes.set(componentId, new Set(['memory', 'reasoning']));
    } else {
        state.selectedComponents.delete(componentId);
        state.selectedContextTypes.delete(componentId);
    }
    updateComponentUI();
}

// Keep component selection in sync with its context types
function updateComponentSelection(componentId) {
    const types = state.selectedContextTypes.get(componentId) || new Set();
    if (types.size > 0) {
        state.selectedComponents.add(componentId);
    } else {
        state.selectedComponents.delete(componentId);
    }
}

// Handle context type checkbox toggle
function handleContextTypeToggle(componentId, type, checked) {
    if (!state.selectedContextTypes.has(componentId)) {
        state.selectedContextTypes.set(componentId, new Set());
    }

    const types = state.selectedContextTypes.get(componentId);
    if (checked) {
        types.add(type);
    } else {
        types.delete(type);
    }

    updateComponentSelection(componentId);
    updateComponentUI();
}

// Update UI to reflect current state
function updateComponentUI() {
    components.forEach(component => {
        const row = document.getElementById(`component-${component.id}`);
        if (!row) return;

        const mainCheckbox = row.querySelector(`#check-${component.id}`);
        const hasSelection = state.selectedComponents.has(component.id);
        const selectedTypes = state.selectedContextTypes.get(component.id) || new Set();

        // Update main checkbox
        mainCheckbox.checked = hasSelection;

        // Update row disabled state
        row.classList.toggle('disabled', !hasSelection);

        // Update context type checkboxes
        contextTypes.forEach(type => {
            const typeCheckbox = row.querySelector(`#check-${component.id}-${type}`);
            typeCheckbox.checked = selectedTypes.has(type);
        });
    });

    updateTotalTokenCount();
}

// Fetch token counts for all files
async function fetchAllTokenCounts() {
    const promises = [];

    components.forEach(component => {
        contextTypes.forEach(type => {
            promises.push(fetchTokenCount(component.id, type));
        });
    });

    await Promise.all(promises);
    updateComponentUI();
    renderReferenceTable(); // Update reference table with token counts
}

// Fetch token count for a specific file
async function fetchTokenCount(componentId, type) {
    const key = `${componentId}-${type}`;

    try {
        const fileName = getFileName(componentId, type);
        const baseUrl = getBaseUrl(type);
        const response = await fetch(baseUrl + fileName);

        if (response.ok) {
            const content = await response.text();
            const tokens = estimateTokens(content);
            state.tokenCounts.set(key, tokens);

            // Update UI
            const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
            if (tokenSpan) {
                tokenSpan.textContent = `${tokens.toLocaleString()} tokens`;
            }
        } else if (type === 'examples') {
            // Examples might not exist yet
            state.tokenCounts.set(key, 0);
            const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
            if (tokenSpan) {
                tokenSpan.textContent = '';
            }
        }
    } catch (error) {
        console.warn(`Failed to fetch token count for ${componentId}-${type}`);
        if (type === 'examples') {
            const tokenSpan = document.getElementById(`tokens-${componentId}-${type}`);
            if (tokenSpan) {
                tokenSpan.textContent = '';
            }
        }
    }
}

// Get file name based on component and type
// (type is currently unused: in the new structure, all files are [componentId].txt)
function getFileName(componentId, type) {
    return `${componentId}.txt`;
}

// Get base URL based on context type
function getBaseUrl(type) {
    // For MkDocs, we need to go up to the root level
    const basePrefix = window.location.pathname.includes('/apps/') ? '../../' : '/';

    switch (type) {
        case 'memory':
            return basePrefix + 'assets/llm.txt/txt/';
        case 'reasoning':
            return basePrefix + 'assets/llm.txt/diagrams/';
        case 'examples':
            return basePrefix + 'assets/llm.txt/examples/'; // Will return 404 for now
        default:
            return basePrefix + 'assets/llm.txt/txt/';
    }
}

// Setup action button handlers
function setupActionHandlers() {
    // Select/Deselect all buttons
    document.getElementById('select-all').addEventListener('click', () => {
        components.forEach(comp => {
            state.selectedComponents.add(comp.id);
            state.selectedContextTypes.set(comp.id, new Set(['memory', 'reasoning']));
        });
        updateComponentUI();
    });

    document.getElementById('deselect-all').addEventListener('click', () => {
        state.selectedComponents.clear();
        state.selectedContextTypes.clear();
        updateComponentUI();
    });

    // Download button
    document.getElementById('download-btn').addEventListener('click', handleDownload);
}

// Setup column header click handlers
function setupColumnHeaderHandlers() {
    const headers = document.querySelectorAll('.clickable-header');
    headers.forEach(header => {
        header.addEventListener('click', () => {
            const type = header.getAttribute('data-type');
            toggleColumnSelection(type);
        });
    });
}

// Toggle all checkboxes in a column
function toggleColumnSelection(type) {
    // Don't toggle examples column
    if (type === 'examples') return;

    // Check if all are currently selected
    let allSelected = true;
    components.forEach(comp => {
        const types = state.selectedContextTypes.get(comp.id);
        if (!types || !types.has(type)) {
            allSelected = false;
        }
    });

    // Toggle all: deselect the column if everything was selected, otherwise select it
    components.forEach(comp => {
        if (!state.selectedContextTypes.has(comp.id)) {
            state.selectedContextTypes.set(comp.id, new Set());
        }

        const types = state.selectedContextTypes.get(comp.id);
        if (allSelected) {
            types.delete(type);
        } else {
            types.add(type);
        }

        updateComponentSelection(comp.id);
    });

    updateComponentUI();
}

// Handle download action
async function handleDownload() {
    const statusEl = document.getElementById('status');
    statusEl.textContent = 'Preparing context files...';
    statusEl.className = 'status loading';

    try {
        const files = getSelectedFiles();
        if (files.length === 0) {
            throw new Error('No files selected. Please select at least one component or preset.');
        }

        statusEl.textContent = `Fetching ${files.length} files...`;

        const contents = await fetchFiles(files);
        const combined = combineContents(contents);

        downloadFile(combined, 'crawl4ai_custom_context.md');

        statusEl.textContent = 'Download complete!';
        statusEl.className = 'status success';

        setTimeout(() => {
            statusEl.textContent = '';
            statusEl.className = 'status';
        }, 3000);

    } catch (error) {
        statusEl.textContent = `Error: ${error.message}`;
        statusEl.className = 'status error';
    }
}

// Get list of selected files based on current state
function getSelectedFiles() {
    const files = [];

    // Build list of selected files with their context info
    state.selectedComponents.forEach(compId => {
        const types = state.selectedContextTypes.get(compId);
        if (types) {
            types.forEach(type => {
                files.push({
                    componentId: compId,
                    type: type,
                    fileName: getFileName(compId, type),
                    baseUrl: getBaseUrl(type)
                });
            });
        }
    });

    return files;
}
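The flattening above turns the two-level selection state into one fetch descriptor per (component, context type) pair. A DOM-free sketch of the same walk (the `selectedPairs` helper and the `compId/type.txt` key format are illustrative, not the page script's API):

```javascript
// Flatten selection state into one entry per (component, context type) pair,
// mirroring getSelectedFiles without the window-dependent URL helpers.
function selectedPairs(state) {
    const pairs = [];
    state.selectedComponents.forEach(compId => {
        const types = state.selectedContextTypes.get(compId);
        if (types) {
            types.forEach(type => pairs.push(`${compId}/${type}.txt`));
        }
    });
    return pairs;
}

const demo = {
    selectedComponents: new Set(['cli', 'docker']),
    selectedContextTypes: new Map([
        ['cli', new Set(['memory'])],
        ['docker', new Set(['memory', 'reasoning'])]
    ])
};
console.log(selectedPairs(demo).length); // 3 file requests for this selection
```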

// Fetch multiple files
async function fetchFiles(fileInfos) {
    const promises = fileInfos.map(async (fileInfo) => {
        try {
            const response = await fetch(fileInfo.baseUrl + fileInfo.fileName);
            if (!response.ok) {
                if (fileInfo.type === 'examples') {
                    return {
                        fileInfo,
                        content: `<!-- Examples for ${fileInfo.componentId} coming soon -->\n\nExamples are currently being developed for this component.`
                    };
                }
                console.warn(`Failed to fetch ${fileInfo.fileName} from ${fileInfo.baseUrl + fileInfo.fileName}`);
                return { fileInfo, content: `<!-- Failed to load ${fileInfo.fileName} -->` };
            }
            const content = await response.text();
            return { fileInfo, content };
        } catch (error) {
            if (fileInfo.type === 'examples') {
                return {
                    fileInfo,
                    content: `<!-- Examples for ${fileInfo.componentId} coming soon -->\n\nExamples are currently being developed for this component.`
                };
            }
            console.warn(`Error fetching ${fileInfo.fileName}:`, error);
            return { fileInfo, content: `<!-- Error loading ${fileInfo.fileName} -->` };
        }
    });

    return Promise.all(promises);
}

// Combine file contents with headers
function combineContents(fileContents) {
    // Calculate total tokens
    let totalTokens = 0;
    fileContents.forEach(({ content }) => {
        totalTokens += estimateTokens(content);
    });

    const header = `# Crawl4AI Custom LLM Context
Generated on: ${new Date().toISOString()}
Total files: ${fileContents.length}
Estimated tokens: ${totalTokens.toLocaleString()}

---

`;

    const sections = fileContents.map(({ fileInfo, content }) => {
        const component = components.find(c => c.id === fileInfo.componentId);
        const componentName = component ? component.name : fileInfo.componentId;
        const contextType = getContextTypeName(fileInfo.type);
        const tokens = estimateTokens(content);

        return `## ${componentName} - ${contextType}
Component ID: ${fileInfo.componentId}
Context Type: ${fileInfo.type}
Estimated tokens: ${tokens.toLocaleString()}

${content}

---

`;
    });

    return header + sections.join('\n');
}
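The file produced by the download is therefore one markdown document: a global header with the generation metadata, followed by a `##` section per selected file. A minimal sketch of that shape with stub data (the `buildSection` helper name is mine, and the section omits the token line for brevity):

```javascript
// Stub section builder following the per-file section layout used by combineContents.
function buildSection(name, id, type, content) {
    return `## ${name} - ${type}\nComponent ID: ${id}\nContext Type: ${type}\n\n${content}\n\n---\n`;
}

const doc = '# Crawl4AI Custom LLM Context\n\n---\n\n' +
    buildSection('CLI', 'cli', 'memory', 'CLI reference text...');

console.log(doc.startsWith('# Crawl4AI Custom LLM Context')); // true
console.log(doc.includes('## CLI - memory'));                 // true
```

Because every section is plain markdown separated by `---` rules, the combined file stays easy for both humans and LLMs to skim.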

// Get display name for context type
function getContextTypeName(type) {
    switch (type) {
        case 'memory': return 'Full Content';
        case 'reasoning': return 'Diagrams & Workflows';
        case 'examples': return 'Code Examples';
        default: return type;
    }
}

// Download file to user's computer
function downloadFile(content, fileName) {
    const blob = new Blob([content], { type: 'text/markdown' });
    const url = URL.createObjectURL(blob);
    const a = document.createElement('a');
    a.href = url;
    a.download = fileName;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);
}

// Render reference table
function renderReferenceTable() {
    const tbody = document.getElementById('reference-table-body');
    tbody.innerHTML = '';

    // Get base path for links
    const basePrefix = window.location.pathname.includes('/apps/') ? '../../' : '/';

    components.forEach(component => {
        const row = document.createElement('tr');
        const memoryTokens = state.tokenCounts.get(`${component.id}-memory`) || 0;
        const reasoningTokens = state.tokenCounts.get(`${component.id}-reasoning`) || 0;
        const examplesTokens = state.tokenCounts.get(`${component.id}-examples`) || 0;

        row.innerHTML = `
            <td><strong>${component.name}</strong></td>
            <td>
                <a href="${basePrefix}assets/llm.txt/txt/${component.id}.txt" class="file-link" target="_blank">Memory</a>
                ${memoryTokens > 0 ? `<span class="file-size">${memoryTokens.toLocaleString()} tokens</span>` : ''}
            </td>
            <td>
                <a href="${basePrefix}assets/llm.txt/diagrams/${component.id}.txt" class="file-link" target="_blank">Reasoning</a>
                ${reasoningTokens > 0 ? `<span class="file-size">${reasoningTokens.toLocaleString()} tokens</span>` : ''}
            </td>
            <td>
                ${examplesTokens > 0
                    ? `<a href="${basePrefix}assets/llm.txt/examples/${component.id}.txt" class="file-link" target="_blank">Examples</a>
                       <span class="file-size">${examplesTokens.toLocaleString()} tokens</span>`
                    : '-'
                }
            </td>
            <td>-</td>
        `;
        tbody.appendChild(row);
    });
}
37
docs/md_v2/apps/llmtxt/why.md
Normal file
@@ -0,0 +1,37 @@
# Supercharging Your AI Assistant: My Journey to Better LLM Contexts for `crawl4ai`

When I started diving deep into using AI coding assistants with my own libraries, particularly `crawl4ai`, I quickly realized that the common approach to providing context via a simple `llm.txt` or even a beefed-up `README.md` just wasn't cutting it. This document explains the problems I encountered and how I've tried to create a more effective system for `crawl4ai`, allowing you (and your AI assistant) to get precisely the right information.

## My Frustration with Standard `llm.txt` Files

My experience with generic `llm.txt` files for complex libraries like `crawl4ai` revealed several pain points:

1. **Information Overload & Lost Focus:** I found that when I threw a massive, monolithic context file at an LLM, it often struggled. The sheer volume of information seemed to dilute its focus. If I asked a specific question about a niche feature, the LLM might get sidetracked by more prominent but currently irrelevant parts of the library. It felt like trying to find a single sentence in a thousand-page novel – the information was *there*, but not always accessible or prioritized correctly by the AI.

2. **The "What" Without the "How" or "Why":** Most `llm.txt` files I encountered were essentially API dumps – a list of functions, classes, and parameters. This is the "what" of a library. But to truly use a library effectively, especially one as flexible as `crawl4ai`, you need the "how" (idiomatic usage patterns, best practices for common tasks) and the "why" (the design rationale behind certain features). Without this, I noticed my AI assistant would often generate syntactically correct but practically inefficient or non-idiomatic code. It was guessing the *intent* and the *best way* to use the library, and those guesses weren't always right.

3. **No Guidance on "Thinking" Like an Expert:** A static list of facts doesn't teach an LLM the *art* of using the library. It doesn't convey the trade-offs an experienced developer considers, the common pitfalls they've learned to avoid, or the clever ways to combine features to solve complex problems. I wanted my AI assistant to not just recall an API, but to help me *reason* about the best way to build a solution with `crawl4ai`.

## Inspiration: Selective Inclusion & Multi-Dimensional Understanding

I've always admired how libraries like Lodash or jQuery (in its modular days) allowed developers to pick and choose only the parts they needed, resulting in smaller, more focused bundles. This idea of modularity and selective inclusion resonated deeply with me as I thought about LLM context. Why force-feed an LLM the entire library's details when I'm only working on a specific component or task?

This led me to develop a new approach for `crawl4ai`: **multi-dimensional, modular contexts**.

Instead of one giant `llm.txt`, I've broken down the `crawl4ai` documentation into:

1. **Logical Components:** Context is organized around the major functional areas of the library (e.g., Core, Data Extraction, Deep Crawling, Markdown Generation, etc.). This allows you to select context relevant only to the task at hand.
2. **Three Dimensions of Context for Each Component:**
    * **`_memory.md` (Foundational Memory):** This is the "what." It contains the precise, factual information about the component's public API, data structures, configuration objects, parameters, and method signatures. It's the detailed, unambiguous reference.
    * **`_reasoning.md` (Reasoning & Problem-Solving Framework):** This is the "how" and "why." It includes design principles, common task workflows with decision guides, best practices, anti-patterns, illustrative code examples solving real problems, and explanations of trade-offs. It aims to guide the LLM in "thinking" like an expert `crawl4ai` user.
    * **`_examples.md` (Practical Code Examples):** This is pure "show-me-the-code." It's a collection of runnable snippets demonstrating various ways to use the component's features and configurations, with minimal explanatory text. It's for quickly seeing different patterns in action.

**The Goal:**
My aim is to provide you with a flexible system. You can give your AI assistant:

* Just the **memory** files for quick API lookups.
* The **reasoning** files (perhaps with memory) for help designing solutions.
* The **examples** files for seeing practical implementations.
* A **combination** of these across one or more components tailored to your specific task.
* Or, for broader understanding, special aggregate contexts like the "Vibe Coding" context or the "All Library Context."

By providing these structured, multi-faceted contexts, I hope to significantly improve the quality and relevance of the assistance you get when using AI to code with `crawl4ai`. The following sections will guide you on how to select and use these different context files.
37
docs/md_v2/assets/feedback-overrides.css
Normal file
@@ -0,0 +1,37 @@
/* docs/assets/feedback-overrides.css */
:root {
    /* brand */
    --feedback-primary-color: #09b5a5;
    --feedback-highlight-color: #fed500; /* stars etc */

    /* modal shell / text */
    --feedback-modal-content-bg-color: var(--background-color);
    --feedback-modal-content-text-color: var(--font-color);
    --feedback-modal-content-border-color: var(--primary-dimmed-color);
    --feedback-modal-content-border-radius: 4px;

    /* overlay */
    --feedback-overlay-bg-color: rgba(0, 0, 0, .75);

    /* rating buttons */
    --feedback-modal-rating-button-color: var(--secondary-color);
    --feedback-modal-rating-button-selected-color: var(--primary-color);

    /* inputs */
    --feedback-modal-input-bg-color: var(--code-bg-color);
    --feedback-modal-input-text-color: var(--font-color);
    --feedback-modal-input-border-color: var(--primary-dimmed-color);
    --feedback-modal-input-border-color-focused: var(--primary-color);

    /* submit / secondary buttons */
    --feedback-modal-button-submit-bg-color: var(--primary-color);
    --feedback-modal-button-submit-bg-color-hover: var(--primary-dimmed-color);
    --feedback-modal-button-submit-text-color: var(--invert-font-color);

    --feedback-modal-button-bg-color: transparent; /* screenshot btn */
    --feedback-modal-button-border-color: var(--primary-color);
    --feedback-modal-button-icon-color: var(--primary-color);
}

/* optional: keep the “Powered by” link subtle */
.feedback-logo a {
    color: var(--secondary-color);
}
5
docs/md_v2/assets/gtag.js
Normal file
@@ -0,0 +1,5 @@
// Google Analytics (gtag.js) bootstrap
window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); }
gtag('js', new Date());

gtag('config', 'G-58W0K2ZQ25');
425
docs/md_v2/assets/llm.txt/diagrams/cli.txt
Normal file
@@ -0,0 +1,425 @@
|
||||
## CLI Workflows and Profile Management
|
||||
|
||||
Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows.
|
||||
|
||||
### CLI Command Flow Architecture
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[crwl command] --> B{Command Type?}
|
||||
|
||||
B -->|URL Crawling| C[Parse URL & Options]
|
||||
B -->|Profile Management| D[profiles subcommand]
|
||||
B -->|CDP Browser| E[cdp subcommand]
|
||||
B -->|Browser Control| F[browser subcommand]
|
||||
B -->|Configuration| G[config subcommand]
|
||||
|
||||
C --> C1{Output Format?}
|
||||
C1 -->|Default| C2[HTML/Markdown]
|
||||
C1 -->|JSON| C3[Structured Data]
|
||||
C1 -->|markdown| C4[Clean Markdown]
|
||||
C1 -->|markdown-fit| C5[Filtered Content]
|
||||
|
||||
C --> C6{Authentication?}
|
||||
C6 -->|Profile Specified| C7[Load Browser Profile]
|
||||
C6 -->|No Profile| C8[Anonymous Session]
|
||||
|
||||
C7 --> C9[Launch with User Data]
|
||||
C8 --> C10[Launch Clean Browser]
|
||||
|
||||
C9 --> C11[Execute Crawl]
|
||||
C10 --> C11
|
||||
|
||||
C11 --> C12{Success?}
|
||||
C12 -->|Yes| C13[Return Results]
|
||||
    C12 -->|No| C14[Error Handling]

    D --> D1[Interactive Profile Menu]
    D1 --> D2{Menu Choice?}
    D2 -->|Create| D3[Open Browser for Setup]
    D2 -->|List| D4[Show Existing Profiles]
    D2 -->|Delete| D5[Remove Profile]
    D2 -->|Use| D6[Crawl with Profile]

    E --> E1[Launch CDP Browser]
    E1 --> E2[Remote Debugging Active]

    F --> F1{Browser Action?}
    F1 -->|start| F2[Start Builtin Browser]
    F1 -->|stop| F3[Stop Builtin Browser]
    F1 -->|status| F4[Check Browser Status]
    F1 -->|view| F5[Open Browser Window]

    G --> G1{Config Action?}
    G1 -->|list| G2[Show All Settings]
    G1 -->|set| G3[Update Setting]
    G1 -->|get| G4[Read Setting]

    style A fill:#e1f5fe
    style C13 fill:#c8e6c9
    style C14 fill:#ffcdd2
    style D3 fill:#fff3e0
    style E2 fill:#f3e5f5
```

### Profile Management Workflow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ProfileManager
    participant Browser
    participant FileSystem

    User->>CLI: crwl profiles
    CLI->>ProfileManager: Initialize profile manager
    ProfileManager->>FileSystem: Scan for existing profiles
    FileSystem-->>ProfileManager: Profile list
    ProfileManager-->>CLI: Show interactive menu
    CLI-->>User: Display options

    Note over User: User selects "Create new profile"

    User->>CLI: Create profile "linkedin-auth"
    CLI->>ProfileManager: create_profile("linkedin-auth")
    ProfileManager->>FileSystem: Create profile directory
    ProfileManager->>Browser: Launch with new user data dir
    Browser-->>User: Opens browser window

    Note over User: User manually logs in to LinkedIn

    User->>Browser: Navigate and authenticate
    Browser->>FileSystem: Save cookies, session data
    User->>CLI: Press 'q' to save profile
    CLI->>ProfileManager: finalize_profile()
    ProfileManager->>FileSystem: Lock profile settings
    ProfileManager-->>CLI: Profile saved
    CLI-->>User: Profile "linkedin-auth" created

    Note over User: Later usage

    User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth
    CLI->>ProfileManager: load_profile("linkedin-auth")
    ProfileManager->>FileSystem: Read profile data
    FileSystem-->>ProfileManager: User data directory
    ProfileManager-->>CLI: Profile configuration
    CLI->>Browser: Launch with existing profile
    Browser-->>CLI: Authenticated session ready
    CLI->>Browser: Navigate to target URL
    Browser-->>CLI: Crawl results with auth context
    CLI-->>User: Authenticated content
```
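The core of the sequence above is that a profile name resolves to a per-profile user-data directory which the browser is launched with. A minimal sketch of that lookup step, with an illustrative directory layout that is not necessarily Crawl4AI's actual on-disk format:

```python
from pathlib import Path

# Hypothetical profile store: one directory per named profile, holding the
# browser's user data (cookies, sessions). The ~/.crawl4ai/profiles layout
# here is an assumption for illustration.
DEFAULT_ROOT = Path.home() / ".crawl4ai" / "profiles"

def create_profile(name: str, root: Path = DEFAULT_ROOT) -> Path:
    """Create the directory a fresh browser session writes its state into."""
    path = Path(root) / name
    path.mkdir(parents=True, exist_ok=True)
    return path

def load_profile(name: str, root: Path = DEFAULT_ROOT) -> Path:
    """Fail fast if the profile was never created and saved."""
    path = Path(root) / name
    if not path.is_dir():
        raise FileNotFoundError(f"No such profile: {name}")
    return path
```

A later `crwl https://linkedin.com/feed -p linkedin-auth` then amounts to loading that directory and launching the browser with it, so the saved session rides along.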

### Browser Management State Machine

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial state

    Stopped --> Starting: crwl browser start
    Starting --> Running: Browser launched
    Running --> Viewing: crwl browser view
    Viewing --> Running: Close window
    Running --> Stopping: crwl browser stop
    Stopping --> Stopped: Cleanup complete

    Running --> Restarting: crwl browser restart
    Restarting --> Running: New browser instance

    Stopped --> CDP_Mode: crwl cdp
    CDP_Mode --> CDP_Running: Remote debugging active
    CDP_Running --> CDP_Mode: Manual close
    CDP_Mode --> Stopped: Exit CDP

    Running --> StatusCheck: crwl browser status
    StatusCheck --> Running: Return status

    note right of Running : Port 9222 active\nBuiltin browser available
    note right of CDP_Running : Remote debugging\nManual control enabled
    note right of Viewing : Visual browser window\nDirect interaction
```

### Authentication Workflow for Protected Sites

```mermaid
flowchart TD
    A[Protected Site Access Needed] --> B[Create Profile Strategy]

    B --> C{Existing Profile?}
    C -->|Yes| D[Test Profile Validity]
    C -->|No| E[Create New Profile]

    D --> D1{Profile Valid?}
    D1 -->|Yes| F[Use Existing Profile]
    D1 -->|No| E

    E --> E1[crwl profiles]
    E1 --> E2[Select Create New Profile]
    E2 --> E3[Enter Profile Name]
    E3 --> E4[Browser Opens for Auth]

    E4 --> E5{Authentication Method?}
    E5 -->|Login Form| E6[Fill Username/Password]
    E5 -->|OAuth| E7[OAuth Flow]
    E5 -->|2FA| E8[Handle 2FA]
    E5 -->|Session Cookie| E9[Import Cookies]

    E6 --> E10[Manual Login Process]
    E7 --> E10
    E8 --> E10
    E9 --> E10

    E10 --> E11[Verify Authentication]
    E11 --> E12{Auth Successful?}
    E12 -->|Yes| E13[Save Profile - Press q]
    E12 -->|No| E10

    E13 --> F
    F --> G[Execute Authenticated Crawl]

    G --> H[crwl URL -p profile-name]
    H --> I[Load Profile Data]
    I --> J[Launch Browser with Auth]
    J --> K[Navigate to Protected Content]
    K --> L[Extract Authenticated Data]
    L --> M[Return Results]

    style E4 fill:#fff3e0
    style E10 fill:#e3f2fd
    style F fill:#e8f5e8
    style M fill:#c8e6c9
```

### CDP Browser Architecture

```mermaid
graph TB
    subgraph "CLI Layer"
        A[crwl cdp command] --> B[CDP Manager]
        B --> C[Port Configuration]
        B --> D[Profile Selection]
    end

    subgraph "Browser Process"
        E[Chromium/Firefox] --> F[Remote Debugging]
        F --> G[WebSocket Endpoint]
        G --> H[ws://localhost:9222]
    end

    subgraph "Client Connections"
        I[Manual Browser Control] --> H
        J[DevTools Interface] --> H
        K[External Automation] --> H
        L[Crawl4AI Crawler] --> H
    end

    subgraph "Profile Data"
        M[User Data Directory] --> E
        N[Cookies & Sessions] --> M
        O[Extensions] --> M
        P[Browser State] --> M
    end

    A --> E
    C --> H
    D --> M

    style H fill:#e3f2fd
    style E fill:#f3e5f5
    style M fill:#e8f5e8
```

### Configuration Management Hierarchy

```mermaid
graph TD
    subgraph "Global Configuration"
        A[~/.crawl4ai/config.yml] --> B[Default Settings]
        B --> C[LLM Providers]
        B --> D[Browser Defaults]
        B --> E[Output Preferences]
    end

    subgraph "Profile Configuration"
        F[Profile Directory] --> G[Browser State]
        F --> H[Authentication Data]
        F --> I[Site-Specific Settings]
    end

    subgraph "Command-Line Overrides"
        J[-b browser_config] --> K[Runtime Browser Settings]
        L[-c crawler_config] --> M[Runtime Crawler Settings]
        N[-o output_format] --> O[Runtime Output Format]
    end

    subgraph "Configuration Files"
        P[browser.yml] --> Q[Browser Config Template]
        R[crawler.yml] --> S[Crawler Config Template]
        T[extract.yml] --> U[Extraction Config]
    end

    subgraph "Resolution Order"
        V[Command Line Args] --> W[Config Files]
        W --> X[Profile Settings]
        X --> Y[Global Defaults]
    end

    J --> V
    L --> V
    N --> V
    P --> W
    R --> W
    T --> W
    F --> X
    A --> Y

    style V fill:#ffcdd2
    style W fill:#fff3e0
    style X fill:#e3f2fd
    style Y fill:#e8f5e8
```
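The resolution order in the diagram (command-line args over config files over profile settings over global defaults) is a classic layered lookup. A minimal sketch using `collections.ChainMap`, which searches its maps left to right; the setting names and values here are illustrative, not Crawl4AI's actual keys:

```python
from collections import ChainMap

# Layers listed highest-precedence first, mirroring the "Resolution Order"
# subgraph above. All keys/values are illustrative examples.
cli_args         = {"output_format": "json"}
config_files     = {"headless": True, "output_format": "markdown"}
profile_settings = {"user_data_dir": "~/.crawl4ai/profiles/linkedin-auth"}
global_defaults  = {"headless": False, "verbose": False, "output_format": "markdown"}

effective = ChainMap(cli_args, config_files, profile_settings, global_defaults)

print(effective["output_format"])  # "json" - CLI flag wins
print(effective["headless"])       # True - config file overrides the global default
print(effective["verbose"])        # False - falls through to global defaults
```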

### Identity-Based Crawling Decision Tree

```mermaid
flowchart TD
    A[Target Website Assessment] --> B{Authentication Required?}

    B -->|No| C[Standard Anonymous Crawl]
    B -->|Yes| D{Authentication Type?}

    D -->|Login Form| E[Create Login Profile]
    D -->|OAuth/SSO| F[Create OAuth Profile]
    D -->|API Key/Token| G[Use Headers/Config]
    D -->|Session Cookies| H[Import Cookie Profile]

    E --> E1[crwl profiles → Manual login]
    F --> F1[crwl profiles → OAuth flow]
    G --> G1[Configure headers in crawler config]
    H --> H1[Import cookies to profile]

    E1 --> I[Test Authentication]
    F1 --> I
    G1 --> I
    H1 --> I

    I --> J{Auth Test Success?}
    J -->|Yes| K[Production Crawl Setup]
    J -->|No| L[Debug Authentication]

    L --> L1{Common Issues?}
    L1 -->|Rate Limiting| L2[Add delays/user simulation]
    L1 -->|Bot Detection| L3[Enable stealth mode]
    L1 -->|Session Expired| L4[Refresh authentication]
    L1 -->|CAPTCHA| L5[Manual intervention needed]

    L2 --> M[Retry with Adjustments]
    L3 --> M
    L4 --> E1
    L5 --> N[Semi-automated approach]

    M --> I
    N --> O[Manual auth + automated crawl]

    K --> P[Automated Authenticated Crawling]
    O --> P
    C --> P

    P --> Q[Monitor & Maintain Profiles]

    style I fill:#fff3e0
    style K fill:#e8f5e8
    style P fill:#c8e6c9
    style L fill:#ffcdd2
    style N fill:#f3e5f5
```

### CLI Usage Patterns and Best Practices

```mermaid
timeline
    title CLI Workflow Evolution

    section Setup Phase
        Installation : pip install crawl4ai
                     : crawl4ai-setup
        Basic Test : crwl https://example.com
        Config Setup : crwl config set defaults

    section Profile Creation
        Site Analysis : Identify auth requirements
        Profile Creation : crwl profiles
        Manual Login : Authenticate in browser
        Profile Save : Press 'q' to save

    section Development Phase
        Test Crawls : crwl URL -p profile -v
        Config Tuning : Adjust browser/crawler settings
        Output Testing : Try different output formats
        Error Handling : Debug authentication issues

    section Production Phase
        Automated Crawls : crwl URL -p profile -o json
        Batch Processing : Multiple URLs with same profile
        Monitoring : Check profile validity
        Maintenance : Update profiles as needed
```

### Multi-Profile Management Strategy

```mermaid
graph LR
    subgraph "Profile Categories"
        A[Social Media Profiles]
        B[Work/Enterprise Profiles]
        C[E-commerce Profiles]
        D[Research Profiles]
    end

    subgraph "Social Media"
        A --> A1[linkedin-personal]
        A --> A2[twitter-monitor]
        A --> A3[facebook-research]
        A --> A4[instagram-brand]
    end

    subgraph "Enterprise"
        B --> B1[company-intranet]
        B --> B2[github-enterprise]
        B --> B3[confluence-docs]
        B --> B4[jira-tickets]
    end

    subgraph "E-commerce"
        C --> C1[amazon-seller]
        C --> C2[shopify-admin]
        C --> C3[ebay-monitor]
        C --> C4[marketplace-competitor]
    end

    subgraph "Research"
        D --> D1[academic-journals]
        D --> D2[data-platforms]
        D --> D3[survey-tools]
        D --> D4[government-portals]
    end

    subgraph "Usage Patterns"
        E[Daily Monitoring] --> A2
        E --> B1
        F[Weekly Reports] --> C3
        F --> D2
        G[On-Demand Research] --> D1
        G --> D4
        H[Competitive Analysis] --> C4
        H --> A4
    end

    style A1 fill:#e3f2fd
    style B1 fill:#f3e5f5
    style C1 fill:#e8f5e8
    style D1 fill:#fff3e0
```

**📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/)

1421
docs/md_v2/assets/llm.txt/diagrams/config_objects.txt
Normal file
@@ -0,0 +1,401 @@

## Deep Crawling Filters & Scorers Architecture

Visual representations of advanced URL filtering, scoring strategies, and performance optimization workflows for intelligent deep crawling.

### Filter Chain Processing Pipeline

```mermaid
flowchart TD
    A[URL Input] --> B{Domain Filter}
    B -->|✓ Pass| C{Pattern Filter}
    B -->|✗ Fail| X1[Reject: Invalid Domain]

    C -->|✓ Pass| D{Content Type Filter}
    C -->|✗ Fail| X2[Reject: Pattern Mismatch]

    D -->|✓ Pass| E{SEO Filter}
    D -->|✗ Fail| X3[Reject: Wrong Content Type]

    E -->|✓ Pass| F{Content Relevance Filter}
    E -->|✗ Fail| X4[Reject: Low SEO Score]

    F -->|✓ Pass| G[URL Accepted]
    F -->|✗ Fail| X5[Reject: Low Relevance]

    G --> H[Add to Crawl Queue]

    subgraph "Fast Filters"
        B
        C
        D
    end

    subgraph "Slow Filters"
        E
        F
    end

    style A fill:#e3f2fd
    style G fill:#c8e6c9
    style H fill:#e8f5e8
    style X1 fill:#ffcdd2
    style X2 fill:#ffcdd2
    style X3 fill:#ffcdd2
    style X4 fill:#ffcdd2
    style X5 fill:#ffcdd2
```
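The pipeline above is just a chain of predicates that short-circuits at the first failure, with cheap string checks ahead of expensive analysis. A minimal sketch under that assumption; the filter names echo the diagram but are plain functions here, not Crawl4AI's actual filter classes:

```python
from urllib.parse import urlparse

# Each stage is a predicate; the chain is ordered cheap-to-expensive so that
# fast string checks reject most URLs before any heavy analysis runs.
# Allowed domains / patterns below are illustrative.
def domain_filter(url, allowed=("python.org",)):
    return urlparse(url).netloc in allowed

def pattern_filter(url, required="/tutorial/"):
    return required in urlparse(url).path

def content_type_filter(url, ok_suffixes=(".html", "/")):
    return urlparse(url).path.endswith(ok_suffixes)

def passes_chain(url, filters):
    # all() short-circuits at the first False, like the Reject branches above.
    return all(f(url) for f in filters)

chain = [domain_filter, pattern_filter, content_type_filter]
print(passes_chain("https://python.org/tutorial/intro.html", chain))   # True
print(passes_chain("https://example.com/tutorial/intro.html", chain))  # False: domain
```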

### URL Scoring System Architecture

```mermaid
graph TB
    subgraph "Input URL"
        A[https://python.org/tutorial/2024/ml-guide.html]
    end

    subgraph "Individual Scorers"
        B[Keyword Relevance Scorer]
        C[Path Depth Scorer]
        D[Content Type Scorer]
        E[Freshness Scorer]
        F[Domain Authority Scorer]
    end

    subgraph "Scoring Process"
        B --> B1[Keywords: python, tutorial, ml<br/>Score: 0.85]
        C --> C1[Depth: 4 levels<br/>Optimal: 3<br/>Score: 0.75]
        D --> D1[Content: HTML<br/>Score: 1.0]
        E --> E1[Year: 2024<br/>Score: 1.0]
        F --> F1[Domain: python.org<br/>Score: 1.0]
    end

    subgraph "Composite Scoring"
        G[Weighted Combination]
        B1 --> G
        C1 --> G
        D1 --> G
        E1 --> G
        F1 --> G
    end

    subgraph "Final Result"
        H[Composite Score: 0.92]
        I{Score > Threshold?}
        J[Accept URL]
        K[Reject URL]
    end

    A --> B
    A --> C
    A --> D
    A --> E
    A --> F

    G --> H
    H --> I
    I -->|✓ 0.92 > 0.6| J
    I -->|✗ Score too low| K

    style A fill:#e3f2fd
    style G fill:#fff3e0
    style H fill:#e8f5e8
    style J fill:#c8e6c9
    style K fill:#ffcdd2
```

### Filter vs Scorer Decision Matrix

```mermaid
flowchart TD
    A[URL Processing Decision] --> B{Binary Decision Needed?}

    B -->|Yes - Include/Exclude| C[Use Filters]
    B -->|No - Quality Rating| D[Use Scorers]

    C --> C1{Filter Type Needed?}
    C1 -->|Domain Control| C2[DomainFilter]
    C1 -->|Pattern Matching| C3[URLPatternFilter]
    C1 -->|Content Type| C4[ContentTypeFilter]
    C1 -->|SEO Quality| C5[SEOFilter]
    C1 -->|Content Relevance| C6[ContentRelevanceFilter]

    D --> D1{Scoring Criteria?}
    D1 -->|Keyword Relevance| D2[KeywordRelevanceScorer]
    D1 -->|URL Structure| D3[PathDepthScorer]
    D1 -->|Content Quality| D4[ContentTypeScorer]
    D1 -->|Time Sensitivity| D5[FreshnessScorer]
    D1 -->|Source Authority| D6[DomainAuthorityScorer]

    C2 --> E[Chain Filters]
    C3 --> E
    C4 --> E
    C5 --> E
    C6 --> E

    D2 --> F[Composite Scorer]
    D3 --> F
    D4 --> F
    D5 --> F
    D6 --> F

    E --> G[Binary Output: Pass/Fail]
    F --> H[Numeric Score: 0.0-1.0]

    G --> I[Apply to URL Queue]
    H --> J[Priority Ranking]

    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#e3f2fd
    style G fill:#c8e6c9
    style H fill:#ffecb3
```

### Performance Optimization Strategy

```mermaid
sequenceDiagram
    participant Queue as URL Queue
    participant Fast as Fast Filters
    participant Slow as Slow Filters
    participant Score as Scorers
    participant Output as Filtered URLs

    Note over Queue, Output: Batch Processing (1000 URLs)

    Queue->>Fast: Apply Domain Filter
    Fast-->>Queue: 60% passed (600 URLs)

    Queue->>Fast: Apply Pattern Filter
    Fast-->>Queue: 70% passed (420 URLs)

    Queue->>Fast: Apply Content Type Filter
    Fast-->>Queue: 90% passed (378 URLs)

    Note over Fast: Fast filters eliminate 62% of URLs

    Queue->>Slow: Apply SEO Filter (378 URLs)
    Slow-->>Queue: 80% passed (302 URLs)

    Queue->>Slow: Apply Relevance Filter
    Slow-->>Queue: 75% passed (227 URLs)

    Note over Slow: Content analysis on remaining URLs

    Queue->>Score: Calculate Composite Scores
    Score-->>Queue: Scored and ranked

    Queue->>Output: Top 100 URLs by score
    Output-->>Queue: Processing complete

    Note over Queue, Output: Total: 90% filtered out, 10% high-quality URLs retained
```
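The pass rates in the sequence above compound multiplicatively, which is why fast filters placed first do most of the elimination before any slow filter runs. A few lines reproduce the funnel:

```python
# Reproduce the filter funnel from the sequence diagram: each stage keeps a
# fraction of the URLs the previous stage passed (counts rounded to whole URLs).
urls = 1000
stages = [
    ("Domain", 0.60),
    ("Pattern", 0.70),
    ("Content Type", 0.90),
    ("SEO", 0.80),
    ("Relevance", 0.75),
]
remaining = urls
for name, keep in stages:
    remaining = int(remaining * keep + 0.5)
    print(f"{name}: {remaining} URLs remain")

top = min(100, remaining)  # scorers then keep only the top 100 by score
print(f"Filtered out overall: {1 - top / urls:.0%}")
```

Running this yields the counts in the diagram (600, 420, 378, 302, 227) and the 90% overall reduction once the top-100 cut is applied.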

### Custom Filter Implementation Flow

```mermaid
stateDiagram-v2
    [*] --> Planning

    Planning --> IdentifyNeeds: Define filtering criteria
    IdentifyNeeds --> ChooseType: Binary vs Scoring decision

    ChooseType --> FilterImpl: Binary decision needed
    ChooseType --> ScorerImpl: Quality rating needed

    FilterImpl --> InheritURLFilter: Extend URLFilter base class
    ScorerImpl --> InheritURLScorer: Extend URLScorer base class

    InheritURLFilter --> ImplementApply: def apply(url) -> bool
    InheritURLScorer --> ImplementScore: def _calculate_score(url) -> float

    ImplementApply --> AddLogic: Add custom filtering logic
    ImplementScore --> AddLogic

    AddLogic --> TestFilter: Unit testing
    TestFilter --> OptimizePerf: Performance optimization

    OptimizePerf --> Integration: Integrate with FilterChain
    Integration --> Production: Deploy to production

    Production --> Monitor: Monitor performance
    Monitor --> Tune: Tune parameters
    Tune --> Production

    note right of Planning : Consider performance impact
    note right of AddLogic : Handle edge cases
    note right of OptimizePerf : Cache frequently accessed data
```
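The two extension points named in the state diagram are `apply(url) -> bool` for filters and `_calculate_score(url) -> float` for scorers. A sketch of both, using stand-in base classes that mirror those method names rather than Crawl4AI's actual `URLFilter`/`URLScorer` implementations:

```python
from abc import ABC, abstractmethod

# Stand-in base classes mirroring the two extension points in the diagram.
class URLFilter(ABC):
    @abstractmethod
    def apply(self, url: str) -> bool: ...

class URLScorer(ABC):
    @abstractmethod
    def _calculate_score(self, url: str) -> float: ...

    def score(self, url: str) -> float:
        # Clamp to the documented 0.0-1.0 score range.
        return max(0.0, min(1.0, self._calculate_score(url)))

class BlockTrackingParams(URLFilter):
    """Binary decision: reject URLs carrying tracking query parameters."""
    def apply(self, url: str) -> bool:
        return "utm_" not in url

class ShallowPathScorer(URLScorer):
    """Quality rating: shallower paths score higher."""
    def _calculate_score(self, url: str) -> float:
        depth = url.rstrip("/").count("/") - 2  # subtract the scheme's '//'
        return 1.0 / (1 + max(0, depth))

print(BlockTrackingParams().apply("https://example.com/a?utm_source=x"))  # False
print(ShallowPathScorer().score("https://example.com/docs"))
```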

### Filter Chain Optimization Patterns

```mermaid
graph TB
    subgraph "Naive Approach - Poor Performance"
        A1[All URLs] --> B1[Slow Filter 1]
        B1 --> C1[Slow Filter 2]
        C1 --> D1[Fast Filter 1]
        D1 --> E1[Fast Filter 2]
        E1 --> F1[Final Results]

        B1 -.->|High CPU| G1[Performance Issues]
        C1 -.->|Network Calls| G1
    end

    subgraph "Optimized Approach - High Performance"
        A2[All URLs] --> B2[Fast Filter 1]
        B2 --> C2[Fast Filter 2]
        C2 --> D2[Batch Process]
        D2 --> E2[Slow Filter 1]
        E2 --> F2[Slow Filter 2]
        F2 --> G2[Final Results]

        D2 --> H2[Concurrent Processing]
        H2 --> I2[Semaphore Control]
    end

    subgraph "Performance Metrics"
        J[Processing Time]
        K[Memory Usage]
        L[CPU Utilization]
        M[Network Requests]
    end

    G1 -.-> J
    G1 -.-> K
    G1 -.-> L
    G1 -.-> M

    G2 -.-> J
    G2 -.-> K
    G2 -.-> L
    G2 -.-> M

    style A1 fill:#ffcdd2
    style G1 fill:#ffcdd2
    style A2 fill:#c8e6c9
    style G2 fill:#c8e6c9
    style H2 fill:#e8f5e8
    style I2 fill:#e8f5e8
```

### Composite Scoring Weight Distribution

```mermaid
pie title Composite Scorer Weight Distribution
    "Keyword Relevance (30%)" : 30
    "Domain Authority (25%)" : 25
    "Content Type (20%)" : 20
    "Freshness (15%)" : 15
    "Path Depth (10%)" : 10
```
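With these weights, the composite score is a plain weighted sum of the component scores. A sketch using the illustrative component values from the URL scoring example earlier (keyword 0.85, authority 1.0, content type 1.0, freshness 1.0, path depth 0.75); the exact result depends on how the individual scorers are configured:

```python
# Weighted combination using the pie-chart weights above. Component scores
# are the illustrative values from the earlier URL scoring example.
weights = {
    "keyword_relevance": 0.30,
    "domain_authority": 0.25,
    "content_type": 0.20,
    "freshness": 0.15,
    "path_depth": 0.10,
}
scores = {
    "keyword_relevance": 0.85,
    "domain_authority": 1.0,
    "content_type": 1.0,
    "freshness": 1.0,
    "path_depth": 0.75,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1

composite = sum(weights[k] * scores[k] for k in weights)
print(f"Composite score: {composite:.2f}")
```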

### Deep Crawl Integration Architecture

```mermaid
graph TD
    subgraph "Deep Crawl Strategy"
        A[Start URL] --> B[Extract Links]
        B --> C[Apply Filter Chain]
        C --> D[Calculate Scores]
        D --> E[Priority Queue]
        E --> F[Crawl Next URL]
        F --> B
    end

    subgraph "Filter Chain Components"
        C --> C1[Domain Filter]
        C --> C2[Pattern Filter]
        C --> C3[Content Filter]
        C --> C4[SEO Filter]
        C --> C5[Relevance Filter]
    end

    subgraph "Scoring Components"
        D --> D1[Keyword Scorer]
        D --> D2[Depth Scorer]
        D --> D3[Freshness Scorer]
        D --> D4[Authority Scorer]
        D --> D5[Composite Score]
    end

    subgraph "Queue Management"
        E --> E1{Score > Threshold?}
        E1 -->|Yes| E2[High Priority Queue]
        E1 -->|No| E3[Low Priority Queue]
        E2 --> F
        E3 --> G[Delayed Processing]
    end

    subgraph "Control Flow"
        H{Max Depth Reached?}
        I{Max Pages Reached?}
        J[Stop Crawling]
    end

    F --> H
    H -->|No| I
    H -->|Yes| J
    I -->|No| B
    I -->|Yes| J

    style A fill:#e3f2fd
    style E2 fill:#c8e6c9
    style E3 fill:#fff3e0
    style J fill:#ffcdd2
```

### Filter Performance Comparison

```mermaid
xychart-beta
    title "Filter Performance Comparison (1000 URLs)"
    x-axis [Domain, Pattern, ContentType, SEO, Relevance]
    y-axis "Processing Time (ms)" 0 --> 1000
    bar [50, 80, 45, 300, 800]
```

### Scoring Algorithm Workflow

```mermaid
flowchart TD
    A[Input URL] --> B[Parse URL Components]
    B --> C[Extract Features]

    C --> D[Domain Analysis]
    C --> E[Path Analysis]
    C --> F[Content Type Detection]
    C --> G[Keyword Extraction]
    C --> H[Freshness Detection]

    D --> I[Domain Authority Score]
    E --> J[Path Depth Score]
    F --> K[Content Type Score]
    G --> L[Keyword Relevance Score]
    H --> M[Freshness Score]

    I --> N[Apply Weights]
    J --> N
    K --> N
    L --> N
    M --> N

    N --> O[Normalize Scores]
    O --> P[Calculate Final Score]
    P --> Q{Score >= Threshold?}

    Q -->|Yes| R[Accept for Crawling]
    Q -->|No| S[Reject URL]

    R --> T[Add to Priority Queue]
    S --> U[Log Rejection Reason]

    style A fill:#e3f2fd
    style P fill:#fff3e0
    style R fill:#c8e6c9
    style S fill:#ffcdd2
    style T fill:#e8f5e8
```

**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/), [Custom Implementations](https://docs.crawl4ai.com/advanced/custom-filters/)
428
docs/md_v2/assets/llm.txt/diagrams/deep_crawling.txt
Normal file
@@ -0,0 +1,428 @@

## Deep Crawling Workflows and Architecture

Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.

### Deep Crawl Strategy Overview

```mermaid
flowchart TD
    A[Start Deep Crawl] --> B{Strategy Selection}

    B -->|Explore All Levels| C[BFS Strategy]
    B -->|Dive Deep Fast| D[DFS Strategy]
    B -->|Smart Prioritization| E[Best-First Strategy]

    C --> C1[Breadth-First Search]
    C1 --> C2[Process all depth 0 links]
    C2 --> C3[Process all depth 1 links]
    C3 --> C4[Continue by depth level]

    D --> D1[Depth-First Search]
    D1 --> D2[Follow first link deeply]
    D2 --> D3[Backtrack when max depth reached]
    D3 --> D4[Continue with next branch]

    E --> E1[Best-First Search]
    E1 --> E2[Score all discovered URLs]
    E2 --> E3[Process highest scoring URLs first]
    E3 --> E4[Continuously re-prioritize queue]

    C4 --> F[Apply Filters]
    D4 --> F
    E4 --> F

    F --> G{Filter Chain Processing}
    G -->|Domain Filter| G1[Check allowed/blocked domains]
    G -->|URL Pattern Filter| G2[Match URL patterns]
    G -->|Content Type Filter| G3[Verify content types]
    G -->|SEO Filter| G4[Evaluate SEO quality]
    G -->|Content Relevance| G5[Score content relevance]

    G1 --> H{Passed All Filters?}
    G2 --> H
    G3 --> H
    G4 --> H
    G5 --> H

    H -->|Yes| I[Add to Crawl Queue]
    H -->|No| J[Discard URL]

    I --> K{Processing Mode}
    K -->|Streaming| L[Process Immediately]
    K -->|Batch| M[Collect All Results]

    L --> N[Stream Result to User]
    M --> O[Return Complete Result Set]

    J --> P{More URLs in Queue?}
    N --> P
    O --> P

    P -->|Yes| Q{Within Limits?}
    P -->|No| R[Deep Crawl Complete]

    Q -->|Max Depth OK| S{Max Pages OK}
    Q -->|Max Depth Exceeded| T[Skip Deeper URLs]

    S -->|Under Limit| U[Continue Crawling]
    S -->|Limit Reached| R

    T --> P
    U --> F

    style A fill:#e1f5fe
    style R fill:#c8e6c9
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e8
```

### Deep Crawl Strategy Comparison

```mermaid
graph TB
    subgraph "BFS - Breadth-First Search"
        BFS1[Level 0: Start URL]
        BFS2[Level 1: All direct links]
        BFS3[Level 2: All second-level links]
        BFS4[Level 3: All third-level links]

        BFS1 --> BFS2
        BFS2 --> BFS3
        BFS3 --> BFS4

        BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
    end

    subgraph "DFS - Depth-First Search"
        DFS1[Start URL]
        DFS2[First Link → Deep]
        DFS3[Follow until max depth]
        DFS4[Backtrack and try next]

        DFS1 --> DFS2
        DFS2 --> DFS3
        DFS3 --> DFS4
        DFS4 --> DFS2

        DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
    end

    subgraph "Best-First - Priority Queue"
        BF1[Start URL]
        BF2[Score all discovered links]
        BF3[Process highest scoring first]
        BF4[Continuously re-prioritize]

        BF1 --> BF2
        BF2 --> BF3
        BF3 --> BF4
        BF4 --> BF2

        BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
    end

    style BFS1 fill:#e3f2fd
    style DFS1 fill:#f3e5f5
    style BF1 fill:#e8f5e8
    style BFS_NOTE fill:#fff3e0
    style DFS_NOTE fill:#fff3e0
    style BF_NOTE fill:#fff3e0
```
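The difference between the two uninformed strategies comes down to the frontier data structure: BFS uses a FIFO queue, DFS a LIFO stack. A self-contained sketch on a tiny illustrative link graph:

```python
from collections import deque

# Tiny illustrative link graph: each page maps to the links found on it.
links = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def bfs(root, max_depth=2):
    """Finish each depth level before going deeper (FIFO frontier)."""
    order, seen, queue = [], {root}, deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for nxt in links[url]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return order

def dfs(root, max_depth=2):
    """Follow the first link deeply, then backtrack (LIFO frontier)."""
    order, seen, stack = [], {root}, [(root, 0)]
    while stack:
        url, depth = stack.pop()
        order.append(url)
        if depth < max_depth:
            for nxt in reversed(links[url]):  # keep left-to-right order
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, depth + 1))
    return order

print(bfs("start"))  # ['start', 'a', 'b', 'a1', 'a2', 'b1'] - level by level
print(dfs("start"))  # ['start', 'a', 'a1', 'a2', 'b', 'b1'] - branch by branch
```

Best-First replaces both with a priority queue ordered by score, which is sketched under the scoring diagrams below.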

### Filter Chain Processing Sequence

```mermaid
sequenceDiagram
    participant URL as Discovered URL
    participant Chain as Filter Chain
    participant Domain as Domain Filter
    participant Pattern as URL Pattern Filter
    participant Content as Content Type Filter
    participant SEO as SEO Filter
    participant Relevance as Content Relevance Filter
    participant Queue as Crawl Queue

    URL->>Chain: Process URL
    Chain->>Domain: Check domain rules

    alt Domain Allowed
        Domain-->>Chain: ✓ Pass
        Chain->>Pattern: Check URL patterns

        alt Pattern Matches
            Pattern-->>Chain: ✓ Pass
            Chain->>Content: Check content type

            alt Content Type Valid
                Content-->>Chain: ✓ Pass
                Chain->>SEO: Evaluate SEO quality

                alt SEO Score Above Threshold
                    SEO-->>Chain: ✓ Pass
                    Chain->>Relevance: Score content relevance

                    alt Relevance Score High
                        Relevance-->>Chain: ✓ Pass
                        Chain->>Queue: Add to crawl queue
                        Queue-->>URL: Queued for crawling
                    else Relevance Score Low
                        Relevance-->>Chain: ✗ Reject
                        Chain-->>URL: Filtered out - Low relevance
                    end
                else SEO Score Low
                    SEO-->>Chain: ✗ Reject
                    Chain-->>URL: Filtered out - Poor SEO
                end
            else Invalid Content Type
                Content-->>Chain: ✗ Reject
                Chain-->>URL: Filtered out - Wrong content type
            end
        else Pattern Mismatch
            Pattern-->>Chain: ✗ Reject
            Chain-->>URL: Filtered out - Pattern mismatch
        end
    else Domain Blocked
        Domain-->>Chain: ✗ Reject
        Chain-->>URL: Filtered out - Blocked domain
    end
```

### URL Lifecycle State Machine

```mermaid
stateDiagram-v2
    [*] --> Discovered: Found on page

    Discovered --> FilterPending: Enter filter chain

    FilterPending --> DomainCheck: Apply domain filter
    DomainCheck --> PatternCheck: Domain allowed
    DomainCheck --> Rejected: Domain blocked

    PatternCheck --> ContentCheck: Pattern matches
    PatternCheck --> Rejected: Pattern mismatch

    ContentCheck --> SEOCheck: Content type valid
    ContentCheck --> Rejected: Invalid content

    SEOCheck --> RelevanceCheck: SEO score sufficient
    SEOCheck --> Rejected: Poor SEO score

    RelevanceCheck --> Scored: Relevance score calculated
    RelevanceCheck --> Rejected: Low relevance

    Scored --> Queued: Added to priority queue

    Queued --> Crawling: Selected for processing
    Crawling --> Success: Page crawled successfully
    Crawling --> Failed: Crawl failed

    Success --> LinkExtraction: Extract new links
    LinkExtraction --> [*]: Process complete

    Failed --> [*]: Record failure
    Rejected --> [*]: Log rejection reason

    note right of Scored : Score determines priority<br/>in Best-First strategy

    note right of Failed : Errors logged with<br/>depth and reason
```

### Streaming vs Batch Processing Architecture

```mermaid
graph TB
    subgraph "Input"
        A[Start URL] --> B[Deep Crawl Strategy]
    end

    subgraph "Crawl Engine"
        B --> C[URL Discovery]
        C --> D[Filter Chain]
        D --> E[Priority Queue]
        E --> F[Page Processor]
    end

    subgraph "Streaming Mode stream=True"
        F --> G1[Process Page]
        G1 --> H1[Extract Content]
        H1 --> I1[Yield Result Immediately]
        I1 --> J1[async for result]
        J1 --> K1[Real-time Processing]

        G1 --> L1[Extract Links]
        L1 --> M1[Add to Queue]
        M1 --> F
    end

    subgraph "Batch Mode stream=False"
        F --> G2[Process Page]
        G2 --> H2[Extract Content]
        H2 --> I2[Store Result]
        I2 --> N2[Result Collection]

        G2 --> L2[Extract Links]
        L2 --> M2[Add to Queue]
        M2 --> O2{More URLs?}
        O2 -->|Yes| F
        O2 -->|No| P2[Return All Results]
        P2 --> Q2[Batch Processing]
    end

    style I1 fill:#e8f5e8
    style K1 fill:#e8f5e8
    style P2 fill:#e3f2fd
    style Q2 fill:#e3f2fd
```
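The two modes differ only in when results reach the caller: streaming hands over each result as soon as its page is processed, while batch drains everything and returns the complete set. A minimal asyncio sketch of that shape; the async-generator form mirrors the `async for result` node above but is not Crawl4AI's exact API:

```python
import asyncio

# Stand-in page processor: yields one result per URL as it is "processed".
async def crawl_pages(urls):
    for url in urls:
        await asyncio.sleep(0)  # stand-in for fetch + extract
        yield {"url": url, "status": "ok"}

async def stream_mode(urls, handle):
    # Streaming: the caller's handler runs per result, in real time.
    async for result in crawl_pages(urls):
        handle(result)

async def batch_mode(urls):
    # Batch: drain the generator, return the complete result set at the end.
    return [r async for r in crawl_pages(urls)]

urls = ["https://example.com/a", "https://example.com/b"]
seen = []
asyncio.run(stream_mode(urls, seen.append))
print([r["url"] for r in seen])
```

Streaming keeps memory flat and lets downstream processing overlap with crawling; batch is simpler when the whole result set is needed at once.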

### Advanced Scoring and Prioritization System

```mermaid
flowchart LR
    subgraph "URL Discovery"
        A[Page Links] --> B[Extract URLs]
        B --> C[Normalize URLs]
    end

    subgraph "Scoring System"
        C --> D[Keyword Relevance Scorer]
        D --> D1[URL Text Analysis]
        D --> D2[Keyword Matching]
        D --> D3[Calculate Base Score]

        D3 --> E[Additional Scoring Factors]
        E --> E1[URL Structure weight: 0.2]
        E --> E2[Link Context weight: 0.3]
        E --> E3[Page Depth Penalty weight: 0.1]
        E --> E4[Domain Authority weight: 0.4]

        D1 --> F[Combined Score]
        D2 --> F
        D3 --> F
        E1 --> F
        E2 --> F
        E3 --> F
        E4 --> F
    end

    subgraph "Prioritization"
        F --> G{Score Threshold}
        G -->|Above Threshold| H[Priority Queue]
        G -->|Below Threshold| I[Discard URL]

        H --> J[Best-First Selection]
        J --> K[Highest Score First]
        K --> L[Process Page]

        L --> M[Update Scores]
        M --> N[Re-prioritize Queue]
        N --> J
    end

    style F fill:#fff3e0
    style H fill:#e8f5e8
    style L fill:#e3f2fd
```
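The "Best-First Selection" loop above is a priority queue popped highest-score-first. A sketch with the standard library's `heapq` (a min-heap, so scores are negated on push); the scores are illustrative:

```python
import heapq

# Best-first frontier sketch: heapq is a min-heap, so push negated scores
# to pop the highest-scoring URL first. Scores below are illustrative.
frontier = []
for score, url in [(0.4, "https://a"), (0.9, "https://b"), (0.7, "https://c")]:
    heapq.heappush(frontier, (-score, url))

visit_order = []
while frontier:
    neg_score, url = heapq.heappop(frontier)
    visit_order.append(url)

print(visit_order)  # ['https://b', 'https://c', 'https://a']
```

Re-prioritization after each page is just more `heappush` calls: newly discovered, newly scored links interleave with the existing frontier automatically.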

### Deep Crawl Performance and Limits

```mermaid
graph TD
    subgraph "Crawl Constraints"
        A[Max Depth: 2] --> A1[Prevents infinite crawling]
        B[Max Pages: 50] --> B1[Controls resource usage]
        C[Score Threshold: 0.3] --> C1[Quality filtering]
        D[Domain Limits] --> D1[Scope control]
    end

    subgraph "Performance Monitoring"
        E[Pages Crawled] --> F[Depth Distribution]
        E --> G[Success Rate]
        E --> H[Average Score]
        E --> I[Processing Time]

        F --> J[Performance Report]
        G --> J
        H --> J
        I --> J
    end

    subgraph "Resource Management"
        K[Memory Usage] --> L{Memory Threshold}
        L -->|Under Limit| M[Continue Crawling]
        L -->|Over Limit| N[Reduce Concurrency]

        O[CPU Usage] --> P{CPU Threshold}
        P -->|Normal| M
        P -->|High| Q[Add Delays]

        R[Network Load] --> S{Rate Limits}
        S -->|OK| M
        S -->|Exceeded| T[Throttle Requests]
    end

    M --> U[Optimal Performance]
    N --> V[Reduced Performance]
    Q --> V
    T --> V

    style U fill:#c8e6c9
    style V fill:#fff3e0
    style J fill:#e3f2fd
```
|
||||
|
||||
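The constraint checks above reduce to a small guard function. The limits shown (depth 2, 50 pages, 0.3 score) are the diagram's example values, not library defaults.

```python
# Example limits from the constraints subgraph above.
MAX_DEPTH, MAX_PAGES, SCORE_THRESHOLD = 2, 50, 0.3

def should_crawl(depth: int, pages_crawled: int, score: float) -> bool:
    if depth > MAX_DEPTH:            # prevents infinite crawling
        return False
    if pages_crawled >= MAX_PAGES:   # controls resource usage
        return False
    return score >= SCORE_THRESHOLD  # quality filtering
```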
### Error Handling and Recovery Flow

```mermaid
sequenceDiagram
    participant Strategy as Deep Crawl Strategy
    participant Queue as Priority Queue
    participant Crawler as Page Crawler
    participant Error as Error Handler
    participant Result as Result Collector

    Strategy->>Queue: Get next URL
    Queue-->>Strategy: Return highest priority URL

    Strategy->>Crawler: Crawl page

    alt Successful Crawl
        Crawler-->>Strategy: Return page content
        Strategy->>Result: Store successful result
        Strategy->>Strategy: Extract new links
        Strategy->>Queue: Add new URLs to queue
    else Network Error
        Crawler-->>Error: Network timeout/failure
        Error->>Error: Log error with details
        Error->>Queue: Mark URL as failed
        Error-->>Strategy: Skip to next URL
    else Parse Error
        Crawler-->>Error: HTML parsing failed
        Error->>Error: Log parse error
        Error->>Result: Store failed result
        Error-->>Strategy: Continue with next URL
    else Rate Limit Hit
        Crawler-->>Error: Rate limit exceeded
        Error->>Error: Apply backoff strategy
        Error->>Queue: Re-queue URL with delay
        Error-->>Strategy: Wait before retry
    else Depth Limit
        Strategy->>Strategy: Check depth constraint
        Strategy-->>Queue: Skip URL - too deep
    else Page Limit
        Strategy->>Strategy: Check page count
        Strategy-->>Result: Stop crawling - limit reached
    end

    Strategy->>Queue: Request next URL
    Queue-->>Strategy: More URLs available?

    alt Queue Empty
        Queue-->>Result: Crawl complete
    else Queue Has URLs
        Queue-->>Strategy: Continue crawling
    end
```
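A minimal sketch of the recovery dispatch, with exponential backoff plus jitter for the rate-limit branch; the outcome labels and action names are illustrative shorthand for the branches in the sequence above.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter for re-queued rate-limited URLs."""
    return min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)

def handle_outcome(outcome: str, attempt: int = 0) -> str:
    """Map a crawl outcome to the recovery action shown in the sequence above."""
    if outcome == "ok":
        return "store_and_extract_links"
    if outcome == "network_error":
        return "mark_failed_skip"       # log, mark URL failed, move on
    if outcome == "parse_error":
        return "store_failed_continue"  # keep a failed result for reporting
    if outcome == "rate_limited":
        return f"requeue_after_{backoff_delay(attempt):.1f}s"
    raise ValueError(f"unknown outcome: {outcome}")
```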
**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)
603
docs/md_v2/assets/llm.txt/diagrams/docker.txt
Normal file
@@ -0,0 +1,603 @@
## Docker Deployment Architecture and Workflows

Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions.

### Docker Deployment Decision Flow

```mermaid
flowchart TD
    A[Start Docker Deployment] --> B{Deployment Type?}

    B -->|Quick Start| C[Pre-built Image]
    B -->|Development| D[Docker Compose]
    B -->|Custom Build| E[Manual Build]
    B -->|Production| F[Production Setup]

    C --> C1[docker pull unclecode/crawl4ai]
    C1 --> C2{Need LLM Support?}
    C2 -->|Yes| C3[Setup .llm.env]
    C2 -->|No| C4[Basic run]
    C3 --> C5[docker run with --env-file]
    C4 --> C6[docker run basic]

    D --> D1[git clone repository]
    D1 --> D2[cp .llm.env.example .llm.env]
    D2 --> D3{Build Type?}
    D3 -->|Pre-built| D4[IMAGE=latest docker compose up]
    D3 -->|Local Build| D5[docker compose up --build]
    D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up]

    E --> E1[docker buildx build]
    E1 --> E2{Architecture?}
    E2 -->|Single| E3[--platform linux/amd64]
    E2 -->|Multi| E4[--platform linux/amd64,linux/arm64]
    E3 --> E5[Build complete]
    E4 --> E5

    F --> F1[Production configuration]
    F1 --> F2[Custom config.yml]
    F2 --> F3[Resource limits]
    F3 --> F4[Health monitoring]
    F4 --> F5[Production ready]

    C5 --> G[Service running on :11235]
    C6 --> G
    D4 --> G
    D5 --> G
    D6 --> G
    E5 --> H[docker run custom image]
    H --> G
    F5 --> I[Production deployment]

    G --> J[Access playground at /playground]
    G --> K[Health check at /health]
    I --> L[Production monitoring]

    style A fill:#e1f5fe
    style G fill:#c8e6c9
    style I fill:#c8e6c9
    style J fill:#fff3e0
    style K fill:#fff3e0
    style L fill:#e8f5e8
```
### Docker Container Architecture

```mermaid
graph TB
    subgraph "Host Environment"
        A[Docker Engine] --> B[Crawl4AI Container]
        C[.llm.env] --> B
        D[Custom config.yml] --> B
        E[Port 11235] --> B
        F[Shared Memory 1GB+] --> B
    end

    subgraph "Container Services"
        B --> G[FastAPI Server :8020]
        B --> H[Gunicorn WSGI]
        B --> I[Supervisord Process Manager]
        B --> J[Redis Cache :6379]

        G --> K[REST API Endpoints]
        G --> L[WebSocket Connections]
        G --> M[MCP Protocol]

        H --> N[Worker Processes]
        I --> O[Service Monitoring]
        J --> P[Request Caching]
    end

    subgraph "Browser Management"
        B --> Q[Playwright Framework]
        Q --> R[Chromium Browser]
        Q --> S[Firefox Browser]
        Q --> T[WebKit Browser]

        R --> U[Browser Pool]
        S --> U
        T --> U

        U --> V[Page Sessions]
        U --> W[Context Management]
    end

    subgraph "External Services"
        X[OpenAI API] -.-> K
        Y[Anthropic Claude] -.-> K
        Z[Local Ollama] -.-> K
        AA[Groq API] -.-> K
        BB[Google Gemini] -.-> K
    end

    subgraph "Client Interactions"
        CC[Python SDK] --> K
        DD[REST API Calls] --> K
        EE[MCP Clients] --> M
        FF[Web Browser] --> G
        GG[Monitoring Tools] --> K
    end

    style B fill:#e3f2fd
    style G fill:#f3e5f5
    style Q fill:#e8f5e8
    style K fill:#fff3e0
```
### API Endpoints Architecture

```mermaid
graph LR
    subgraph "Core Endpoints"
        A["/crawl"] --> A1[Single URL crawl]
        A2["/crawl/stream"] --> A3[Streaming multi-URL]
        A4["/crawl/job"] --> A5[Async job submission]
        A6["/crawl/job/{id}"] --> A7[Job status check]
    end

    subgraph "Specialized Endpoints"
        B["/html"] --> B1[Preprocessed HTML]
        B2["/screenshot"] --> B3[PNG capture]
        B4["/pdf"] --> B5[PDF generation]
        B6["/execute_js"] --> B7[JavaScript execution]
        B8["/md"] --> B9[Markdown extraction]
    end

    subgraph "Utility Endpoints"
        C["/health"] --> C1[Service status]
        C2["/metrics"] --> C3[Prometheus metrics]
        C4["/schema"] --> C5[API documentation]
        C6["/playground"] --> C7[Interactive testing]
    end

    subgraph "LLM Integration"
        D["/llm/{url}"] --> D1[Q&A over URL]
        D2["/ask"] --> D3[Library context search]
        D4["/config/dump"] --> D5[Config validation]
    end

    subgraph "MCP Protocol"
        E["/mcp/sse"] --> E1[Server-Sent Events]
        E2["/mcp/ws"] --> E3[WebSocket connection]
        E4["/mcp/schema"] --> E5[MCP tool definitions]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```
### Request Processing Flow

```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant RequestValidator
    participant BrowserPool
    participant Playwright
    participant ExtractionEngine
    participant LLMProvider

    Client->>FastAPI: POST /crawl with config
    FastAPI->>RequestValidator: Validate JSON structure

    alt Valid Request
        RequestValidator-->>FastAPI: ✓ Validated
        FastAPI->>BrowserPool: Request browser instance
        BrowserPool->>Playwright: Launch browser/reuse session
        Playwright-->>BrowserPool: Browser ready
        BrowserPool-->>FastAPI: Browser allocated

        FastAPI->>Playwright: Navigate to URL
        Playwright->>Playwright: Execute JS, wait conditions
        Playwright-->>FastAPI: Page content ready

        FastAPI->>ExtractionEngine: Process content

        alt LLM Extraction
            ExtractionEngine->>LLMProvider: Send content + schema
            LLMProvider-->>ExtractionEngine: Structured data
        else CSS Extraction
            ExtractionEngine->>ExtractionEngine: Apply CSS selectors
        end

        ExtractionEngine-->>FastAPI: Extraction complete
        FastAPI->>BrowserPool: Release browser
        FastAPI-->>Client: CrawlResult response

    else Invalid Request
        RequestValidator-->>FastAPI: ✗ Validation error
        FastAPI-->>Client: 400 Bad Request
    end
```
### Configuration Management Flow

```mermaid
stateDiagram-v2
    [*] --> ConfigLoading

    ConfigLoading --> DefaultConfig: Load default config.yml
    ConfigLoading --> CustomConfig: Custom config mounted
    ConfigLoading --> EnvOverrides: Environment variables

    DefaultConfig --> ConfigMerging
    CustomConfig --> ConfigMerging
    EnvOverrides --> ConfigMerging

    ConfigMerging --> ConfigValidation

    ConfigValidation --> Valid: Schema validation passes
    ConfigValidation --> Invalid: Validation errors

    Invalid --> ConfigError: Log errors and exit
    ConfigError --> [*]

    Valid --> ServiceInitialization
    ServiceInitialization --> FastAPISetup
    ServiceInitialization --> BrowserPoolInit
    ServiceInitialization --> CacheSetup

    FastAPISetup --> Running
    BrowserPoolInit --> Running
    CacheSetup --> Running

    Running --> ConfigReload: Config change detected
    ConfigReload --> ConfigValidation

    Running --> [*]: Service shutdown

    note right of ConfigMerging : Priority: ENV > Custom > Default
    note right of ServiceInitialization : All services must initialize successfully
```
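The merge priority in the note (ENV > Custom > Default) can be sketched as follows. The `CRAWL4AI_` variable prefix is an assumption for illustration, not necessarily the server's actual naming scheme.

```python
import os

def merge_config(defaults: dict, custom: dict, env_prefix: str = "CRAWL4AI_") -> dict:
    """Merge with the priority shown above: ENV > custom > defaults."""
    merged = {**defaults, **custom}              # custom overrides defaults
    for key in merged:                            # env vars override everything
        env_val = os.environ.get(env_prefix + key.upper())
        if env_val is not None:
            merged[key] = env_val
    return merged

defaults = {"port": "11235", "log_level": "info"}
custom = {"log_level": "debug"}                   # mounted custom config.yml
os.environ["CRAWL4AI_PORT"] = "8080"              # simulate an env override
cfg = merge_config(defaults, custom)
```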
### Multi-Architecture Build Process

```mermaid
flowchart TD
    A[Developer Push] --> B[GitHub Repository]

    B --> C[Docker Buildx]
    C --> D{Build Strategy}

    D -->|Multi-arch| E[Parallel Builds]
    D -->|Single-arch| F[Platform-specific Build]

    E --> G[AMD64 Build]
    E --> H[ARM64 Build]

    F --> I[Target Platform Build]

    subgraph "AMD64 Build Process"
        G --> G1[Ubuntu base image]
        G1 --> G2[Python 3.11 install]
        G2 --> G3[System dependencies]
        G3 --> G4[Crawl4AI installation]
        G4 --> G5[Playwright setup]
        G5 --> G6[FastAPI configuration]
        G6 --> G7[AMD64 image ready]
    end

    subgraph "ARM64 Build Process"
        H --> H1[Ubuntu ARM64 base]
        H1 --> H2[Python 3.11 install]
        H2 --> H3[ARM-specific deps]
        H3 --> H4[Crawl4AI installation]
        H4 --> H5[Playwright setup]
        H5 --> H6[FastAPI configuration]
        H6 --> H7[ARM64 image ready]
    end

    subgraph "Single Architecture"
        I --> I1[Base image selection]
        I1 --> I2[Platform dependencies]
        I2 --> I3[Application setup]
        I3 --> I4[Platform image ready]
    end

    G7 --> J[Multi-arch Manifest]
    H7 --> J
    I4 --> K[Platform Image]

    J --> L[Docker Hub Registry]
    K --> L

    L --> M[docker pull auto-selects architecture]

    style A fill:#e1f5fe
    style J fill:#c8e6c9
    style K fill:#c8e6c9
    style L fill:#f3e5f5
    style M fill:#e8f5e8
```
### MCP Integration Architecture

```mermaid
graph TB
    subgraph "MCP Client Applications"
        A[Claude Code] --> B[MCP Protocol]
        C[Cursor IDE] --> B
        D[Windsurf] --> B
        E[Custom MCP Client] --> B
    end

    subgraph "Crawl4AI MCP Server"
        B --> F[MCP Endpoint Router]
        F --> G[SSE Transport /mcp/sse]
        F --> H[WebSocket Transport /mcp/ws]
        F --> I[Schema Endpoint /mcp/schema]

        G --> J[MCP Tool Handler]
        H --> J

        J --> K[Tool: md]
        J --> L[Tool: html]
        J --> M[Tool: screenshot]
        J --> N[Tool: pdf]
        J --> O[Tool: execute_js]
        J --> P[Tool: crawl]
        J --> Q[Tool: ask]
    end

    subgraph "Crawl4AI Core Services"
        K --> R[Markdown Generator]
        L --> S[HTML Preprocessor]
        M --> T[Screenshot Service]
        N --> U[PDF Generator]
        O --> V[JavaScript Executor]
        P --> W[Batch Crawler]
        Q --> X[Context Search]

        R --> Y[Browser Pool]
        S --> Y
        T --> Y
        U --> Y
        V --> Y
        W --> Y
        X --> Z[Knowledge Base]
    end

    subgraph "External Resources"
        Y --> AA[Playwright Browsers]
        Z --> BB[Library Documentation]
        Z --> CC[Code Examples]
        AA --> DD[Web Pages]
    end

    style B fill:#e3f2fd
    style J fill:#f3e5f5
    style Y fill:#e8f5e8
    style Z fill:#fff3e0
```
### API Request/Response Flow Patterns

```mermaid
sequenceDiagram
    participant Client
    participant LoadBalancer
    participant FastAPI
    participant ConfigValidator
    participant BrowserManager
    participant CrawlEngine
    participant ResponseBuilder

    Note over Client,ResponseBuilder: Basic Crawl Request

    Client->>LoadBalancer: POST /crawl
    LoadBalancer->>FastAPI: Route request

    FastAPI->>ConfigValidator: Validate browser_config
    ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig

    FastAPI->>ConfigValidator: Validate crawler_config
    ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig

    FastAPI->>BrowserManager: Allocate browser
    BrowserManager-->>FastAPI: Browser instance

    FastAPI->>CrawlEngine: Execute crawl

    Note over CrawlEngine: Page processing
    CrawlEngine->>CrawlEngine: Navigate & wait
    CrawlEngine->>CrawlEngine: Extract content
    CrawlEngine->>CrawlEngine: Apply strategies

    CrawlEngine-->>FastAPI: CrawlResult

    FastAPI->>ResponseBuilder: Format response
    ResponseBuilder-->>FastAPI: JSON response

    FastAPI->>BrowserManager: Release browser
    FastAPI-->>LoadBalancer: Response ready
    LoadBalancer-->>Client: 200 OK + CrawlResult

    Note over Client,ResponseBuilder: Streaming Request

    Client->>FastAPI: POST /crawl/stream
    FastAPI-->>Client: 200 OK (stream start)

    loop For each URL
        FastAPI->>CrawlEngine: Process URL
        CrawlEngine-->>FastAPI: Result ready
        FastAPI-->>Client: NDJSON line
    end

    FastAPI-->>Client: Stream completed
```
### Configuration Validation Workflow

```mermaid
flowchart TD
    A[Client Request] --> B[JSON Payload]
    B --> C{Pre-validation}

    C -->|✓ Valid JSON| D[Extract Configurations]
    C -->|✗ Invalid JSON| E[Return 400 Bad Request]

    D --> F[BrowserConfig Validation]
    D --> G[CrawlerRunConfig Validation]

    F --> H{BrowserConfig Valid?}
    G --> I{CrawlerRunConfig Valid?}

    H -->|✓ Valid| J[Browser Setup]
    H -->|✗ Invalid| K[Log Browser Config Errors]

    I -->|✓ Valid| L[Crawler Setup]
    I -->|✗ Invalid| M[Log Crawler Config Errors]

    K --> N[Collect All Errors]
    M --> N
    N --> O[Return 422 Validation Error]

    J --> P{Both Configs Valid?}
    L --> P

    P -->|✓ Yes| Q[Proceed to Crawling]
    P -->|✗ No| O

    Q --> R[Execute Crawl Pipeline]
    R --> S[Return CrawlResult]

    E --> T[Client Error Response]
    O --> T
    S --> U[Client Success Response]

    style A fill:#e1f5fe
    style Q fill:#c8e6c9
    style S fill:#c8e6c9
    style U fill:#c8e6c9
    style E fill:#ffcdd2
    style O fill:#ffcdd2
    style T fill:#ffcdd2
```
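A hypothetical pre-validation sketch of this flow: malformed JSON maps to 400, collected config errors to 422, and a clean payload proceeds. Field names and error messages are illustrative, not the server's actual schema.

```python
import json

def validate_request(raw: str):
    """Return (status_code, errors) following the workflow above."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return 400, ["invalid JSON"]          # pre-validation failure

    errors = []                               # collect ALL errors, as in the flow
    if "urls" not in payload:
        errors.append("request missing 'urls'")
    if not isinstance(payload.get("browser_config", {}), dict):
        errors.append("browser_config must be an object")
    if not isinstance(payload.get("crawler_config", {}), dict):
        errors.append("crawler_config must be an object")
    return (422, errors) if errors else (200, [])
```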
### Production Deployment Architecture

```mermaid
graph TB
    subgraph "Load Balancer Layer"
        A[NGINX/HAProxy] --> B[Health Check]
        A --> C[Request Routing]
        A --> D[SSL Termination]
    end

    subgraph "Application Layer"
        C --> E[Crawl4AI Instance 1]
        C --> F[Crawl4AI Instance 2]
        C --> G[Crawl4AI Instance N]

        E --> H[FastAPI Server]
        F --> I[FastAPI Server]
        G --> J[FastAPI Server]

        H --> K[Browser Pool 1]
        I --> L[Browser Pool 2]
        J --> M[Browser Pool N]
    end

    subgraph "Shared Services"
        N[Redis Cluster] --> E
        N --> F
        N --> G

        O[Monitoring Stack] --> P[Prometheus]
        O --> Q[Grafana]
        O --> R[AlertManager]

        P --> E
        P --> F
        P --> G
    end

    subgraph "External Dependencies"
        S[OpenAI API] -.-> H
        T[Anthropic API] -.-> I
        U[Local LLM Cluster] -.-> J
    end

    subgraph "Persistent Storage"
        V[Configuration Volume] --> E
        V --> F
        V --> G

        W[Cache Volume] --> N
        X[Logs Volume] --> O
    end

    style A fill:#e3f2fd
    style E fill:#f3e5f5
    style F fill:#f3e5f5
    style G fill:#f3e5f5
    style N fill:#e8f5e8
    style O fill:#fff3e0
```
### Docker Resource Management

```mermaid
graph TD
    subgraph "Resource Allocation"
        A[Host Resources] --> B[CPU Cores]
        A --> C[Memory GB]
        A --> D[Disk Space]
        A --> E[Network Bandwidth]

        B --> F[Container Limits]
        C --> F
        D --> F
        E --> F
    end

    subgraph "Container Configuration"
        F --> G[--cpus=4]
        F --> H[--memory=8g]
        F --> I[--shm-size=2g]
        F --> J[Volume Mounts]

        G --> K[Browser Processes]
        H --> L[Browser Memory]
        I --> M[Shared Memory for Browsers]
        J --> N[Config & Cache Storage]
    end

    subgraph "Monitoring & Scaling"
        O[Resource Monitor] --> P[CPU Usage %]
        O --> Q[Memory Usage %]
        O --> R[Request Queue Length]

        P --> S{CPU > 80%?}
        Q --> T{Memory > 90%?}
        R --> U{Queue > 100?}

        S -->|Yes| V[Scale Up]
        T -->|Yes| V
        U -->|Yes| V

        V --> W[Add Container Instance]
        W --> X[Update Load Balancer]
    end

    subgraph "Performance Optimization"
        Y[Browser Pool Tuning] --> Z[Max Pages: 40]
        Y --> AA[Idle TTL: 30min]
        Y --> BB[Concurrency Limits]

        Z --> CC[Memory Efficiency]
        AA --> DD[Resource Cleanup]
        BB --> EE[Throughput Control]
    end

    style A fill:#e1f5fe
    style F fill:#f3e5f5
    style O fill:#e8f5e8
    style Y fill:#fff3e0
```
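The scale-up trigger in the monitoring subgraph is a simple OR over three thresholds (the diagram's example values):

```python
def should_scale_up(cpu_pct: float, mem_pct: float, queue_len: int) -> bool:
    """Trigger values from the diagram: CPU > 80%, memory > 90%, queue > 100."""
    return cpu_pct > 80 or mem_pct > 90 or queue_len > 100
```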
**📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment)
478
docs/md_v2/assets/llm.txt/diagrams/extraction-llm.txt
Normal file
@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture

Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.

### Extraction Strategy Decision Tree

```mermaid
flowchart TD
    A[Content to Extract] --> B{Content Type?}

    B -->|Simple Patterns| C[Common Data Types]
    B -->|Structured HTML| D[Predictable Structure]
    B -->|Complex Content| E[Requires Reasoning]
    B -->|Mixed Content| F[Multiple Data Types]

    C --> C1{Pattern Type?}
    C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
    C1 -->|Custom Patterns| C3[Custom Regex Strategy]
    C1 -->|LLM-Generated| C4[One-time Pattern Generation]

    D --> D1{Selector Type?}
    D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
    D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
    D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]

    E --> E1{LLM Provider?}
    E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
    E1 -->|Local Ollama| E3[Local LLM Strategy]
    E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]

    F --> F1[Multi-Strategy Approach]
    F1 --> F2[1. Regex for Patterns]
    F1 --> F3[2. CSS for Structure]
    F1 --> F4[3. LLM for Complex Analysis]

    C2 --> G[Fast Extraction ⚡]
    C3 --> G
    C4 --> H[Cached Pattern Reuse]

    D2 --> I[Schema-based Extraction 🏗️]
    D3 --> I
    D4 --> J[Generated Schema Cache]

    E2 --> K[Intelligent Parsing 🧠]
    E3 --> K
    E4 --> L[Hybrid Cost-Effective]

    F2 --> M[Comprehensive Results 📊]
    F3 --> M
    F4 --> M

    style G fill:#c8e6c9
    style I fill:#e3f2fd
    style K fill:#fff3e0
    style M fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#e8f5e8
    style L fill:#ffecb3
```
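The decision tree reduces to a lookup; the return values echo the diagram's branches rather than exact class names:

```python
def choose_strategy(content_type: str, pattern_known: bool = True) -> str:
    """Pick an extraction approach following the decision tree above."""
    if content_type == "simple_patterns":
        return "regex" if pattern_known else "llm_generated_regex"
    if content_type == "structured_html":
        return "json_css_or_xpath_schema"   # JsonCss/JsonXPath strategies
    if content_type == "complex":
        return "llm_extraction"
    if content_type == "mixed":
        return "multi_strategy_pipeline"    # regex + CSS + LLM, staged
    raise ValueError(f"unknown content type: {content_type}")
```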
### LLM Extraction Strategy Workflow

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant LLMStrategy
    participant Chunker
    participant LLMProvider
    participant Parser

    User->>Crawler: Configure LLMExtractionStrategy
    User->>Crawler: arun(url, config)

    Crawler->>Crawler: Navigate to URL
    Crawler->>Crawler: Extract content (HTML/Markdown)
    Crawler->>LLMStrategy: Process content

    LLMStrategy->>LLMStrategy: Check content size

    alt Content > chunk_threshold
        LLMStrategy->>Chunker: Split into chunks with overlap
        Chunker-->>LLMStrategy: Return chunks[]

        loop For each chunk
            LLMStrategy->>LLMProvider: Send chunk + schema + instruction
            LLMProvider-->>LLMStrategy: Return structured JSON
        end

        LLMStrategy->>LLMStrategy: Merge chunk results
    else Content <= threshold
        LLMStrategy->>LLMProvider: Send full content + schema
        LLMProvider-->>LLMStrategy: Return structured JSON
    end

    LLMStrategy->>Parser: Validate JSON schema
    Parser-->>LLMStrategy: Validated data

    LLMStrategy->>LLMStrategy: Track token usage
    LLMStrategy-->>Crawler: Return extracted_content

    Crawler-->>User: CrawlResult with JSON data

    User->>LLMStrategy: show_usage()
    LLMStrategy-->>User: Token count & estimated cost
```
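The chunking branch can be sketched as word-based splitting with overlap; the sizes here are illustrative, not the library's defaults:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20):
    """Split into word chunks where each chunk repeats the last `overlap`
    words of the previous one, so entities at chunk edges are not lost."""
    words = text.split()
    step = chunk_size - overlap
    slices = [words[i:i + chunk_size]
              for i in range(0, max(len(words) - overlap, 1), step)]
    return [" ".join(s) for s in slices]

# 250 synthetic "words" -> 3 chunks of at most 100 words each
chunks = chunk_text(" ".join(f"w{i}" for i in range(250)))
```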
### Schema-Based Extraction Architecture

```mermaid
graph TB
    subgraph "Schema Definition"
        A[JSON Schema] --> A1[baseSelector]
        A --> A2["fields[]"]
        A --> A3[nested structures]

        A2 --> A4[CSS/XPath selectors]
        A2 --> A5[Data types: text, html, attribute]
        A2 --> A6[Default values]

        A3 --> A7[nested objects]
        A3 --> A8[nested_list arrays]
        A3 --> A9[simple lists]
    end

    subgraph "Extraction Engine"
        B[HTML Content] --> C[Selector Engine]
        C --> C1[CSS Selector Parser]
        C --> C2[XPath Evaluator]

        C1 --> D[Element Matcher]
        C2 --> D

        D --> E[Type Converter]
        E --> E1[Text Extraction]
        E --> E2[HTML Preservation]
        E --> E3[Attribute Extraction]
        E --> E4[Nested Processing]
    end

    subgraph "Result Processing"
        F[Raw Extracted Data] --> G[Structure Builder]
        G --> G1[Object Construction]
        G --> G2[Array Assembly]
        G --> G3[Type Validation]

        G1 --> H[JSON Output]
        G2 --> H
        G3 --> H
    end

    A --> C
    E --> F
    H --> I[extracted_content]

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#c8e6c9
```
### Automatic Schema Generation Process

```mermaid
stateDiagram-v2
    [*] --> CheckCache

    CheckCache --> CacheHit: Schema exists
    CheckCache --> SamplePage: Schema missing

    CacheHit --> LoadSchema
    LoadSchema --> FastExtraction

    SamplePage --> ExtractHTML: Crawl sample URL
    ExtractHTML --> LLMAnalysis: Send HTML to LLM
    LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
    GenerateSchema --> ValidateSchema: Test generated schema

    ValidateSchema --> SchemaWorks: Valid selectors
    ValidateSchema --> RefineSchema: Invalid selectors

    RefineSchema --> LLMAnalysis: Iterate with feedback

    SchemaWorks --> CacheSchema: Save for reuse
    CacheSchema --> FastExtraction: Use cached schema

    FastExtraction --> [*]: No more LLM calls needed

    note right of CheckCache : One-time LLM cost
    note right of FastExtraction : Unlimited fast reuse
    note right of CacheSchema : JSON file storage
```
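The cache-or-generate loop above in miniature; `generate_schema` stands in for the one-time LLM call, and the schema content is a made-up example:

```python
import json
import os
import tempfile

def get_schema(cache_path: str, generate_schema) -> dict:
    """Return a cached schema if present; otherwise generate and cache it."""
    if os.path.exists(cache_path):               # cache hit: no LLM call
        with open(cache_path) as f:
            return json.load(f)
    schema = generate_schema()                    # one-time LLM cost
    with open(cache_path, "w") as f:              # JSON file storage
        json.dump(schema, f)
    return schema

calls = []
def fake_llm_schema():                            # stand-in for the LLM call
    calls.append(1)
    return {"baseSelector": "div.product",
            "fields": [{"name": "title", "selector": "h2", "type": "text"}]}

path = os.path.join(tempfile.mkdtemp(), "schema.json")
first = get_schema(path, fake_llm_schema)
second = get_schema(path, fake_llm_schema)        # served from cache
```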
### Multi-Strategy Extraction Pipeline

```mermaid
flowchart LR
    A[Web Page Content] --> B[Strategy Pipeline]

    subgraph B["Extraction Pipeline"]
        B1[Stage 1: Regex Patterns]
        B2[Stage 2: Schema-based CSS]
        B3[Stage 3: LLM Analysis]

        B1 --> B1a[Email addresses]
        B1 --> B1b[Phone numbers]
        B1 --> B1c[URLs and links]
        B1 --> B1d[Currency amounts]

        B2 --> B2a[Structured products]
        B2 --> B2b[Article metadata]
        B2 --> B2c[User reviews]
        B2 --> B2d[Navigation links]

        B3 --> B3a[Sentiment analysis]
        B3 --> B3b[Key topics]
        B3 --> B3c[Entity recognition]
        B3 --> B3d[Content summary]
    end

    B1a --> C[Result Merger]
    B1b --> C
    B1c --> C
    B1d --> C

    B2a --> C
    B2b --> C
    B2c --> C
    B2d --> C

    B3a --> C
    B3b --> C
    B3c --> C
    B3d --> C

    C --> D[Combined JSON Output]
    D --> E[Final CrawlResult]

    style B1 fill:#c8e6c9
    style B2 fill:#e3f2fd
    style B3 fill:#fff3e0
    style C fill:#f3e5f5
```
### Performance Comparison Matrix

```mermaid
graph TD
    subgraph "Strategy Performance"
        A[Extraction Strategy Comparison]

        subgraph "Speed ⚡"
            S1[Regex: ~10ms]
            S2[CSS Schema: ~50ms]
            S3[XPath: ~100ms]
            S4[LLM: ~2-10s]
        end

        subgraph "Accuracy 🎯"
            A1[Regex: Pattern-dependent]
            A2[CSS: High for structured]
            A3[XPath: Very high]
            A4[LLM: Excellent for complex]
        end

        subgraph "Cost 💰"
            C1[Regex: Free]
            C2[CSS: Free]
            C3[XPath: Free]
            C4[LLM: $0.001-0.01 per page]
        end

        subgraph "Complexity 🔧"
            X1[Regex: Simple patterns only]
            X2[CSS: Structured HTML]
            X3[XPath: Complex selectors]
            X4[LLM: Any content type]
        end
    end

    style S1 fill:#c8e6c9
    style S2 fill:#e8f5e8
    style S3 fill:#fff3e0
    style S4 fill:#ffcdd2

    style A2 fill:#e8f5e8
    style A3 fill:#c8e6c9
    style A4 fill:#c8e6c9

    style C1 fill:#c8e6c9
    style C2 fill:#c8e6c9
    style C3 fill:#c8e6c9
    style C4 fill:#fff3e0

    style X1 fill:#ffcdd2
    style X2 fill:#e8f5e8
    style X3 fill:#c8e6c9
    style X4 fill:#c8e6c9
```
### Regex Pattern Strategy Flow

```mermaid
flowchart TD
    A[Regex Extraction] --> B{Pattern Source?}

    B -->|Built-in| C[Use Predefined Patterns]
    B -->|Custom| D[Define Custom Regex]
    B -->|LLM-Generated| E[Generate with AI]

    C --> C1[Email Pattern]
    C --> C2[Phone Pattern]
    C --> C3[URL Pattern]
    C --> C4[Currency Pattern]
    C --> C5[Date Pattern]

    D --> D1[Write Custom Regex]
    D --> D2[Test Pattern]
    D --> D3{Pattern Works?}
    D3 -->|No| D1
    D3 -->|Yes| D4[Use Pattern]

    E --> E1[Provide Sample Content]
    E --> E2[LLM Analyzes Content]
    E --> E3[Generate Optimized Regex]
    E --> E4[Cache Pattern for Reuse]

    C1 --> F[Pattern Matching]
    C2 --> F
    C3 --> F
    C4 --> F
    C5 --> F
    D4 --> F
    E4 --> F

    F --> G[Extract Matches]
    G --> H[Group by Pattern Type]
    H --> I[JSON Output with Labels]

    style C fill:#e8f5e8
    style D fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#f3e5f5
```
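Minimal stand-ins for the built-in pattern types named above; the library's real patterns are more robust, so treat these as illustrative only:

```python
import re

# Simplified illustrative patterns -- NOT the library's built-ins.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone": r"\+?\d[\d\s().-]{7,}\d",
    "url": r"https?://[^\s\"'<>]+",
    "currency": r"\$\d+(?:\.\d{2})?",
}

def extract_patterns(text: str) -> dict:
    """Group matches by pattern label, as in the flow's labeled JSON output."""
    return {label: re.findall(rx, text) for label, rx in PATTERNS.items()}

sample = ("Contact sales@example.com or +1 555-123-4567. "
          "Deal: $19.99 at https://example.com/sale")
matches = extract_patterns(sample)
```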
### Complex Schema Structure Visualization

```mermaid
graph TB
    subgraph "E-commerce Schema Example"
        A[Category baseSelector] --> B[Category Fields]
        A --> C[Products nested_list]

        B --> B1[category_name]
        B --> B2[category_id attribute]
        B --> B3[category_url attribute]

        C --> C1[Product baseSelector]
        C1 --> C2[name text]
        C1 --> C3[price text]
        C1 --> C4[Details nested object]
        C1 --> C5[Features list]
        C1 --> C6[Reviews nested_list]

        C4 --> C4a[brand text]
        C4 --> C4b[model text]
        C4 --> C4c[specs html]

        C5 --> C5a[feature text array]

        C6 --> C6a[reviewer text]
        C6 --> C6b[rating attribute]
        C6 --> C6c[comment text]
        C6 --> C6d[date attribute]
    end

    subgraph "JSON Output Structure"
        D[categories array] --> D1[category object]
        D1 --> D2[category_name]
        D1 --> D3[category_id]
        D1 --> D4[products array]

        D4 --> D5[product object]
        D5 --> D6[name, price]
        D5 --> D7[details object]
        D5 --> D8[features array]
        D5 --> D9[reviews array]

        D7 --> D7a[brand, model, specs]
        D8 --> D8a[feature strings]
        D9 --> D9a[review objects]
    end

    A -.-> D
    B1 -.-> D2
    C2 -.-> D6
    C4 -.-> D7
    C5 -.-> D8
    C6 -.-> D9

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style C4 fill:#e8f5e8
    style D fill:#fff3e0
```
### Error Handling and Fallback Strategy
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> PrimaryStrategy
|
||||
|
||||
PrimaryStrategy --> Success: Extraction successful
|
||||
PrimaryStrategy --> ValidationFailed: Invalid data
|
||||
PrimaryStrategy --> ExtractionFailed: No matches found
|
||||
PrimaryStrategy --> TimeoutError: LLM timeout
|
||||
|
||||
ValidationFailed --> FallbackStrategy: Try alternative
|
||||
ExtractionFailed --> FallbackStrategy: Try alternative
|
||||
TimeoutError --> FallbackStrategy: Try alternative
|
||||
|
||||
FallbackStrategy --> FallbackSuccess: Fallback works
|
||||
FallbackStrategy --> FallbackFailed: All strategies failed
|
||||
|
||||
FallbackSuccess --> Success: Return results
|
||||
FallbackFailed --> ErrorReport: Log failure details
|
||||
|
||||
Success --> [*]: Complete
|
||||
ErrorReport --> [*]: Return empty results
|
||||
|
||||
note right of PrimaryStrategy : Try fastest/most accurate first
|
||||
note right of FallbackStrategy : Use simpler but reliable method
|
||||
note left of ErrorReport : Provide debugging information
|
||||
```
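The fallback flow above can be sketched as a simple strategy chain: try each extractor in order, treat a raised exception or failed validation as a signal to fall back, and report accumulated errors if everything fails. The function and strategy names here are illustrative, not Crawl4AI's actual API.

```python
import re

def extract_with_fallback(content, strategies, validate):
    """Try strategies fastest-first; fall back on error or invalid data."""
    errors = []
    for name, extract in strategies:
        try:
            data = extract(content)
            if validate(data):              # ValidationFailed -> try next
                return {"strategy": name, "data": data, "errors": errors}
            errors.append((name, "validation failed"))
        except Exception as exc:            # ExtractionFailed / TimeoutError
            errors.append((name, str(exc)))
    # All strategies failed: report errors, return empty results
    return {"strategy": None, "data": [], "errors": errors}

# Illustrative strategies: a strict regex first, a permissive split second.
def regex_prices(text):
    matches = re.findall(r"\$\d+\.\d{2}", text)
    if not matches:
        raise ValueError("no matches found")
    return matches

def naive_tokens(text):
    return [t for t in text.split() if t.startswith("$")]

result = extract_with_fallback(
    "items cost $5 and $7",                 # no cents, so the regex fails
    [("regex", regex_prices), ("naive", naive_tokens)],
    validate=lambda data: len(data) > 0,
)
```

The chain returns both the winning strategy's output and the failure log for the strategies tried before it, which is what the `ErrorReport` state needs for debugging.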

### Token Usage and Cost Optimization

```mermaid
flowchart TD
    A[LLM Extraction Request] --> B{Content Size Check}

    B -->|Small < 1200 tokens| C[Single LLM Call]
    B -->|Large > 1200 tokens| D[Chunking Strategy]

    C --> C1[Send full content]
    C1 --> C2[Parse JSON response]
    C2 --> C3[Track token usage]

    D --> D1[Split into chunks]
    D1 --> D2[Add overlap between chunks]
    D2 --> D3[Process chunks in parallel]

    D3 --> D4[Chunk 1 → LLM]
    D3 --> D5[Chunk 2 → LLM]
    D3 --> D6[Chunk N → LLM]

    D4 --> D7[Merge results]
    D5 --> D7
    D6 --> D7

    D7 --> D8[Deduplicate data]
    D8 --> D9[Aggregate token usage]

    C3 --> E[Cost Calculation]
    D9 --> E

    E --> F[Usage Report]
    F --> F1[Prompt tokens: X]
    F --> F2[Completion tokens: Y]
    F --> F3[Total cost: $Z]

    style C fill:#c8e6c9
    style D fill:#fff3e0
    style E fill:#e3f2fd
    style F fill:#f3e5f5
```
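The chunking path in the diagram can be approximated in a few lines: split oversized input into overlapping chunks, merge the per-chunk extractions while dropping duplicates introduced by the overlap, and total up the cost. The threshold, overlap, and per-million-token prices below are illustrative placeholders, not Crawl4AI's actual defaults.

```python
def chunk_text(text, chunk_size=1200, overlap=100):
    """Split text into overlapping chunks (characters stand in for tokens)."""
    if len(text) <= chunk_size:
        return [text]
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

def merge_results(per_chunk_results):
    """Merge per-chunk extractions, dropping duplicates from the overlap."""
    seen, merged = set(), []
    for items in per_chunk_results:
        for item in items:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

def estimate_cost(prompt_tokens, completion_tokens,
                  input_price=0.15, output_price=0.60):
    """Dollar cost given per-million-token prices (illustrative rates)."""
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1_000_000

chunks = chunk_text("x" * 3000, chunk_size=1200, overlap=100)
merged = merge_results([["a", "b"], ["b", "c"]])
cost = estimate_cost(1000, 200)
```

Overlap trades a little duplicated token spend for not cutting an entity in half at a chunk boundary; the deduplication step then removes the doubled extractions.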

**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

478
docs/md_v2/assets/llm.txt/diagrams/extraction-no-llm.txt
Normal file
@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture

Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.

### Extraction Strategy Decision Tree

```mermaid
flowchart TD
    A[Content to Extract] --> B{Content Type?}

    B -->|Simple Patterns| C[Common Data Types]
    B -->|Structured HTML| D[Predictable Structure]
    B -->|Complex Content| E[Requires Reasoning]
    B -->|Mixed Content| F[Multiple Data Types]

    C --> C1{Pattern Type?}
    C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
    C1 -->|Custom Patterns| C3[Custom Regex Strategy]
    C1 -->|LLM-Generated| C4[One-time Pattern Generation]

    D --> D1{Selector Type?}
    D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
    D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
    D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]

    E --> E1{LLM Provider?}
    E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
    E1 -->|Local Ollama| E3[Local LLM Strategy]
    E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]

    F --> F1[Multi-Strategy Approach]
    F1 --> F2[1. Regex for Patterns]
    F1 --> F3[2. CSS for Structure]
    F1 --> F4[3. LLM for Complex Analysis]

    C2 --> G[Fast Extraction ⚡]
    C3 --> G
    C4 --> H[Cached Pattern Reuse]

    D2 --> I[Schema-based Extraction 🏗️]
    D3 --> I
    D4 --> J[Generated Schema Cache]

    E2 --> K[Intelligent Parsing 🧠]
    E3 --> K
    E4 --> L[Hybrid Cost-Effective]

    F2 --> M[Comprehensive Results 📊]
    F3 --> M
    F4 --> M

    style G fill:#c8e6c9
    style I fill:#e3f2fd
    style K fill:#fff3e0
    style M fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#e8f5e8
    style L fill:#ffecb3
```

### LLM Extraction Strategy Workflow

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant LLMStrategy
    participant Chunker
    participant LLMProvider
    participant Parser

    User->>Crawler: Configure LLMExtractionStrategy
    User->>Crawler: arun(url, config)

    Crawler->>Crawler: Navigate to URL
    Crawler->>Crawler: Extract content (HTML/Markdown)
    Crawler->>LLMStrategy: Process content

    LLMStrategy->>LLMStrategy: Check content size

    alt Content > chunk_threshold
        LLMStrategy->>Chunker: Split into chunks with overlap
        Chunker-->>LLMStrategy: Return chunks[]

        loop For each chunk
            LLMStrategy->>LLMProvider: Send chunk + schema + instruction
            LLMProvider-->>LLMStrategy: Return structured JSON
        end

        LLMStrategy->>LLMStrategy: Merge chunk results
    else Content <= threshold
        LLMStrategy->>LLMProvider: Send full content + schema
        LLMProvider-->>LLMStrategy: Return structured JSON
    end

    LLMStrategy->>Parser: Validate JSON schema
    Parser-->>LLMStrategy: Validated data

    LLMStrategy->>LLMStrategy: Track token usage
    LLMStrategy-->>Crawler: Return extracted_content

    Crawler-->>User: CrawlResult with JSON data

    User->>LLMStrategy: show_usage()
    LLMStrategy-->>User: Token count & estimated cost
```

### Schema-Based Extraction Architecture

```mermaid
graph TB
    subgraph "Schema Definition"
        A[JSON Schema] --> A1[baseSelector]
        A --> A2[fields[]]
        A --> A3[nested structures]

        A2 --> A4[CSS/XPath selectors]
        A2 --> A5[Data types: text, html, attribute]
        A2 --> A6[Default values]

        A3 --> A7[nested objects]
        A3 --> A8[nested_list arrays]
        A3 --> A9[simple lists]
    end

    subgraph "Extraction Engine"
        B[HTML Content] --> C[Selector Engine]
        C --> C1[CSS Selector Parser]
        C --> C2[XPath Evaluator]

        C1 --> D[Element Matcher]
        C2 --> D

        D --> E[Type Converter]
        E --> E1[Text Extraction]
        E --> E2[HTML Preservation]
        E --> E3[Attribute Extraction]
        E --> E4[Nested Processing]
    end

    subgraph "Result Processing"
        F[Raw Extracted Data] --> G[Structure Builder]
        G --> G1[Object Construction]
        G --> G2[Array Assembly]
        G --> G3[Type Validation]

        G1 --> H[JSON Output]
        G2 --> H
        G3 --> H
    end

    A --> C
    E --> F
    H --> I[extracted_content]

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#c8e6c9
```
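The schema-driven flow above (base selector matches repeating elements, field selectors extract values with defaults) can be illustrated with a deliberately tiny, regex-based stand-in for the real selector engine. Everything here, from the schema shape to the matching logic, is a simplified sketch rather than Crawl4AI's `JsonCssExtractionStrategy` implementation.

```python
import re

def extract_field(html, tag, cls):
    """Toy 'selector engine': text of <tag class=cls> elements."""
    pattern = rf'<{tag}[^>]*class="{cls}"[^>]*>(.*?)</{tag}>'
    return re.findall(pattern, html, re.DOTALL)

def apply_schema(html, schema):
    """Build one JSON object per element matched by the base selector."""
    base_tag, base_cls = schema["baseSelector"]
    blocks = re.findall(
        rf'<{base_tag}[^>]*class="{base_cls}"[^>]*>.*?</{base_tag}>',
        html, re.DOTALL)
    results = []
    for block in blocks:
        item = {}
        for field in schema["fields"]:
            values = extract_field(block, *field["selector"])
            # Missing fields fall back to the schema's default value
            item[field["name"]] = values[0] if values else field.get("default")
        results.append(item)
    return results

html = '''
<div class="product"><span class="name">Widget</span>
  <span class="price">$9.99</span></div>
<div class="product"><span class="name">Gadget</span></div>
'''
schema = {
    "baseSelector": ("div", "product"),
    "fields": [
        {"name": "name", "selector": ("span", "name")},
        {"name": "price", "selector": ("span", "price"), "default": "n/a"},
    ],
}
products = apply_schema(html, schema)
```

The point of the sketch is the shape of the computation: one pass to find base elements, then per-field selector evaluation with type conversion and defaults, yielding a JSON array with one object per matched element.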

### Automatic Schema Generation Process

```mermaid
stateDiagram-v2
    [*] --> CheckCache

    CheckCache --> CacheHit: Schema exists
    CheckCache --> SamplePage: Schema missing

    CacheHit --> LoadSchema
    LoadSchema --> FastExtraction

    SamplePage --> ExtractHTML: Crawl sample URL
    ExtractHTML --> LLMAnalysis: Send HTML to LLM
    LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
    GenerateSchema --> ValidateSchema: Test generated schema

    ValidateSchema --> SchemaWorks: Valid selectors
    ValidateSchema --> RefineSchema: Invalid selectors

    RefineSchema --> LLMAnalysis: Iterate with feedback

    SchemaWorks --> CacheSchema: Save for reuse
    CacheSchema --> FastExtraction: Use cached schema

    FastExtraction --> [*]: No more LLM calls needed

    note right of CheckCache : One-time LLM cost
    note right of FastExtraction : Unlimited fast reuse
    note right of CacheSchema : JSON file storage
```
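The cache-or-generate state machine reduces to a small helper: load the schema from a JSON file if it exists, otherwise call the (expensive) generator once and persist the result. The `fake_llm_generate` stub below stands in for the one-time LLM analysis step; it and the file layout are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def get_schema(cache_path, generate):
    """Load a cached schema if present; otherwise generate once and cache."""
    path = Path(cache_path)
    if path.exists():                        # CacheHit -> LoadSchema
        return json.loads(path.read_text()), True
    schema = generate()                      # SamplePage -> LLMAnalysis -> ...
    path.write_text(json.dumps(schema))      # CacheSchema: JSON file storage
    return schema, False

calls = []
def fake_llm_generate():
    """Stand-in for the one-time LLM schema-generation step."""
    calls.append(1)
    return {"baseSelector": "div.product", "fields": ["name", "price"]}

cache_file = Path(tempfile.mkdtemp()) / "schema.json"
first, hit1 = get_schema(cache_file, fake_llm_generate)
second, hit2 = get_schema(cache_file, fake_llm_generate)
```

The second call never touches the generator, which is exactly the "one-time LLM cost, unlimited fast reuse" property the diagram highlights.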

### Multi-Strategy Extraction Pipeline

```mermaid
flowchart LR
    A[Web Page Content] --> B[Strategy Pipeline]

    subgraph B["Extraction Pipeline"]
        B1[Stage 1: Regex Patterns]
        B2[Stage 2: Schema-based CSS]
        B3[Stage 3: LLM Analysis]

        B1 --> B1a[Email addresses]
        B1 --> B1b[Phone numbers]
        B1 --> B1c[URLs and links]
        B1 --> B1d[Currency amounts]

        B2 --> B2a[Structured products]
        B2 --> B2b[Article metadata]
        B2 --> B2c[User reviews]
        B2 --> B2d[Navigation links]

        B3 --> B3a[Sentiment analysis]
        B3 --> B3b[Key topics]
        B3 --> B3c[Entity recognition]
        B3 --> B3d[Content summary]
    end

    B1a --> C[Result Merger]
    B1b --> C
    B1c --> C
    B1d --> C

    B2a --> C
    B2b --> C
    B2c --> C
    B2d --> C

    B3a --> C
    B3b --> C
    B3c --> C
    B3d --> C

    C --> D[Combined JSON Output]
    D --> E[Final CrawlResult]

    style B1 fill:#c8e6c9
    style B2 fill:#e3f2fd
    style B3 fill:#fff3e0
    style C fill:#f3e5f5
```

### Performance Comparison Matrix

```mermaid
graph TD
    subgraph "Strategy Performance"
        A[Extraction Strategy Comparison]

        subgraph "Speed ⚡"
            S1[Regex: ~10ms]
            S2[CSS Schema: ~50ms]
            S3[XPath: ~100ms]
            S4[LLM: ~2-10s]
        end

        subgraph "Accuracy 🎯"
            A1[Regex: Pattern-dependent]
            A2[CSS: High for structured]
            A3[XPath: Very high]
            A4[LLM: Excellent for complex]
        end

        subgraph "Cost 💰"
            C1[Regex: Free]
            C2[CSS: Free]
            C3[XPath: Free]
            C4[LLM: $0.001-0.01 per page]
        end

        subgraph "Complexity 🔧"
            X1[Regex: Simple patterns only]
            X2[CSS: Structured HTML]
            X3[XPath: Complex selectors]
            X4[LLM: Any content type]
        end
    end

    style S1 fill:#c8e6c9
    style S2 fill:#e8f5e8
    style S3 fill:#fff3e0
    style S4 fill:#ffcdd2

    style A2 fill:#e8f5e8
    style A3 fill:#c8e6c9
    style A4 fill:#c8e6c9

    style C1 fill:#c8e6c9
    style C2 fill:#c8e6c9
    style C3 fill:#c8e6c9
    style C4 fill:#fff3e0

    style X1 fill:#ffcdd2
    style X2 fill:#e8f5e8
    style X3 fill:#c8e6c9
    style X4 fill:#c8e6c9
```
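One way to act on the matrix is to encode it as data and pick the cheapest strategy that can actually handle the content. The figures below are the rough numbers from the matrix, not measurements, and the chooser is an illustrative heuristic.

```python
# Rough figures from the matrix above; real numbers vary by page and model.
STRATEGIES = {
    "regex": {"latency_ms": 10,   "cost_per_page": 0.0,   "handles_unstructured": False},
    "css":   {"latency_ms": 50,   "cost_per_page": 0.0,   "handles_unstructured": False},
    "xpath": {"latency_ms": 100,  "cost_per_page": 0.0,   "handles_unstructured": False},
    "llm":   {"latency_ms": 5000, "cost_per_page": 0.005, "handles_unstructured": True},
}

def choose_strategy(needs_reasoning, budget_per_page):
    """Pick the fastest strategy that can handle the content within budget."""
    candidates = [
        (name, info) for name, info in STRATEGIES.items()
        if (info["handles_unstructured"] or not needs_reasoning)
        and info["cost_per_page"] <= budget_per_page
    ]
    if not candidates:
        return None          # nothing fits: relax the budget or requirements
    return min(candidates, key=lambda kv: kv[1]["latency_ms"])[0]
```

For structured pages with zero budget this lands on regex; content that needs reasoning forces the LLM row, and an LLM with no budget yields no viable strategy.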

### Regex Pattern Strategy Flow

```mermaid
flowchart TD
    A[Regex Extraction] --> B{Pattern Source?}

    B -->|Built-in| C[Use Predefined Patterns]
    B -->|Custom| D[Define Custom Regex]
    B -->|LLM-Generated| E[Generate with AI]

    C --> C1[Email Pattern]
    C --> C2[Phone Pattern]
    C --> C3[URL Pattern]
    C --> C4[Currency Pattern]
    C --> C5[Date Pattern]

    D --> D1[Write Custom Regex]
    D --> D2[Test Pattern]
    D --> D3{Pattern Works?}
    D3 -->|No| D1
    D3 -->|Yes| D4[Use Pattern]

    E --> E1[Provide Sample Content]
    E --> E2[LLM Analyzes Content]
    E --> E3[Generate Optimized Regex]
    E --> E4[Cache Pattern for Reuse]

    C1 --> F[Pattern Matching]
    C2 --> F
    C3 --> F
    C4 --> F
    C5 --> F
    D4 --> F
    E4 --> F

    F --> G[Extract Matches]
    G --> H[Group by Pattern Type]
    H --> I[JSON Output with Labels]

    style C fill:#e8f5e8
    style D fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#f3e5f5
```
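The "match, then group by pattern type, then emit labeled JSON" pipeline is easy to show directly. These patterns are simplified stand-ins for the built-in ones; production-grade email/URL regexes are considerably stricter.

```python
import re

# Simplified stand-ins for built-in patterns; real ones are stricter.
PATTERNS = {
    "email":    r"[\w.+-]+@[\w-]+\.[\w.]+",
    "url":      r"https?://[^\s\"']+",
    "currency": r"\$\d+(?:\.\d{2})?",
}

def extract_patterns(text, patterns=PATTERNS):
    """Run every labeled pattern and group matches by label."""
    return {label: re.findall(rx, text) for label, rx in patterns.items()}

found = extract_patterns(
    "Contact sales@example.com or visit https://example.com - plans from $9.99"
)
```

Because the output is keyed by pattern label, it drops straight into the "JSON Output with Labels" node of the flow above.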

### Complex Schema Structure Visualization

```mermaid
graph TB
    subgraph "E-commerce Schema Example"
        A[Category baseSelector] --> B[Category Fields]
        A --> C[Products nested_list]

        B --> B1[category_name]
        B --> B2[category_id attribute]
        B --> B3[category_url attribute]

        C --> C1[Product baseSelector]
        C1 --> C2[name text]
        C1 --> C3[price text]
        C1 --> C4[Details nested object]
        C1 --> C5[Features list]
        C1 --> C6[Reviews nested_list]

        C4 --> C4a[brand text]
        C4 --> C4b[model text]
        C4 --> C4c[specs html]

        C5 --> C5a[feature text array]

        C6 --> C6a[reviewer text]
        C6 --> C6b[rating attribute]
        C6 --> C6c[comment text]
        C6 --> C6d[date attribute]
    end

    subgraph "JSON Output Structure"
        D[categories array] --> D1[category object]
        D1 --> D2[category_name]
        D1 --> D3[category_id]
        D1 --> D4[products array]

        D4 --> D5[product object]
        D5 --> D6[name, price]
        D5 --> D7[details object]
        D5 --> D8[features array]
        D5 --> D9[reviews array]

        D7 --> D7a[brand, model, specs]
        D8 --> D8a[feature strings]
        D9 --> D9a[review objects]
    end

    A -.-> D
    B1 -.-> D2
    C2 -.-> D6
    C4 -.-> D7
    C5 -.-> D8
    C6 -.-> D9

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style C4 fill:#e8f5e8
    style D fill:#fff3e0
```

### Error Handling and Fallback Strategy

```mermaid
stateDiagram-v2
    [*] --> PrimaryStrategy

    PrimaryStrategy --> Success: Extraction successful
    PrimaryStrategy --> ValidationFailed: Invalid data
    PrimaryStrategy --> ExtractionFailed: No matches found
    PrimaryStrategy --> TimeoutError: LLM timeout

    ValidationFailed --> FallbackStrategy: Try alternative
    ExtractionFailed --> FallbackStrategy: Try alternative
    TimeoutError --> FallbackStrategy: Try alternative

    FallbackStrategy --> FallbackSuccess: Fallback works
    FallbackStrategy --> FallbackFailed: All strategies failed

    FallbackSuccess --> Success: Return results
    FallbackFailed --> ErrorReport: Log failure details

    Success --> [*]: Complete
    ErrorReport --> [*]: Return empty results

    note right of PrimaryStrategy : Try fastest/most accurate first
    note right of FallbackStrategy : Use simpler but reliable method
    note left of ErrorReport : Provide debugging information
```

### Token Usage and Cost Optimization

```mermaid
flowchart TD
    A[LLM Extraction Request] --> B{Content Size Check}

    B -->|Small < 1200 tokens| C[Single LLM Call]
    B -->|Large > 1200 tokens| D[Chunking Strategy]

    C --> C1[Send full content]
    C1 --> C2[Parse JSON response]
    C2 --> C3[Track token usage]

    D --> D1[Split into chunks]
    D1 --> D2[Add overlap between chunks]
    D2 --> D3[Process chunks in parallel]

    D3 --> D4[Chunk 1 → LLM]
    D3 --> D5[Chunk 2 → LLM]
    D3 --> D6[Chunk N → LLM]

    D4 --> D7[Merge results]
    D5 --> D7
    D6 --> D7

    D7 --> D8[Deduplicate data]
    D8 --> D9[Aggregate token usage]

    C3 --> E[Cost Calculation]
    D9 --> E

    E --> F[Usage Report]
    F --> F1[Prompt tokens: X]
    F --> F2[Completion tokens: Y]
    F --> F3[Total cost: $Z]

    style C fill:#c8e6c9
    style D fill:#fff3e0
    style E fill:#e3f2fd
    style F fill:#f3e5f5
```

**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

@@ -0,0 +1,472 @@
## HTTP Crawler Strategy Workflows

Visual representations of HTTP-based crawling architecture, request flows, and performance characteristics compared to browser-based strategies.

### HTTP vs Browser Strategy Decision Tree

```mermaid
flowchart TD
    A[Content Crawling Need] --> B{Content Type Analysis}

    B -->|Static HTML| C{JavaScript Required?}
    B -->|Dynamic SPA| D[Browser Strategy Required]
    B -->|API Endpoints| E[HTTP Strategy Optimal]
    B -->|Mixed Content| F{Primary Content Source?}

    C -->|No JS Needed| G[HTTP Strategy Recommended]
    C -->|JS Required| H[Browser Strategy Required]
    C -->|Unknown| I{Performance Priority?}

    I -->|Speed Critical| J[Try HTTP First]
    I -->|Accuracy Critical| K[Use Browser Strategy]

    F -->|Mostly Static| G
    F -->|Mostly Dynamic| D

    G --> L{Resource Constraints?}
    L -->|Memory Limited| M[HTTP Strategy - Lightweight]
    L -->|CPU Limited| N[HTTP Strategy - No Browser]
    L -->|Network Limited| O[HTTP Strategy - Efficient]
    L -->|No Constraints| P[Either Strategy Works]

    J --> Q[Test HTTP Results]
    Q --> R{Content Complete?}
    R -->|Yes| S[Continue with HTTP]
    R -->|No| T[Switch to Browser Strategy]

    D --> U[Browser Strategy Features]
    H --> U
    K --> U
    T --> U

    U --> V[JavaScript Execution]
    U --> W[Screenshots/PDFs]
    U --> X[Complex Interactions]
    U --> Y[Session Management]

    M --> Z[HTTP Strategy Benefits]
    N --> Z
    O --> Z
    S --> Z

    Z --> AA[10x Faster Processing]
    Z --> BB[Lower Memory Usage]
    Z --> CC[Higher Concurrency]
    Z --> DD[Simpler Deployment]

    style G fill:#c8e6c9
    style M fill:#c8e6c9
    style N fill:#c8e6c9
    style O fill:#c8e6c9
    style S fill:#c8e6c9
    style D fill:#e3f2fd
    style H fill:#e3f2fd
    style K fill:#e3f2fd
    style T fill:#e3f2fd
    style U fill:#e3f2fd
```
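The top of the decision tree condenses into a tiny helper: a browser is required only when its features are, and everything else defaults to the lighter HTTP path. The function and its categories are an illustrative distillation, not an API Crawl4AI ships.

```python
def pick_crawler_strategy(content_type, needs_js=False, needs_screenshots=False):
    """Condense the decision tree above (illustrative helper)."""
    if content_type == "api":
        return "http"                    # API endpoints: HTTP is optimal
    if content_type == "spa" or needs_js or needs_screenshots:
        return "browser"                 # dynamic pages need a real browser
    return "http"                        # static HTML: faster and lighter

strategy = pick_crawler_strategy("static")
```

The "try HTTP first, switch if content is incomplete" branch is an additional runtime check layered on top of this static decision.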

### HTTP Request Lifecycle Sequence

```mermaid
sequenceDiagram
    participant Client
    participant HTTPStrategy as HTTP Strategy
    participant Session as HTTP Session
    participant Server as Target Server
    participant Processor as Content Processor

    Client->>HTTPStrategy: crawl(url, config)
    HTTPStrategy->>HTTPStrategy: validate_url()

    alt URL Type Check
        HTTPStrategy->>HTTPStrategy: handle_file_url()
        Note over HTTPStrategy: file:// URLs
    else
        HTTPStrategy->>HTTPStrategy: handle_raw_content()
        Note over HTTPStrategy: raw:// content
    else
        HTTPStrategy->>Session: prepare_request()
        Session->>Session: apply_config()
        Session->>Session: set_headers()
        Session->>Session: setup_auth()

        Session->>Server: HTTP Request
        Note over Session,Server: GET/POST/PUT with headers

        alt Success Response
            Server-->>Session: HTTP 200 + Content
            Session-->>HTTPStrategy: response_data
        else Redirect Response
            Server-->>Session: HTTP 3xx + Location
            Session->>Server: Follow redirect
            Server-->>Session: HTTP 200 + Content
            Session-->>HTTPStrategy: final_response
        else Error Response
            Server-->>Session: HTTP 4xx/5xx
            Session-->>HTTPStrategy: error_response
        end
    end

    HTTPStrategy->>Processor: process_content()
    Processor->>Processor: clean_html()
    Processor->>Processor: extract_metadata()
    Processor->>Processor: generate_markdown()
    Processor-->>HTTPStrategy: processed_result

    HTTPStrategy-->>Client: CrawlResult

    Note over Client,Processor: Fast, lightweight processing
    Note over HTTPStrategy: No browser overhead
```

### HTTP Strategy Architecture

```mermaid
graph TB
    subgraph "HTTP Crawler Strategy"
        A[AsyncHTTPCrawlerStrategy] --> B[Session Manager]
        A --> C[Request Builder]
        A --> D[Response Handler]
        A --> E[Error Manager]

        B --> B1[Connection Pool]
        B --> B2[DNS Cache]
        B --> B3[SSL Context]

        C --> C1[Headers Builder]
        C --> C2[Auth Handler]
        C --> C3[Payload Encoder]

        D --> D1[Content Decoder]
        D --> D2[Redirect Handler]
        D --> D3[Status Validator]

        E --> E1[Retry Logic]
        E --> E2[Timeout Handler]
        E --> E3[Exception Mapper]
    end

    subgraph "Content Processing"
        F[Raw HTML] --> G[HTML Cleaner]
        G --> H[Markdown Generator]
        H --> I[Link Extractor]
        I --> J[Media Extractor]
        J --> K[Metadata Parser]
    end

    subgraph "External Resources"
        L[Target Websites]
        M[Local Files]
        N[Raw Content]
    end

    subgraph "Output"
        O[CrawlResult]
        O --> O1[HTML Content]
        O --> O2[Markdown Text]
        O --> O3[Extracted Links]
        O --> O4[Media References]
        O --> O5[Status Information]
    end

    A --> F
    L --> A
    M --> A
    N --> A
    K --> O

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style F fill:#e8f5e8
    style O fill:#fff3e0
```

### Performance Comparison Flow

```mermaid
graph LR
    subgraph "HTTP Strategy Performance"
        A1[Request Start] --> A2[DNS Lookup: 50ms]
        A2 --> A3[TCP Connect: 100ms]
        A3 --> A4[HTTP Request: 200ms]
        A4 --> A5[Content Download: 300ms]
        A5 --> A6[Processing: 50ms]
        A6 --> A7[Total: ~700ms]
    end

    subgraph "Browser Strategy Performance"
        B1[Request Start] --> B2[Browser Launch: 2000ms]
        B2 --> B3[Page Navigation: 1000ms]
        B3 --> B4[JS Execution: 500ms]
        B4 --> B5[Content Rendering: 300ms]
        B5 --> B6[Processing: 100ms]
        B6 --> B7[Total: ~3900ms]
    end

    subgraph "Resource Usage"
        C1[HTTP Memory: ~50MB]
        C2[Browser Memory: ~500MB]
        C3[HTTP CPU: Low]
        C4[Browser CPU: High]
        C5[HTTP Concurrency: 100+]
        C6[Browser Concurrency: 10-20]
    end

    A7 --> D[5.5x Faster]
    B7 --> D
    C1 --> E[10x Less Memory]
    C2 --> E
    C5 --> F[5x More Concurrent]
    C6 --> F

    style A7 fill:#c8e6c9
    style B7 fill:#ffcdd2
    style C1 fill:#c8e6c9
    style C2 fill:#ffcdd2
    style C5 fill:#c8e6c9
    style C6 fill:#ffcdd2
```

### HTTP Request Types and Configuration

```mermaid
stateDiagram-v2
    [*] --> HTTPConfigSetup

    HTTPConfigSetup --> MethodSelection

    MethodSelection --> GET: Simple data retrieval
    MethodSelection --> POST: Form submission
    MethodSelection --> PUT: Data upload
    MethodSelection --> DELETE: Resource removal

    GET --> HeaderSetup: Set Accept headers
    POST --> PayloadSetup: JSON or form data
    PUT --> PayloadSetup: File or data upload
    DELETE --> AuthSetup: Authentication required

    PayloadSetup --> JSONPayload: application/json
    PayloadSetup --> FormPayload: form-data
    PayloadSetup --> RawPayload: custom content

    JSONPayload --> HeaderSetup
    FormPayload --> HeaderSetup
    RawPayload --> HeaderSetup

    HeaderSetup --> AuthSetup
    AuthSetup --> SSLSetup
    SSLSetup --> RedirectSetup
    RedirectSetup --> RequestExecution

    RequestExecution --> [*]: Request complete

    note right of GET : Default method for most crawling
    note right of POST : API interactions, form submissions
    note right of JSONPayload : Structured data transmission
    note right of HeaderSetup : User-Agent, Accept, Custom headers
```

### Error Handling and Retry Workflow

```mermaid
flowchart TD
    A[HTTP Request] --> B{Response Received?}

    B -->|No| C[Connection Error]
    B -->|Yes| D{Status Code Check}

    C --> C1{Timeout Error?}
    C1 -->|Yes| C2[ConnectionTimeoutError]
    C1 -->|No| C3[Network Error]

    D -->|2xx| E[Success Response]
    D -->|3xx| F[Redirect Response]
    D -->|4xx| G[Client Error]
    D -->|5xx| H[Server Error]

    F --> F1{Follow Redirects?}
    F1 -->|Yes| F2[Follow Redirect]
    F1 -->|No| F3[Return Redirect Response]
    F2 --> A

    G --> G1{Retry on 4xx?}
    G1 -->|No| G2[HTTPStatusError]
    G1 -->|Yes| I[Check Retry Count]

    H --> H1{Retry on 5xx?}
    H1 -->|Yes| I
    H1 -->|No| H2[HTTPStatusError]

    C2 --> I
    C3 --> I

    I --> J{Retries < Max?}
    J -->|No| K[Final Error]
    J -->|Yes| L[Calculate Backoff]

    L --> M[Wait Backoff Time]
    M --> N[Increment Retry Count]
    N --> A

    E --> O[Process Content]
    F3 --> O
    O --> P[Return CrawlResult]

    G2 --> Q[Error CrawlResult]
    H2 --> Q
    K --> Q

    style E fill:#c8e6c9
    style P fill:#c8e6c9
    style G2 fill:#ffcdd2
    style H2 fill:#ffcdd2
    style K fill:#ffcdd2
    style Q fill:#ffcdd2
```
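The retry loop above can be sketched as a small function: retry connection errors and retryable status codes with exponential backoff, fail fast on non-retryable 4xx, and give up after a maximum number of attempts. The retry-status set and delays are illustrative defaults, and `sleep` is injectable so the sketch can be exercised without actually waiting.

```python
def fetch_with_retry(request, max_retries=3, base_delay=0.5,
                     sleep=lambda s: None,
                     retry_statuses=(500, 502, 503, 504)):
    """Retry transient failures with exponential backoff.

    `request()` returns a status code or raises ConnectionError on
    network failure; `sleep` is injectable for testing.
    """
    attempt = 0
    while True:
        try:
            status = request()
            if status < 400:
                return status                        # success: process content
            if status not in retry_statuses:
                raise RuntimeError(f"HTTP {status}")  # 4xx: no retry
        except ConnectionError:
            pass                                     # network error: retry
        if attempt >= max_retries:
            raise RuntimeError("max retries exceeded")
        sleep(base_delay * (2 ** attempt))           # 0.5s, 1s, 2s, ...
        attempt += 1

# Simulated flaky server: two 503s, then success.
responses = iter([503, 503, 200])
delays = []
status = fetch_with_retry(lambda: next(responses), sleep=delays.append)
```

Doubling the delay each attempt gives misbehaving servers room to recover while keeping the worst-case wait bounded by `max_retries`.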

### Batch Processing Architecture

```mermaid
sequenceDiagram
    participant Client
    participant BatchManager as Batch Manager
    participant HTTPPool as Connection Pool
    participant Workers as HTTP Workers
    participant Targets as Target Servers

    Client->>BatchManager: batch_crawl(urls)
    BatchManager->>BatchManager: create_semaphore(max_concurrent)

    loop For each URL batch
        BatchManager->>HTTPPool: acquire_connection()
        HTTPPool->>Workers: assign_worker()

        par Concurrent Processing
            Workers->>Targets: HTTP Request 1
            Workers->>Targets: HTTP Request 2
            Workers->>Targets: HTTP Request N
        end

        par Response Handling
            Targets-->>Workers: Response 1
            Targets-->>Workers: Response 2
            Targets-->>Workers: Response N
        end

        Workers->>HTTPPool: return_connection()
        HTTPPool->>BatchManager: batch_results()
    end

    BatchManager->>BatchManager: aggregate_results()
    BatchManager-->>Client: final_results()

    Note over Workers,Targets: 20-100 concurrent connections
    Note over BatchManager: Memory-efficient processing
    Note over HTTPPool: Connection reuse optimization
```
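The semaphore-bounded fan-out in the sequence above maps directly onto `asyncio.Semaphore` plus `asyncio.gather`. This sketch substitutes a fake fetch for real HTTP I/O and tracks peak concurrency to show the bound holding; a real worker would issue requests through a shared connection pool.

```python
import asyncio

async def batch_crawl(urls, fetch, max_concurrent=20):
    """Crawl many URLs with bounded concurrency, as in the diagram."""
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def worker(url):
        nonlocal active, peak
        async with semaphore:              # acquire_connection()
            active += 1
            peak = max(peak, active)
            try:
                return await fetch(url)
            finally:
                active -= 1                # return_connection()

    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak

async def fake_fetch(url):
    await asyncio.sleep(0)                 # yield control; no real I/O
    return f"content of {url}"

urls = [f"https://example.com/{i}" for i in range(10)]
results, peak = asyncio.run(batch_crawl(urls, fake_fetch, max_concurrent=3))
```

`gather` preserves input order in its results, so aggregation is trivial, while the semaphore keeps no more than `max_concurrent` requests in flight regardless of how many URLs are queued.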

### Content Type Processing Pipeline

```mermaid
graph TD
    A[HTTP Response] --> B{Content-Type Detection}

    B -->|text/html| C[HTML Processing]
    B -->|application/json| D[JSON Processing]
    B -->|text/plain| E[Text Processing]
    B -->|application/xml| F[XML Processing]
    B -->|Other| G[Binary Processing]

    C --> C1[Parse HTML Structure]
    C1 --> C2[Extract Text Content]
    C2 --> C3[Generate Markdown]
    C3 --> C4[Extract Links/Media]

    D --> D1[Parse JSON Structure]
    D1 --> D2[Extract Data Fields]
    D2 --> D3[Format as Readable Text]

    E --> E1[Clean Text Content]
    E1 --> E2[Basic Formatting]

    F --> F1[Parse XML Structure]
    F1 --> F2[Extract Text Nodes]
    F2 --> F3[Convert to Markdown]

    G --> G1[Save Binary Content]
    G1 --> G2[Generate Metadata]

    C4 --> H[Content Analysis]
    D3 --> H
    E2 --> H
    F3 --> H
    G2 --> H

    H --> I[Link Extraction]
    H --> J[Media Detection]
    H --> K[Metadata Parsing]

    I --> L[CrawlResult Assembly]
    J --> L
    K --> L

    L --> M[Final Output]

    style C fill:#e8f5e8
    style H fill:#fff3e0
    style L fill:#e3f2fd
    style M fill:#c8e6c9
```
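The Content-Type branch point reduces to a dispatch on the header's main type (charset parameters stripped off), with a binary catch-all. Each branch here only tags and lightly transforms the body; the real pipeline's per-type steps (markdown generation, link extraction, and so on) are elided.

```python
import json

def process_response(content_type, body):
    """Dispatch on Content-Type, mirroring the pipeline above (simplified)."""
    main_type = content_type.split(";")[0].strip().lower()
    if main_type == "text/html":
        # Real pipeline: parse structure, generate markdown, extract links.
        return {"kind": "html", "text": body}
    if main_type == "application/json":
        return {"kind": "json", "data": json.loads(body)}
    if main_type == "text/plain":
        return {"kind": "text", "text": body.strip()}
    if main_type == "application/xml":
        return {"kind": "xml", "text": body}
    return {"kind": "binary", "size": len(body)}    # save content + metadata

html_result = process_response("text/html; charset=utf-8", "<h1>Hi</h1>")
json_result = process_response("application/json", '{"a": 1}')
bin_result = process_response("application/octet-stream", b"\x00\x01")
```

Splitting on `;` first matters: headers like `text/html; charset=utf-8` would otherwise fall through to the binary branch.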

### Integration with Processing Strategies

```mermaid
graph LR
    subgraph "HTTP Strategy Core"
        A[HTTP Request] --> B[Raw Content]
        B --> C[Content Decoder]
    end

    subgraph "Processing Pipeline"
        C --> D[HTML Cleaner]
        D --> E[Markdown Generator]
        E --> F{Content Filter?}

        F -->|Yes| G[Pruning Filter]
        F -->|Yes| H[BM25 Filter]
        F -->|No| I[Raw Markdown]

        G --> J[Fit Markdown]
        H --> J
    end

    subgraph "Extraction Strategies"
        I --> K[CSS Extraction]
        J --> K
        I --> L[XPath Extraction]
        J --> L
        I --> M[LLM Extraction]
        J --> M
    end

    subgraph "Output Generation"
        K --> N[Structured JSON]
        L --> N
        M --> N

        I --> O[Clean Markdown]
        J --> P[Filtered Content]

        N --> Q[Final CrawlResult]
        O --> Q
        P --> Q
    end

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style E fill:#e8f5e8
    style Q fill:#c8e6c9
```

**📖 Learn more:** [HTTP vs Browser Strategies](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Error Handling](https://docs.crawl4ai.com/api/async-webcrawler/)

368
docs/md_v2/assets/llm.txt/diagrams/installation.txt
Normal file
@@ -0,0 +1,368 @@

## Installation Workflows and Architecture

Visual representations of Crawl4AI installation processes, deployment options, and system interactions.

### Installation Decision Flow

```mermaid
flowchart TD
    A[Start Installation] --> B{Environment Type?}

    B -->|Local Development| C[Basic Python Install]
    B -->|Production| D[Docker Deployment]
    B -->|Research/Testing| E[Google Colab]
    B -->|CI/CD Pipeline| F[Automated Setup]

    C --> C1[pip install crawl4ai]
    C1 --> C2[crawl4ai-setup]
    C2 --> C3{Need Advanced Features?}

    C3 -->|No| C4[Basic Installation Complete]
    C3 -->|Text Clustering| C5[pip install crawl4ai with torch]
    C3 -->|Transformers| C6[pip install crawl4ai with transformer]
    C3 -->|All Features| C7[pip install crawl4ai with all]

    C5 --> C8[crawl4ai-download-models]
    C6 --> C8
    C7 --> C8
    C8 --> C9[Advanced Installation Complete]

    D --> D1{Deployment Method?}
    D1 -->|Pre-built Image| D2[docker pull unclecode/crawl4ai]
    D1 -->|Docker Compose| D3[Clone repo + docker compose]
    D1 -->|Custom Build| D4[docker buildx build]

    D2 --> D5[Configure .llm.env]
    D3 --> D5
    D4 --> D5
    D5 --> D6[docker run with ports]
    D6 --> D7[Docker Deployment Complete]

    E --> E1[Colab pip install]
    E1 --> E2[playwright install chromium]
    E2 --> E3[Test basic crawl]
    E3 --> E4[Colab Setup Complete]

    F --> F1[Automated pip install]
    F1 --> F2[Automated setup scripts]
    F2 --> F3[CI/CD Integration Complete]

    C4 --> G[Verify with crawl4ai-doctor]
    C9 --> G
    D7 --> H[Health check via API]
    E4 --> I[Run test crawl]
    F3 --> G

    G --> J[Installation Verified]
    H --> J
    I --> J

    style A fill:#e1f5fe
    style J fill:#c8e6c9
    style C4 fill:#fff3e0
    style C9 fill:#fff3e0
    style D7 fill:#f3e5f5
    style E4 fill:#fce4ec
    style F3 fill:#e8f5e8
```

### Basic Installation Sequence

```mermaid
sequenceDiagram
    participant User
    participant PyPI
    participant System
    participant Playwright
    participant Crawler

    User->>PyPI: pip install crawl4ai
    PyPI-->>User: Package downloaded

    User->>System: crawl4ai-setup
    System->>Playwright: Install browser binaries
    Playwright-->>System: Chromium, Firefox installed
    System-->>User: Setup complete

    User->>System: crawl4ai-doctor
    System->>System: Check Python version
    System->>System: Verify Playwright installation
    System->>System: Test browser launch
    System-->>User: Diagnostics report

    User->>Crawler: Basic crawl test
    Crawler->>Playwright: Launch browser
    Playwright-->>Crawler: Browser ready
    Crawler->>Crawler: Navigate to test URL
    Crawler-->>User: Success confirmation
```

### Docker Deployment Architecture

```mermaid
graph TB
    subgraph "Host System"
        A[Docker Engine] --> B[Crawl4AI Container]
        C[.llm.env File] --> B
        D[Port 11235] --> B
    end

    subgraph "Container Environment"
        B --> E[FastAPI Server]
        B --> F[Playwright Browsers]
        B --> G[Python Runtime]

        E --> H[/crawl Endpoint]
        E --> I[/playground Interface]
        E --> J[/health Monitoring]
        E --> K[/metrics Prometheus]

        F --> L[Chromium Browser]
        F --> M[Firefox Browser]
        F --> N[WebKit Browser]
    end

    subgraph "External Services"
        O[OpenAI API] --> B
        P[Anthropic API] --> B
        Q[Local LLM Ollama] --> B
    end

    subgraph "Client Applications"
        R[Python SDK] --> H
        S[REST API Calls] --> H
        T[Web Browser] --> I
        U[Monitoring Tools] --> J
        V[Prometheus] --> K
    end

    style B fill:#e3f2fd
    style E fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```

### Advanced Features Installation Flow

```mermaid
stateDiagram-v2
    [*] --> BasicInstall

    BasicInstall --> FeatureChoice: crawl4ai installed

    FeatureChoice --> TorchInstall: Need text clustering
    FeatureChoice --> TransformerInstall: Need HuggingFace models
    FeatureChoice --> AllInstall: Need everything
    FeatureChoice --> Complete: Basic features sufficient

    TorchInstall --> TorchSetup: pip install crawl4ai with torch
    TransformerInstall --> TransformerSetup: pip install crawl4ai with transformer
    AllInstall --> AllSetup: pip install crawl4ai with all

    TorchSetup --> ModelDownload: crawl4ai-setup
    TransformerSetup --> ModelDownload: crawl4ai-setup
    AllSetup --> ModelDownload: crawl4ai-setup

    ModelDownload --> PreDownload: crawl4ai-download-models
    PreDownload --> Complete: All models cached

    Complete --> Verification: crawl4ai-doctor
    Verification --> [*]: Installation verified

    note right of TorchInstall: PyTorch for semantic operations
    note right of TransformerInstall: HuggingFace for LLM features
    note right of AllInstall: Complete feature set
```

### Platform-Specific Installation Matrix

```mermaid
graph LR
    subgraph "Installation Methods"
        A[Python Package] --> A1[pip install]
        B[Docker Image] --> B1[docker pull]
        C[Source Build] --> C1[git clone + build]
        D[Cloud Platform] --> D1[Colab/Kaggle]
    end

    subgraph "Operating Systems"
        E[Linux x86_64]
        F[Linux ARM64]
        G[macOS Intel]
        H[macOS Apple Silicon]
        I[Windows x86_64]
    end

    subgraph "Feature Sets"
        J[Basic crawling]
        K[Text clustering torch]
        L[LLM transformers]
        M[All features]
    end

    A1 --> E
    A1 --> F
    A1 --> G
    A1 --> H
    A1 --> I

    B1 --> E
    B1 --> F
    B1 --> G
    B1 --> H

    C1 --> E
    C1 --> F
    C1 --> G
    C1 --> H
    C1 --> I

    D1 --> E
    D1 --> I

    E --> J
    E --> K
    E --> L
    E --> M

    F --> J
    F --> K
    F --> L
    F --> M

    G --> J
    G --> K
    G --> L
    G --> M

    H --> J
    H --> K
    H --> L
    H --> M

    I --> J
    I --> K
    I --> L
    I --> M

    style A1 fill:#e3f2fd
    style B1 fill:#f3e5f5
    style C1 fill:#e8f5e8
    style D1 fill:#fff3e0
```

### Docker Multi-Stage Build Process

```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Git as GitHub Repo
    participant Docker as Docker Engine
    participant Registry as Docker Hub
    participant User as End User

    Dev->>Git: Push code changes

    Docker->>Git: Clone repository
    Docker->>Docker: Stage 1 - Base Python image
    Docker->>Docker: Stage 2 - Install dependencies
    Docker->>Docker: Stage 3 - Install Playwright
    Docker->>Docker: Stage 4 - Copy application code
    Docker->>Docker: Stage 5 - Setup FastAPI server

    Note over Docker: Multi-architecture build
    Docker->>Docker: Build for linux/amd64
    Docker->>Docker: Build for linux/arm64

    Docker->>Registry: Push multi-arch manifest
    Registry-->>Docker: Build complete

    User->>Registry: docker pull unclecode/crawl4ai
    Registry-->>User: Download appropriate architecture

    User->>Docker: docker run with configuration
    Docker->>Docker: Start container
    Docker->>Docker: Initialize FastAPI server
    Docker->>Docker: Setup Playwright browsers
    Docker-->>User: Service ready on port 11235
```

### Installation Verification Workflow

```mermaid
flowchart TD
    A[Installation Complete] --> B[Run crawl4ai-doctor]

    B --> C{Python Version Check}
    C -->|✓ 3.10+| D{Playwright Check}
    C -->|✗ < 3.10| C1[Upgrade Python]
    C1 --> D

    D -->|✓ Installed| E{Browser Binaries}
    D -->|✗ Missing| D1[Run crawl4ai-setup]
    D1 --> E

    E -->|✓ Available| F{Test Browser Launch}
    E -->|✗ Missing| E1[playwright install]
    E1 --> F

    F -->|✓ Success| G[Test Basic Crawl]
    F -->|✗ Failed| F1[Check system dependencies]
    F1 --> F

    G --> H{Crawl Test Result}
    H -->|✓ Success| I[Installation Verified ✓]
    H -->|✗ Failed| H1[Check network/permissions]
    H1 --> G

    I --> J[Ready for Production Use]

    style I fill:#c8e6c9
    style J fill:#e8f5e8
    style C1 fill:#ffcdd2
    style D1 fill:#fff3e0
    style E1 fill:#fff3e0
    style F1 fill:#ffcdd2
    style H1 fill:#ffcdd2
```

### Resource Requirements by Installation Type

```mermaid
graph TD
    subgraph "Basic Installation"
        A1[Memory: 512MB]
        A2[Disk: 2GB]
        A3[CPU: 1 core]
        A4[Network: Required for setup]
    end

    subgraph "Advanced Features torch"
        B1[Memory: 2GB+]
        B2[Disk: 5GB+]
        B3[CPU: 2+ cores]
        B4[GPU: Optional CUDA]
    end

    subgraph "All Features"
        C1[Memory: 4GB+]
        C2[Disk: 10GB+]
        C3[CPU: 4+ cores]
        C4[GPU: Recommended]
    end

    subgraph "Docker Deployment"
        D1[Memory: 1GB+]
        D2[Disk: 3GB+]
        D3[CPU: 2+ cores]
        D4[Ports: 11235]
        D5[Shared Memory: 1GB]
    end

    style A1 fill:#e8f5e8
    style B1 fill:#fff3e0
    style C1 fill:#ffecb3
    style D1 fill:#e3f2fd
```

**📖 Learn more:** [Installation Guide](https://docs.crawl4ai.com/core/installation/), [Docker Deployment](https://docs.crawl4ai.com/core/docker-deployment/), [System Requirements](https://docs.crawl4ai.com/core/installation/#prerequisites)

5912
docs/md_v2/assets/llm.txt/diagrams/llms-diagram.txt
Normal file
392
docs/md_v2/assets/llm.txt/diagrams/multi_urls_crawling.txt
Normal file
@@ -0,0 +1,392 @@
## Multi-URL Crawling Workflows and Architecture

Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.

### Multi-URL Processing Modes

```mermaid
flowchart TD
    A[Multi-URL Crawling Request] --> B{Processing Mode?}

    B -->|Batch Mode| C[Collect All URLs]
    B -->|Streaming Mode| D[Process URLs Individually]

    C --> C1[Queue All URLs]
    C1 --> C2[Execute Concurrently]
    C2 --> C3[Wait for All Completion]
    C3 --> C4[Return Complete Results Array]

    D --> D1[Queue URLs]
    D1 --> D2[Start First Batch]
    D2 --> D3[Yield Results as Available]
    D3 --> D4{More URLs?}
    D4 -->|Yes| D5[Start Next URLs]
    D4 -->|No| D6[Stream Complete]
    D5 --> D3

    C4 --> E[Process Results]
    D6 --> E

    E --> F[Success/Failure Analysis]
    F --> G[End]

    style C fill:#e3f2fd
    style D fill:#f3e5f5
    style C4 fill:#c8e6c9
    style D6 fill:#c8e6c9
```
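
The two modes above map directly onto standard asyncio patterns: batch mode gathers every result before returning, while streaming mode yields each result as it completes. A minimal sketch, where `fetch` is a hypothetical stand-in for a real per-URL crawl call:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real crawl of one URL.
    await asyncio.sleep(0)
    return f"content of {url}"

async def batch_mode(urls: list[str]) -> list[str]:
    # Batch mode: wait for all URLs, return one complete results array.
    return await asyncio.gather(*(fetch(u) for u in urls))

async def stream_mode(urls: list[str]):
    # Streaming mode: yield each result as soon as it is available.
    for task in asyncio.as_completed([fetch(u) for u in urls]):
        yield await task

async def main() -> None:
    urls = ["https://example.com/a", "https://example.com/b"]
    print(await batch_mode(urls))          # all results at once
    async for result in stream_mode(urls): # results one by one
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

Batch mode is simpler to reason about; streaming keeps memory flat and lets downstream processing start before the last URL finishes.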

### Memory-Adaptive Dispatcher Flow

```mermaid
stateDiagram-v2
    [*] --> Initializing

    Initializing --> MonitoringMemory: Start dispatcher

    MonitoringMemory --> CheckingMemory: Every check_interval
    CheckingMemory --> MemoryOK: Memory < threshold
    CheckingMemory --> MemoryHigh: Memory >= threshold

    MemoryOK --> DispatchingTasks: Start new crawls
    MemoryHigh --> WaitingForMemory: Pause dispatching

    DispatchingTasks --> TaskRunning: Launch crawler
    TaskRunning --> TaskCompleted: Crawl finished
    TaskRunning --> TaskFailed: Crawl error

    TaskCompleted --> MonitoringMemory: Update stats
    TaskFailed --> MonitoringMemory: Update stats

    WaitingForMemory --> CheckingMemory: Wait timeout
    WaitingForMemory --> MonitoringMemory: Memory freed

    note right of MemoryHigh: Prevents OOM crashes
    note right of DispatchingTasks: Respects max_session_permit
    note right of WaitingForMemory: Configurable timeout
```
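
The core decision in this state machine is a simple gate: dispatch a new crawl only when memory is below the threshold and the session cap has headroom. A minimal sketch, assuming the dispatcher samples memory itself on `check_interval`; the parameter names mirror the `memory_threshold_percent` and `max_session_permit` options referenced in the diagram:

```python
def can_dispatch(memory_used_percent: float,
                 active_tasks: int,
                 memory_threshold_percent: float = 70.0,
                 max_session_permit: int = 10) -> bool:
    """Illustrative gate for the MemoryOK/MemoryHigh decision above."""
    if memory_used_percent >= memory_threshold_percent:
        return False  # MemoryHigh -> WaitingForMemory (prevents OOM)
    if active_tasks >= max_session_permit:
        return False  # concurrency cap reached, keep monitoring
    return True       # MemoryOK -> DispatchingTasks
```

The real dispatcher also retries the check after a wait timeout, but the dispatch condition itself reduces to this predicate.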

### Concurrent Crawling Architecture

```mermaid
graph TB
    subgraph "URL Queue Management"
        A[URL Input List] --> B[URL Queue]
        B --> C[Priority Scheduler]
        C --> D[Batch Assignment]
    end

    subgraph "Dispatcher Layer"
        E[Memory Adaptive Dispatcher]
        F[Semaphore Dispatcher]
        G[Rate Limiter]
        H[Resource Monitor]

        E --> I[Memory Checker]
        F --> J[Concurrency Controller]
        G --> K[Delay Calculator]
        H --> L[System Stats]
    end

    subgraph "Crawler Pool"
        M[Crawler Instance 1]
        N[Crawler Instance 2]
        O[Crawler Instance 3]
        P[Crawler Instance N]

        M --> Q[Browser Session 1]
        N --> R[Browser Session 2]
        O --> S[Browser Session 3]
        P --> T[Browser Session N]
    end

    subgraph "Result Processing"
        U[Result Collector]
        V[Success Handler]
        W[Error Handler]
        X[Retry Queue]
        Y[Final Results]
    end

    D --> E
    D --> F
    E --> M
    F --> N
    G --> O
    H --> P

    Q --> U
    R --> U
    S --> U
    T --> U

    U --> V
    U --> W
    W --> X
    X --> B
    V --> Y

    style E fill:#e3f2fd
    style F fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#fff3e0
```

### Rate Limiting and Backoff Strategy

```mermaid
sequenceDiagram
    participant C as Crawler
    participant RL as Rate Limiter
    participant S as Server
    participant D as Dispatcher

    C->>RL: Request to crawl URL
    RL->>RL: Calculate delay
    RL->>RL: Apply base delay (1-3s)
    RL->>C: Delay applied

    C->>S: HTTP Request

    alt Success Response
        S-->>C: 200 OK + Content
        C->>RL: Report success
        RL->>RL: Reset failure count
        C->>D: Return successful result
    else Rate Limited
        S-->>C: 429 Too Many Requests
        C->>RL: Report rate limit
        RL->>RL: Exponential backoff
        RL->>RL: Increase delay (up to max_delay)
        RL->>C: Apply longer delay
        C->>S: Retry request after delay
    else Server Error
        S-->>C: 503 Service Unavailable
        C->>RL: Report server error
        RL->>RL: Moderate backoff
        RL->>C: Retry with backoff
    else Max Retries Exceeded
        RL->>C: Stop retrying
        C->>D: Return failed result
    end
```
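
The delay policy in this sequence can be sketched in a few lines: a random base delay for ordinary requests, and an exponential backoff capped at a maximum for 429/503 responses. This is an illustrative sketch of the policy, not the library's exact implementation; the `base_delay` and `max_delay` names mirror the values shown in the diagram:

```python
import random

def next_delay(status_code: int, attempt: int,
               base_delay: tuple[float, float] = (1.0, 3.0),
               max_delay: float = 60.0) -> float:
    """Seconds to wait before the next request for this domain."""
    if status_code in (429, 503):
        # Exponential backoff with jitter, capped at max_delay.
        return min(max_delay, (2 ** attempt) * random.uniform(*base_delay))
    # Normal case: random base delay to avoid hammering the server.
    return random.uniform(*base_delay)
```

On a success the failure count (and hence `attempt`) resets, so delays return to the 1-3s base range.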

### Large-Scale Crawling Workflow

```mermaid
flowchart TD
    A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]

    B --> C{Select Dispatcher Type}
    C -->|Memory Constrained| D[Memory Adaptive]
    C -->|Fixed Resources| E[Semaphore Based]

    D --> F[Set Memory Threshold 70%]
    E --> G[Set Concurrency Limit]

    F --> H[Configure Monitoring]
    G --> H

    H --> I[Start Crawling Process]
    I --> J[Monitor System Resources]

    J --> K{Memory Usage?}
    K -->|< Threshold| L[Continue Dispatching]
    K -->|>= Threshold| M[Pause New Tasks]

    L --> N[Process Results Stream]
    M --> O[Wait for Memory]
    O --> K

    N --> P{Result Type?}
    P -->|Success| Q[Save to Database]
    P -->|Failure| R[Log Error]

    Q --> S[Update Progress Counter]
    R --> S

    S --> T{More URLs?}
    T -->|Yes| U[Get Next Batch]
    T -->|No| V[Generate Final Report]

    U --> L
    V --> W[Analysis Complete]

    style A fill:#e1f5fe
    style D fill:#e8f5e8
    style E fill:#f3e5f5
    style V fill:#c8e6c9
    style W fill:#a5d6a7
```

### Real-Time Monitoring Dashboard Flow

```mermaid
graph LR
    subgraph "Data Collection"
        A[Crawler Tasks] --> B[Performance Metrics]
        A --> C[Memory Usage]
        A --> D[Success/Failure Rates]
        A --> E[Response Times]
    end

    subgraph "Monitor Processing"
        F[CrawlerMonitor] --> G[Aggregate Statistics]
        F --> H[Display Formatter]
        F --> I[Update Scheduler]
    end

    subgraph "Display Modes"
        J[DETAILED Mode]
        K[AGGREGATED Mode]

        J --> L[Individual Task Status]
        J --> M[Task-Level Metrics]
        K --> N[Summary Statistics]
        K --> O[Overall Progress]
    end

    subgraph "Output Interface"
        P[Console Display]
        Q[Progress Bars]
        R[Status Tables]
        S[Real-time Updates]
    end

    B --> F
    C --> F
    D --> F
    E --> F

    G --> J
    G --> K
    H --> J
    H --> K
    I --> J
    I --> K

    L --> P
    M --> Q
    N --> R
    O --> S

    style F fill:#e3f2fd
    style J fill:#f3e5f5
    style K fill:#e8f5e8
```

### Error Handling and Recovery Pattern

```mermaid
stateDiagram-v2
    [*] --> ProcessingURL

    ProcessingURL --> CrawlAttempt: Start crawl

    CrawlAttempt --> Success: HTTP 200
    CrawlAttempt --> NetworkError: Connection failed
    CrawlAttempt --> RateLimit: HTTP 429
    CrawlAttempt --> ServerError: HTTP 5xx
    CrawlAttempt --> Timeout: Request timeout

    Success --> [*]: Return result

    NetworkError --> RetryCheck: Check retry count
    RateLimit --> BackoffWait: Apply exponential backoff
    ServerError --> RetryCheck: Check retry count
    Timeout --> RetryCheck: Check retry count

    BackoffWait --> RetryCheck: After delay

    RetryCheck --> CrawlAttempt: retries < max_retries
    RetryCheck --> Failed: retries >= max_retries

    Failed --> ErrorLog: Log failure details
    ErrorLog --> [*]: Return failed result

    note right of BackoffWait: Exponential backoff for rate limits
    note right of RetryCheck: Configurable max_retries
    note right of ErrorLog: Detailed error tracking
```

### Resource Management Timeline

```mermaid
gantt
    title Multi-URL Crawling Resource Management
    dateFormat X
    axisFormat %s

    section Memory Usage
    Initialize Dispatcher :0, 1
    Memory Monitoring :1, 10
    Peak Usage Period :3, 7
    Memory Cleanup :7, 9

    section Task Execution
    URL Queue Setup :0, 2
    Batch 1 Processing :2, 5
    Batch 2 Processing :4, 7
    Batch 3 Processing :6, 9
    Final Results :9, 10

    section Rate Limiting
    Normal Delays :2, 4
    Backoff Period :4, 6
    Recovery Period :6, 8

    section Monitoring
    System Health Check :0, 10
    Progress Updates :1, 9
    Performance Metrics :2, 8
```

### Concurrent Processing Performance Matrix

```mermaid
graph TD
    subgraph "Input Factors"
        A[Number of URLs]
        B[Concurrency Level]
        C[Memory Threshold]
        D[Rate Limiting]
    end

    subgraph "Processing Characteristics"
        A --> E[Low 1-100 URLs]
        A --> F[Medium 100-1k URLs]
        A --> G[High 1k-10k URLs]
        A --> H[Very High 10k+ URLs]

        B --> I[Conservative 1-5]
        B --> J[Moderate 5-15]
        B --> K[Aggressive 15-30]

        C --> L[Strict 60-70%]
        C --> M[Balanced 70-80%]
        C --> N[Relaxed 80-90%]
    end

    subgraph "Recommended Configurations"
        E --> O[Simple Semaphore]
        F --> P[Memory Adaptive Basic]
        G --> Q[Memory Adaptive Advanced]
        H --> R[Memory Adaptive + Monitoring]

        I --> O
        J --> P
        K --> Q
        K --> R

        L --> Q
        M --> P
        N --> O
    end

    style O fill:#c8e6c9
    style P fill:#fff3e0
    style Q fill:#ffecb3
    style R fill:#ffcdd2
```

**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)

411
docs/md_v2/assets/llm.txt/diagrams/simple_crawling.txt
Normal file
@@ -0,0 +1,411 @@
## Simple Crawling Workflows and Data Flow

Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.

### Basic Crawling Sequence

```mermaid
sequenceDiagram
    participant User
    participant Crawler as AsyncWebCrawler
    participant Browser as Browser Instance
    participant Page as Web Page
    participant Processor as Content Processor

    User->>Crawler: Create with BrowserConfig
    Crawler->>Browser: Launch browser instance
    Browser-->>Crawler: Browser ready

    User->>Crawler: arun(url, CrawlerRunConfig)
    Crawler->>Browser: Create new page/context
    Browser->>Page: Navigate to URL
    Page-->>Browser: Page loaded

    Browser->>Processor: Extract raw HTML
    Processor->>Processor: Clean HTML
    Processor->>Processor: Generate markdown
    Processor->>Processor: Extract media/links
    Processor-->>Crawler: CrawlResult created

    Crawler-->>User: Return CrawlResult

    Note over User,Processor: All processing happens asynchronously
```

### Crawling Configuration Flow

```mermaid
flowchart TD
    A[Start Crawling] --> B{Browser Config Set?}

    B -->|No| B1[Use Default BrowserConfig]
    B -->|Yes| B2[Custom BrowserConfig]

    B1 --> C[Launch Browser]
    B2 --> C

    C --> D{Crawler Run Config Set?}

    D -->|No| D1[Use Default CrawlerRunConfig]
    D -->|Yes| D2[Custom CrawlerRunConfig]

    D1 --> E[Navigate to URL]
    D2 --> E

    E --> F{Page Load Success?}
    F -->|No| F1[Return Error Result]
    F -->|Yes| G[Apply Content Filters]

    G --> G1{excluded_tags set?}
    G1 -->|Yes| G2[Remove specified tags]
    G1 -->|No| G3[Keep all tags]
    G2 --> G4{css_selector set?}
    G3 --> G4

    G4 -->|Yes| G5[Extract selected elements]
    G4 -->|No| G6[Process full page]
    G5 --> H[Generate Markdown]
    G6 --> H

    H --> H1{markdown_generator set?}
    H1 -->|Yes| H2[Use custom generator]
    H1 -->|No| H3[Use default generator]
    H2 --> I[Extract Media and Links]
    H3 --> I

    I --> I1{process_iframes?}
    I1 -->|Yes| I2[Include iframe content]
    I1 -->|No| I3[Skip iframes]
    I2 --> J[Create CrawlResult]
    I3 --> J

    J --> K[Return Result]

    style A fill:#e1f5fe
    style K fill:#c8e6c9
    style F1 fill:#ffcdd2
```

### CrawlResult Data Structure

```mermaid
graph TB
    subgraph "CrawlResult Object"
        A[CrawlResult] --> B[Basic Info]
        A --> C[Content Variants]
        A --> D[Extracted Data]
        A --> E[Media Assets]
        A --> F[Optional Outputs]

        B --> B1[url: Final URL]
        B --> B2[success: Boolean]
        B --> B3[status_code: HTTP Status]
        B --> B4[error_message: Error Details]

        C --> C1[html: Raw HTML]
        C --> C2[cleaned_html: Sanitized HTML]
        C --> C3[markdown: MarkdownGenerationResult]

        C3 --> C3A[raw_markdown: Basic conversion]
        C3 --> C3B[markdown_with_citations: With references]
        C3 --> C3C[fit_markdown: Filtered content]
        C3 --> C3D[references_markdown: Citation list]

        D --> D1[links: Internal/External]
        D --> D2[media: Images/Videos/Audio]
        D --> D3[metadata: Page info]
        D --> D4[extracted_content: JSON data]
        D --> D5[tables: Structured table data]

        E --> E1[screenshot: Base64 image]
        E --> E2[pdf: PDF bytes]
        E --> E3[mhtml: Archive file]
        E --> E4[downloaded_files: File paths]

        F --> F1[session_id: Browser session]
        F --> F2[ssl_certificate: Security info]
        F --> F3[response_headers: HTTP headers]
        F --> F4[network_requests: Traffic log]
        F --> F5[console_messages: Browser logs]
    end

    style A fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D5 fill:#e8f5e8
```

### Content Processing Pipeline

```mermaid
flowchart LR
    subgraph "Input Sources"
        A1[Web URL]
        A2[Raw HTML]
        A3[Local File]
    end

    A1 --> B[Browser Navigation]
    A2 --> C[Direct Processing]
    A3 --> C

    B --> D[Raw HTML Capture]
    C --> D

    D --> E{Content Filtering}

    E --> E1[Remove Scripts/Styles]
    E --> E2[Apply excluded_tags]
    E --> E3[Apply css_selector]
    E --> E4[Remove overlay elements]

    E1 --> F[Cleaned HTML]
    E2 --> F
    E3 --> F
    E4 --> F

    F --> G{Markdown Generation}

    G --> G1[HTML to Markdown]
    G --> G2[Apply Content Filter]
    G --> G3[Generate Citations]

    G1 --> H[MarkdownGenerationResult]
    G2 --> H
    G3 --> H

    F --> I{Media Extraction}
    I --> I1[Find Images]
    I --> I2[Find Videos/Audio]
    I --> I3[Score Relevance]
    I1 --> J[Media Dictionary]
    I2 --> J
    I3 --> J

    F --> K{Link Extraction}
    K --> K1[Internal Links]
    K --> K2[External Links]
    K --> K3[Apply Link Filters]
    K1 --> L[Links Dictionary]
    K2 --> L
    K3 --> L

    H --> M[Final CrawlResult]
    J --> M
    L --> M

    style D fill:#e3f2fd
    style F fill:#f3e5f5
    style H fill:#e8f5e8
    style M fill:#c8e6c9
```

### Table Extraction Workflow

```mermaid
stateDiagram-v2
    [*] --> DetectTables

    DetectTables --> ScoreTables: Find table elements

    ScoreTables --> EvaluateThreshold: Calculate quality scores
    EvaluateThreshold --> PassThreshold: score >= table_score_threshold
    EvaluateThreshold --> RejectTable: score < threshold

    PassThreshold --> ExtractHeaders: Parse table structure
    ExtractHeaders --> ExtractRows: Get header cells
    ExtractRows --> ExtractMetadata: Get data rows
    ExtractMetadata --> CreateTableObject: Get caption/summary

    CreateTableObject --> AddToResult: {headers, rows, caption, summary}
    AddToResult --> [*]: Table extraction complete

    RejectTable --> [*]: Table skipped

    note right of ScoreTables: Factors: header presence, data density, structure quality
    note right of EvaluateThreshold: Threshold 1-10, higher = stricter
```
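
The scoring step can be illustrated with a small heuristic combining the three factors named above (header presence, data density, structural consistency). This is a sketch for intuition only; the library's actual scoring formula is internal and may weigh these factors differently:

```python
def score_table(headers: list[str], rows: list[list[str]]) -> int:
    """Score a table from 1-10; higher means more likely real data."""
    score = 1
    if headers:
        score += 3  # header presence
    if rows:
        filled = sum(1 for row in rows for cell in row if cell.strip())
        total = sum(len(row) for row in rows) or 1
        score += round(4 * filled / total)  # data density
        if len({len(row) for row in rows}) == 1:
            score += 2  # consistent row width (structure quality)
    return min(score, 10)

# A table is kept when score_table(...) >= table_score_threshold.
```

A layout table (sparse cells, ragged rows, no headers) scores near the bottom of the range and is skipped at the default threshold.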

### Error Handling Decision Tree

```mermaid
flowchart TD
    A[Start Crawl] --> B[Navigate to URL]

    B --> C{Navigation Success?}
    C -->|Network Error| C1[Set error_message: Network failure]
    C -->|Timeout| C2[Set error_message: Page timeout]
    C -->|Invalid URL| C3[Set error_message: Invalid URL format]
    C -->|Success| D[Process Page Content]

    C1 --> E[success = False]
    C2 --> E
    C3 --> E

    D --> F{Content Processing OK?}
    F -->|Parser Error| F1[Set error_message: HTML parsing failed]
    F -->|Memory Error| F2[Set error_message: Insufficient memory]
    F -->|Success| G[Generate Outputs]

    F1 --> E
    F2 --> E

    G --> H{Output Generation OK?}
    H -->|Markdown Error| H1[Partial success with warnings]
    H -->|Extraction Error| H2[Partial success with warnings]
    H -->|Success| I[success = True]

    H1 --> I
    H2 --> I

    E --> J[Return Failed CrawlResult]
    I --> K[Return Successful CrawlResult]

    J --> L[User Error Handling]
    K --> M[User Result Processing]

    L --> L1{Check error_message}
    L1 -->|Network| L2[Retry with different config]
    L1 -->|Timeout| L3[Increase page_timeout]
    L1 -->|Parser| L4[Try different scraping_strategy]

    style E fill:#ffcdd2
    style I fill:#c8e6c9
    style J fill:#ffcdd2
    style K fill:#c8e6c9
```

### Configuration Impact Matrix

```mermaid
graph TB
    subgraph "Configuration Categories"
        A[Content Processing]
        B[Page Interaction]
        C[Output Generation]
        D[Performance]
    end

    subgraph "Configuration Options"
        A --> A1[word_count_threshold]
        A --> A2[excluded_tags]
        A --> A3[css_selector]
        A --> A4[exclude_external_links]

        B --> B1[process_iframes]
        B --> B2[remove_overlay_elements]
        B --> B3[scan_full_page]
        B --> B4[wait_for]

        C --> C1[screenshot]
        C --> C2[pdf]
        C --> C3[markdown_generator]
        C --> C4[table_score_threshold]

        D --> D1[cache_mode]
        D --> D2[verbose]
        D --> D3[page_timeout]
        D --> D4[semaphore_count]
    end

    subgraph "Result Impact"
        A1 --> R1[Filters short text blocks]
        A2 --> R2[Removes specified HTML tags]
        A3 --> R3[Focuses on selected content]
        A4 --> R4[Cleans links dictionary]

        B1 --> R5[Includes iframe content]
        B2 --> R6[Removes popups/modals]
        B3 --> R7[Loads dynamic content]
        B4 --> R8[Waits for specific elements]

        C1 --> R9[Adds screenshot field]
        C2 --> R10[Adds pdf field]
        C3 --> R11[Custom markdown processing]
        C4 --> R12[Filters table quality]

        D1 --> R13[Controls caching behavior]
        D2 --> R14[Detailed logging output]
        D3 --> R15[Prevents timeout errors]
        D4 --> R16[Limits concurrent operations]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
```

### Raw HTML and Local File Processing

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant Processor
    participant FileSystem

    Note over User,FileSystem: Raw HTML Processing
    User->>Crawler: arun("raw://html_content")
    Crawler->>Processor: Parse raw HTML directly
    Processor->>Processor: Apply same content filters
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Local File Processing
    User->>Crawler: arun("file:///path/to/file.html")
    Crawler->>FileSystem: Read local file
    FileSystem-->>Crawler: File content
    Crawler->>Processor: Process file HTML
    Processor->>Processor: Apply content processing
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Both return identical CrawlResult structure
```

### Comprehensive Processing Example Flow

```mermaid
flowchart TD
    A[Input: example.com] --> B[Create Configurations]

    B --> B1[BrowserConfig verbose=True]
    B --> B2[CrawlerRunConfig with filters]

    B1 --> C[Launch AsyncWebCrawler]
    B2 --> C

    C --> D[Navigate and Process]

    D --> E{Check Success}
    E -->|Failed| E1[Print Error Message]
    E -->|Success| F[Extract Content Summary]

    F --> F1[Get Page Title]
    F --> F2[Get Content Preview]
    F --> F3[Process Media Items]
    F --> F4[Process Links]

    F3 --> F3A[Count Images]
    F3 --> F3B[Show First 3 Images]

    F4 --> F4A[Count Internal Links]
    F4 --> F4B[Show First 3 Links]

    F1 --> G[Display Results]
    F2 --> G
    F3A --> G
    F3B --> G
    F4A --> G
    F4B --> G

    E1 --> H[End with Error]
    G --> I[End with Success]

    style E1 fill:#ffcdd2
    style G fill:#c8e6c9
    style H fill:#ffcdd2
    style I fill:#c8e6c9
```

**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)

441
docs/md_v2/assets/llm.txt/diagrams/url_seeder.txt
Normal file
@@ -0,0 +1,441 @@
## URL Seeding Workflows and Architecture

Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.

### URL Seeding vs Deep Crawling Strategy Comparison

```mermaid
graph TB
    subgraph "Deep Crawling Approach"
        A1[Start URL] --> A2[Load Page]
        A2 --> A3[Extract Links]
        A3 --> A4{More Links?}
        A4 -->|Yes| A5[Queue Next Page]
        A5 --> A2
        A4 -->|No| A6[Complete]

        A7[⏱️ Real-time Discovery]
        A8[🐌 Sequential Processing]
        A9[🔍 Limited by Page Structure]
        A10[💾 High Memory Usage]
    end

    subgraph "URL Seeding Approach"
        B1[Domain Input] --> B2[Query Sitemap]
        B1 --> B3[Query Common Crawl]
        B2 --> B4[Merge Results]
        B3 --> B4
        B4 --> B5[Apply Filters]
        B5 --> B6[Score Relevance]
        B6 --> B7[Rank Results]
        B7 --> B8[Select Top URLs]

        B9[⚡ Instant Discovery]
        B10[🚀 Parallel Processing]
        B11[🎯 Pattern-based Filtering]
        B12[💡 Smart Relevance Scoring]
    end

    style A1 fill:#ffecb3
    style B1 fill:#e8f5e8
    style A6 fill:#ffcdd2
    style B8 fill:#c8e6c9
```
|
||||
|
||||
### URL Discovery Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Seeder as AsyncUrlSeeder
    participant SM as Sitemap
    participant CC as Common Crawl
    participant Filter as URL Filter
    participant Scorer as BM25 Scorer

    User->>Seeder: urls("example.com", config)

    par Parallel Data Sources
        Seeder->>SM: Fetch sitemap.xml
        SM-->>Seeder: 500 URLs
    and
        Seeder->>CC: Query Common Crawl
        CC-->>Seeder: 2000 URLs
    end

    Seeder->>Seeder: Merge and deduplicate
    Note over Seeder: 2200 unique URLs

    Seeder->>Filter: Apply pattern filter
    Filter-->>Seeder: 800 matching URLs

    alt extract_head=True
        loop For each URL
            Seeder->>Seeder: Extract <head> metadata
        end
        Note over Seeder: Title, description, keywords
    end

    alt query provided
        Seeder->>Scorer: Calculate relevance scores
        Scorer-->>Seeder: Scored URLs
        Seeder->>Seeder: Filter by score_threshold
        Note over Seeder: 200 relevant URLs
    end

    Seeder->>Seeder: Sort by relevance
    Seeder->>Seeder: Apply max_urls limit
    Seeder-->>User: Top 100 URLs ready for crawling
```
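The merge → deduplicate → filter → score → limit flow traced in the sequence diagram can be sketched in plain Python. This is a simplified illustration of the data flow, not the `AsyncUrlSeeder` implementation; the `discover` function and its scoring stub are made up for this sketch.

```python
from fnmatch import fnmatch

def discover(sitemap_urls, cc_urls, pattern="*", score_fn=None,
             score_threshold=0.0, max_urls=100):
    """Merge, deduplicate, filter, score, and limit candidate URLs."""
    merged = list(dict.fromkeys(sitemap_urls + cc_urls))  # dedupe, keep order
    matching = [u for u in merged if fnmatch(u, pattern)]  # pattern filter
    if score_fn is None:
        return matching[:max_urls]
    scored = [(u, score_fn(u)) for u in matching]
    relevant = [(u, s) for u, s in scored if s >= score_threshold]
    relevant.sort(key=lambda x: x[1], reverse=True)  # best first
    return [u for u, _ in relevant[:max_urls]]

urls = discover(
    ["https://example.com/blog/a", "https://example.com/about"],
    ["https://example.com/blog/b", "https://example.com/blog/a"],
    pattern="*/blog/*",
    score_fn=lambda u: 1.0 if "blog" in u else 0.0,
    score_threshold=0.5,
)
print(urls)
```

Note the ordering of the stages: deduplication happens before filtering and scoring, so each URL is scored at most once.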
### SeedingConfig Decision Tree

```mermaid
flowchart TD
    A[SeedingConfig Setup] --> B{Data Source Strategy?}

    B -->|Fast & Official| C[source="sitemap"]
    B -->|Comprehensive| D[source="cc"]
    B -->|Maximum Coverage| E[source="sitemap+cc"]

    C --> F{Need Filtering?}
    D --> F
    E --> F

    F -->|Yes| G[Set URL Pattern]
    F -->|No| H[pattern="*"]

    G --> I{Pattern Examples}
    I --> I1[pattern="*/blog/*"]
    I --> I2[pattern="*/docs/api/*"]
    I --> I3[pattern="*.pdf"]
    I --> I4[pattern="*/product/*"]

    H --> J{Need Metadata?}
    I1 --> J
    I2 --> J
    I3 --> J
    I4 --> J

    J -->|Yes| K[extract_head=True]
    J -->|No| L[extract_head=False]

    K --> M{Need Validation?}
    L --> M

    M -->|Yes| N[live_check=True]
    M -->|No| O[live_check=False]

    N --> P{Need Relevance Scoring?}
    O --> P

    P -->|Yes| Q[Set Query + BM25]
    P -->|No| R[Skip Scoring]

    Q --> S[query="search terms"]
    S --> T[scoring_method="bm25"]
    T --> U[score_threshold=0.3]

    R --> V[Performance Tuning]
    U --> V

    V --> W[Set max_urls]
    W --> X[Set concurrency]
    X --> Y[Set hits_per_sec]
    Y --> Z[Configuration Complete]

    style A fill:#e3f2fd
    style Z fill:#c8e6c9
    style K fill:#fff3e0
    style N fill:#fff3e0
    style Q fill:#f3e5f5
```
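Walking one full path through the tree above produces a concrete configuration. The keyword names below are taken from the diagram; treat this as a sketch of the arguments you would pass to `SeedingConfig`, not the library's authoritative signature.

```python
# One path through the decision tree: maximum coverage, filtered,
# scored, and tuned. Parameter names come from the diagram above;
# the SeedingConfig call itself is left commented as an assumption.
seeding_kwargs = dict(
    source="sitemap+cc",       # maximum coverage
    pattern="*/docs/api/*",    # filter early by URL pattern
    extract_head=True,         # metadata needed for scoring
    live_check=True,           # validate URLs are reachable
    query="python async tutorial",
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=100,              # performance tuning
    concurrency=20,
    hits_per_sec=50,
)
# config = SeedingConfig(**seeding_kwargs)
print(sorted(seeding_kwargs))
```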
### BM25 Relevance Scoring Pipeline

```mermaid
graph TB
    subgraph "Text Corpus Preparation"
        A1[URL Collection] --> A2[Extract Metadata]
        A2 --> A3[Title + Description + Keywords]
        A3 --> A4[Tokenize Text]
        A4 --> A5[Remove Stop Words]
        A5 --> A6[Create Document Corpus]
    end

    subgraph "BM25 Algorithm"
        B1[Query Terms] --> B2[Term Frequency Calculation]
        A6 --> B2
        B2 --> B3[Inverse Document Frequency]
        B3 --> B4[BM25 Score Calculation]
        B4 --> B5["Score = Σ IDF × TF × (k1+1) / (TF + k1 × (1 - b + b × |d|/avgdl))"]
    end

    subgraph "Scoring Results"
        B5 --> C1[URL Relevance Scores]
        C1 --> C2{Score ≥ Threshold?}
        C2 -->|Yes| C3[Include in Results]
        C2 -->|No| C4[Filter Out]
        C3 --> C5[Sort by Score DESC]
        C5 --> C6[Return Top URLs]
    end

    subgraph "Example Scores"
        D1["'python async tutorial' → 0.85"]
        D2["'python documentation' → 0.72"]
        D3["'javascript guide' → 0.23"]
        D4["'contact us page' → 0.05"]
    end

    style B5 fill:#e3f2fd
    style C6 fill:#c8e6c9
    style D1 fill:#c8e6c9
    style D2 fill:#c8e6c9
    style D3 fill:#ffecb3
    style D4 fill:#ffcdd2
```
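The formula in the diagram is standard Okapi BM25. A minimal self-contained sketch of it, to make the terms concrete (illustrative only; the real scorer works over tokenized `<head>` metadata and handles stop words):

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against query tokens with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            tf = doc.count(term)                        # term frequency
            df = sum(1 for d in docs if term in d)      # docs containing term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / denom        # the formula above
        scores.append(score)
    return scores

docs = [
    "python async tutorial for beginners".split(),
    "javascript guide".split(),
]
print(bm25_scores("python async tutorial".split(), docs))
```

A document containing none of the query terms scores exactly zero, which is why the `score_threshold` cut in the "Scoring Results" stage cleanly drops off-topic pages.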
### Multi-Domain Discovery Architecture

```mermaid
graph TB
    subgraph "Input Layer"
        A1[Domain List]
        A2[SeedingConfig]
        A3[Query Terms]
    end

    subgraph "Discovery Engine"
        B1[AsyncUrlSeeder]
        B2[Parallel Workers]
        B3[Rate Limiter]
        B4[Memory Manager]
    end

    subgraph "Data Sources"
        C1[Sitemap Fetcher]
        C2[Common Crawl API]
        C3[Live URL Checker]
        C4[Metadata Extractor]
    end

    subgraph "Processing Pipeline"
        D1[URL Deduplication]
        D2[Pattern Filtering]
        D3[Relevance Scoring]
        D4[Quality Assessment]
    end

    subgraph "Output Layer"
        E1[Scored URL Lists]
        E2[Domain Statistics]
        E3[Performance Metrics]
        E4[Cache Storage]
    end

    A1 --> B1
    A2 --> B1
    A3 --> B1

    B1 --> B2
    B2 --> B3
    B3 --> B4

    B2 --> C1
    B2 --> C2
    B2 --> C3
    B2 --> C4

    C1 --> D1
    C2 --> D1
    C3 --> D2
    C4 --> D3

    D1 --> D2
    D2 --> D3
    D3 --> D4

    D4 --> E1
    B4 --> E2
    B3 --> E3
    D1 --> E4

    style B1 fill:#e3f2fd
    style D3 fill:#f3e5f5
    style E1 fill:#c8e6c9
```
### Complete Discovery-to-Crawl Pipeline

```mermaid
stateDiagram-v2
    [*] --> Discovery

    Discovery --> SourceSelection: Configure data sources
    SourceSelection --> Sitemap: source="sitemap"
    SourceSelection --> CommonCrawl: source="cc"
    SourceSelection --> Both: source="sitemap+cc"

    Sitemap --> URLCollection
    CommonCrawl --> URLCollection
    Both --> URLCollection

    URLCollection --> Filtering: Apply patterns
    Filtering --> MetadataExtraction: extract_head=True
    Filtering --> LiveValidation: extract_head=False

    MetadataExtraction --> LiveValidation: live_check=True
    MetadataExtraction --> RelevanceScoring: live_check=False
    LiveValidation --> RelevanceScoring

    RelevanceScoring --> ResultRanking: query provided
    RelevanceScoring --> ResultLimiting: no query

    ResultRanking --> ResultLimiting: apply score_threshold
    ResultLimiting --> URLSelection: apply max_urls

    URLSelection --> CrawlPreparation: URLs ready
    CrawlPreparation --> CrawlExecution: AsyncWebCrawler

    CrawlExecution --> StreamProcessing: stream=True
    CrawlExecution --> BatchProcessing: stream=False

    StreamProcessing --> [*]
    BatchProcessing --> [*]

    note right of Discovery : 🔍 Smart URL Discovery
    note right of URLCollection : 📚 Merge & Deduplicate
    note right of RelevanceScoring : 🎯 BM25 Algorithm
    note right of CrawlExecution : 🕷️ High-Performance Crawling
```
### Performance Optimization Strategies

```mermaid
graph LR
    subgraph "Input Optimization"
        A1[Smart Source Selection] --> A2[Sitemap First]
        A2 --> A3[Add CC if Needed]
        A3 --> A4[Pattern Filtering Early]
    end

    subgraph "Processing Optimization"
        B1[Parallel Workers] --> B2[Bounded Queues]
        B2 --> B3[Rate Limiting]
        B3 --> B4[Memory Management]
        B4 --> B5[Lazy Evaluation]
    end

    subgraph "Output Optimization"
        C1[Relevance Threshold] --> C2[Max URL Limits]
        C2 --> C3[Caching Strategy]
        C3 --> C4[Streaming Results]
    end

    subgraph "Performance Metrics"
        D1[URLs/Second: 100-1000]
        D2[Memory Usage: Bounded]
        D3[Network Efficiency: 95%+]
        D4[Cache Hit Rate: 80%+]
    end

    A4 --> B1
    B5 --> C1
    C4 --> D1

    style A2 fill:#e8f5e8
    style B2 fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D3 fill:#c8e6c9
```
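The "Parallel Workers", "Bounded Queues", and "Rate Limiting" boxes above combine into a well-known asyncio pattern. A stdlib-only sketch of that pattern follows (an illustration, not the seeder's internals; `bounded_fetch` and the sleep standing in for a network request are made up):

```python
import asyncio

async def bounded_fetch(urls, max_concurrent=10, hits_per_sec=50):
    """Process URLs with a concurrency bound and a simple per-request rate limit."""
    semaphore = asyncio.Semaphore(max_concurrent)  # bounded worker pool
    min_interval = 1.0 / hits_per_sec              # crude rate limiting

    async def worker(url):
        async with semaphore:
            await asyncio.sleep(min_interval)      # stand-in for a real request
            return f"checked:{url}"

    # gather keeps memory bounded to the URL list plus max_concurrent tasks in flight
    return await asyncio.gather(*(worker(u) for u in urls))

results = asyncio.run(bounded_fetch([f"https://example.com/p{i}" for i in range(5)]))
print(results)
```

The semaphore bounds how many requests are in flight at once, while the per-request interval caps sustained throughput; tuning the two independently mirrors the `concurrency` and `hits_per_sec` knobs mentioned elsewhere in this document.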
### URL Discovery vs Traditional Crawling Comparison

```mermaid
graph TB
    subgraph "Traditional Approach"
        T1[Start URL] --> T2[Crawl Page]
        T2 --> T3[Extract Links]
        T3 --> T4[Queue New URLs]
        T4 --> T2

        T5[❌ Time: Hours/Days]
        T6[❌ Resource Heavy]
        T7[❌ Depth Limited]
        T8[❌ Discovery Bias]
    end

    subgraph "URL Seeding Approach"
        S1[Domain Input] --> S2[Query All Sources]
        S2 --> S3[Pattern Filter]
        S3 --> S4[Relevance Score]
        S4 --> S5[Select Best URLs]
        S5 --> S6[Ready to Crawl]

        S7[✅ Time: Seconds/Minutes]
        S8[✅ Resource Efficient]
        S9[✅ Complete Coverage]
        S10[✅ Quality Focused]
    end

    subgraph "Use Case Decision Matrix"
        U1["Small Sites < 1000 pages"] --> U2[Use Deep Crawling]
        U3["Large Sites > 10000 pages"] --> U4[Use URL Seeding]
        U5[Unknown Structure] --> U6[Start with Seeding]
        U7[Real-time Discovery] --> U8[Use Deep Crawling]
        U9[Quality over Quantity] --> U10[Use URL Seeding]
    end

    style S6 fill:#c8e6c9
    style S7 fill:#c8e6c9
    style S8 fill:#c8e6c9
    style S9 fill:#c8e6c9
    style S10 fill:#c8e6c9
    style T5 fill:#ffcdd2
    style T6 fill:#ffcdd2
    style T7 fill:#ffcdd2
    style T8 fill:#ffcdd2
```
### Data Source Characteristics and Selection

```mermaid
graph TB
    subgraph "Sitemap Source"
        SM1[📋 Official URL List]
        SM2[⚡ Fast Response]
        SM3[📅 Recently Updated]
        SM4[🎯 High Quality URLs]
        SM5[❌ May Miss Some Pages]
    end

    subgraph "Common Crawl Source"
        CC1[🌐 Comprehensive Coverage]
        CC2[📚 Historical Data]
        CC3[🔍 Deep Discovery]
        CC4[⏳ Slower Response]
        CC5[🧹 May Include Noise]
    end

    subgraph "Combined Strategy"
        CB1[🚀 Best of Both]
        CB2[📊 Maximum Coverage]
        CB3[✨ Automatic Deduplication]
        CB4[⚖️ Balanced Performance]
    end

    subgraph "Selection Guidelines"
        G1[Speed Critical → Sitemap Only]
        G2[Coverage Critical → Common Crawl]
        G3[Best Quality → Combined]
        G4[Unknown Domain → Combined]
    end

    style SM2 fill:#c8e6c9
    style SM4 fill:#c8e6c9
    style CC1 fill:#e3f2fd
    style CC3 fill:#e3f2fd
    style CB1 fill:#f3e5f5
    style CB3 fill:#f3e5f5
```
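The selection guidelines above collapse into a tiny decision helper. This is purely illustrative; `choose_source` is a made-up name that just returns the `source` values used throughout this document:

```python
def choose_source(speed_critical=False, coverage_critical=False):
    """Map the selection guidelines onto a SeedingConfig source value."""
    if speed_critical and not coverage_critical:
        return "sitemap"      # fast, official URL list
    if coverage_critical and not speed_critical:
        return "cc"           # comprehensive Common Crawl coverage
    return "sitemap+cc"       # best quality / unknown domain

print(choose_source(speed_critical=True))
```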
**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

295
docs/md_v2/assets/llm.txt/txt/cli.txt
Normal file
@@ -0,0 +1,295 @@
## CLI & Identity-Based Browsing

Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.

### Basic CLI Usage

```bash
# Simple crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache

# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```

### Profile Management Commands

```bash
# Launch interactive profile manager
crwl profiles

# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website

# Use a specific profile for crawling
crwl https://example.com -p my-profile-name

# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles  # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```
### CDP Browser Management

```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp

# Use specific profile and custom port
crwl cdp -p my-profile -P 9223

# Launch headless browser with CDP
crwl cdp --headless

# Launch in incognito mode (ignores profile)
crwl cdp --incognito

# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```

### Builtin Browser Management

```bash
# Start persistent browser instance
crwl browser start

# Check browser status
crwl browser status

# Open visible window to see the browser
crwl browser view --url https://example.com

# Stop the browser
crwl browser stop

# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless

# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```
### Authentication Workflow Examples

```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → login to LinkedIn in browser → press 'q' to save

# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown

# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
    -p linkedin-profile \
    -j "Extract people profiles with names, titles, and companies" \
    -b "headless=false"

# GitHub authenticated crawling
crwl profiles  # Create github-profile
crwl https://github.com/settings/profile -p github-profile

# Twitter/X authenticated access
crwl profiles  # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```
### Advanced CLI Configuration

```bash
# Complex crawling with multiple configs
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -p my-auth-profile \
    -o json \
    -v

# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
    -p auth-profile \
    -j "Extract user dashboard data including metrics and notifications" \
    -b "headless=true,viewport_width=1920"

# Content filtering with authentication
crwl https://members-only-site.com \
    -p member-profile \
    -f filter_bm25.yml \
    -c "css_selector=.member-content,scan_full_page=true" \
    -o markdown-fit
```
### Configuration Files for Identity Browsing

```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true

# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```
### Global Configuration Management

```bash
# List all configuration settings
crwl config list

# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"

# Set browser defaults
crwl config set BROWSER_HEADLESS false  # Always show browser
crwl config set USER_AGENT_MODE random  # Random user agents

# Enable verbose mode globally
crwl config set VERBOSE true
```
### Q&A with Authenticated Content

```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
    -q "What are the key metrics shown in my dashboard?"

# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown  # View content
crwl https://company-intranet.com -p work-profile \
    -q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
    -q "What are the upcoming deadlines?"
```
### Profile Creation Programmatically

```python
# Create profiles via Python API
import asyncio
from crawl4ai import BrowserProfiler

async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```
### Identity Browsing Best Practices

```bash
# 1. Create specific profiles for different sites
crwl profiles  # Create "linkedin-work"
crwl profiles  # Create "github-personal"
crwl profiles  # Create "company-intranet"

# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account

# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
    -p secure-profile \
    -b "headless=false,simulate_user=true,magic=true" \
    -c "wait_for=.logged-in-indicator,page_timeout=30000"

# 4. Test profile before automated crawling
crwl cdp -p test-profile  # Manually verify login status
crwl https://test-url.com -p test-profile -v  # Verbose test crawl
```
### Troubleshooting Authentication Issues

```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
    -b "headless=false,verbose=true" \
    -c "verbose=true,page_timeout=60000" \
    -v

# Check profile status
crwl profiles  # List profiles and check creation dates

# Recreate problematic profiles
crwl profiles  # Delete old profile, create new one

# Test with visible browser
crwl https://problem-site.com -p profile-name \
    -b "headless=false" \
    -c "delay_before_return_html=5"
```
### Common Use Cases

```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
    -j "Extract latest tweets with sentiment and engagement metrics"

# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
    -j "Extract product prices, availability, and descriptions"

# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
    -c "css_selector=.dashboard-content" \
    -q "What alerts or notifications need attention?"

# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
    -e extract_research.yml \
    -s research_schema.json \
    -o json
```

**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)

1171
docs/md_v2/assets/llm.txt/txt/config_objects.txt
Normal file
@@ -0,0 +1,446 @@
## Deep Crawling Filters & Scorers

Advanced URL filtering and scoring strategies for intelligent deep crawling with performance optimization.

### URL Filters - Content and Domain Control

```python
from crawl4ai.deep_crawling.filters import (
    URLPatternFilter, DomainFilter, ContentTypeFilter,
    FilterChain, ContentRelevanceFilter, SEOFilter
)

# Pattern-based filtering
pattern_filter = URLPatternFilter(
    patterns=[
        "*.html",                       # HTML pages only
        "*/blog/*",                     # Blog posts
        "*/articles/*",                 # Article pages
        "*2024*",                       # Recent content
        "^https://example.com/docs/.*"  # Regex pattern
    ],
    use_glob=True,
    reverse=False  # False = include matching, True = exclude matching
)

# Domain filtering with subdomains
domain_filter = DomainFilter(
    allowed_domains=["example.com", "docs.example.com"],
    blocked_domains=["ads.example.com", "tracker.com"]
)

# Content type filtering
content_filter = ContentTypeFilter(
    allowed_types=["text/html", "application/pdf"],
    check_extension=True
)

# Apply individual filters
url = "https://example.com/blog/2024/article.html"
print(f"Pattern filter: {pattern_filter.apply(url)}")
print(f"Domain filter: {domain_filter.apply(url)}")
print(f"Content filter: {content_filter.apply(url)}")
```
### Filter Chaining - Combine Multiple Filters

```python
# Create filter chain for comprehensive filtering
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["example.com"]),
    URLPatternFilter(patterns=["*/blog/*", "*/docs/*"]),
    ContentTypeFilter(allowed_types=["text/html"])
])

# Apply chain to URLs
urls = [
    "https://example.com/blog/post1.html",
    "https://spam.com/content.html",
    "https://example.com/blog/image.jpg",
    "https://example.com/docs/guide.html"
]

async def filter_urls(urls, filter_chain):
    filtered = []
    for url in urls:
        if await filter_chain.apply(url):
            filtered.append(url)
    return filtered

# Usage
filtered_urls = await filter_urls(urls, filter_chain)
print(f"Filtered URLs: {filtered_urls}")

# Check filter statistics
for filter_obj in filter_chain.filters:
    stats = filter_obj.stats
    print(f"{filter_obj.name}: {stats.passed_urls}/{stats.total_urls} passed")
```
### Advanced Content Filters

```python
# BM25-based content relevance filtering
relevance_filter = ContentRelevanceFilter(
    query="python machine learning tutorial",
    threshold=0.5,  # Minimum relevance score
    k1=1.2,         # TF saturation parameter
    b=0.75,         # Length normalization
    avgdl=1000      # Average document length
)

# SEO quality filtering
seo_filter = SEOFilter(
    threshold=0.65,  # Minimum SEO score
    keywords=["python", "tutorial", "guide"],
    weights={
        "title_length": 0.15,
        "title_kw": 0.18,
        "meta_description": 0.12,
        "canonical": 0.10,
        "robot_ok": 0.20,
        "schema_org": 0.10,
        "url_quality": 0.15
    }
)

# Apply advanced filters
url = "https://example.com/python-ml-tutorial"
relevance_score = await relevance_filter.apply(url)
seo_score = await seo_filter.apply(url)

print(f"Relevance: {relevance_score}, SEO: {seo_score}")
```
### URL Scorers - Quality and Relevance Scoring

```python
from crawl4ai.deep_crawling.scorers import (
    KeywordRelevanceScorer, PathDepthScorer, ContentTypeScorer,
    FreshnessScorer, DomainAuthorityScorer, CompositeScorer
)

# Keyword relevance scoring
keyword_scorer = KeywordRelevanceScorer(
    keywords=["python", "tutorial", "guide", "machine", "learning"],
    weight=1.0,
    case_sensitive=False
)

# Path depth scoring (optimal depth = 3)
depth_scorer = PathDepthScorer(
    optimal_depth=3,  # /category/subcategory/article
    weight=0.8
)

# Content type scoring
content_type_scorer = ContentTypeScorer(
    type_weights={
        "html": 1.0,  # Highest priority
        "pdf": 0.8,   # Medium priority
        "txt": 0.6,   # Lower priority
        "doc": 0.4    # Lowest priority
    },
    weight=0.9
)

# Freshness scoring
freshness_scorer = FreshnessScorer(
    weight=0.7,
    current_year=2024
)

# Domain authority scoring
domain_scorer = DomainAuthorityScorer(
    domain_weights={
        "python.org": 1.0,
        "github.com": 0.9,
        "stackoverflow.com": 0.85,
        "medium.com": 0.7,
        "personal-blog.com": 0.3
    },
    default_weight=0.5,
    weight=1.0
)

# Score individual URLs
url = "https://python.org/tutorial/2024/machine-learning.html"
scores = {
    "keyword": keyword_scorer.score(url),
    "depth": depth_scorer.score(url),
    "content": content_type_scorer.score(url),
    "freshness": freshness_scorer.score(url),
    "domain": domain_scorer.score(url)
}

print(f"Individual scores: {scores}")
```
### Composite Scoring - Combine Multiple Scorers

```python
# Create composite scorer combining all strategies
composite_scorer = CompositeScorer(
    scorers=[
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        PathDepthScorer(optimal_depth=3, weight=1.0),
        ContentTypeScorer({"html": 1.0, "pdf": 0.8}, weight=1.2),
        FreshnessScorer(weight=0.8, current_year=2024),
        DomainAuthorityScorer({
            "python.org": 1.0,
            "github.com": 0.9
        }, weight=1.3)
    ],
    normalize=True  # Normalize by number of scorers
)

# Score multiple URLs
urls_to_score = [
    "https://python.org/tutorial/2024/basics.html",
    "https://github.com/user/python-guide/blob/main/README.md",
    "https://random-blog.com/old/2018/python-stuff.html",
    "https://python.org/docs/deep/nested/advanced/guide.html"
]

scored_urls = []
for url in urls_to_score:
    score = composite_scorer.score(url)
    scored_urls.append((url, score))

# Sort by score (highest first)
scored_urls.sort(key=lambda x: x[1], reverse=True)

for url, score in scored_urls:
    print(f"Score: {score:.3f} - {url}")

# Check scorer statistics
print("\nScoring statistics:")
print(f"URLs scored: {composite_scorer.stats._urls_scored}")
print(f"Average score: {composite_scorer.stats.get_average():.3f}")
```
### Advanced Filter Patterns

```python
# Complex pattern matching
advanced_patterns = URLPatternFilter(
    patterns=[
        r"^https://docs\.python\.org/\d+/",     # Python docs with version
        r".*/tutorial/.*\.html$",               # Tutorial pages
        r".*/guide/(?!deprecated).*",           # Guides but not deprecated
        "*/blog/{2020,2021,2022,2023,2024}/*",  # Recent blog posts
        "**/{api,reference}/**/*.html"          # API/reference docs
    ],
    use_glob=True
)

# Exclude patterns (reverse=True)
exclude_filter = URLPatternFilter(
    patterns=[
        "*/admin/*",
        "*/login/*",
        "*/private/*",
        "**/.*",                   # Hidden files
        "*.{jpg,png,gif,css,js}$"  # Media and assets
    ],
    reverse=True  # Exclude matching patterns
)

# Content type with extension mapping
detailed_content_filter = ContentTypeFilter(
    allowed_types=["text", "application"],
    check_extension=True,
    ext_map={
        "html": "text/html",
        "htm": "text/html",
        "md": "text/markdown",
        "pdf": "application/pdf",
        "doc": "application/msword",
        "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    }
)
```
### Performance-Optimized Filtering

```python
# High-performance filter chain for large-scale crawling
class OptimizedFilterChain:
    def __init__(self):
        # Fast filters first (domain, patterns)
        self.fast_filters = [
            DomainFilter(
                allowed_domains=["example.com", "docs.example.com"],
                blocked_domains=["ads.example.com"]
            ),
            URLPatternFilter([
                "*.html", "*.pdf", "*/blog/*", "*/docs/*"
            ])
        ]

        # Slower filters last (content analysis)
        self.slow_filters = [
            ContentRelevanceFilter(
                query="important content",
                threshold=0.3
            )
        ]

    async def apply_optimized(self, url: str) -> bool:
        # Apply fast filters first
        for filter_obj in self.fast_filters:
            if not filter_obj.apply(url):
                return False

        # Only apply slow filters if fast filters pass
        for filter_obj in self.slow_filters:
            if not await filter_obj.apply(url):
                return False

        return True

# Batch filtering with concurrency
async def batch_filter_urls(urls, filter_chain, max_concurrent=50):
    import asyncio
    semaphore = asyncio.Semaphore(max_concurrent)

    async def filter_single(url):
        async with semaphore:
            return await filter_chain.apply_optimized(url), url

    tasks = [filter_single(url) for url in urls]
    results = await asyncio.gather(*tasks)

    return [url for passed, url in results if passed]

# Usage with 1000 URLs
large_url_list = [f"https://example.com/page{i}.html" for i in range(1000)]
optimized_chain = OptimizedFilterChain()
filtered = await batch_filter_urls(large_url_list, optimized_chain)
```
### Custom Filter Implementation
|
||||
|
||||
```python
from crawl4ai.deep_crawling.filters import URLFilter
import re

class CustomLanguageFilter(URLFilter):
    """Filter URLs by language indicators"""

    def __init__(self, allowed_languages=["en"], weight=1.0):
        super().__init__()
        self.allowed_languages = set(allowed_languages)
        self.lang_patterns = {
            "en": re.compile(r"/en/|/english/|lang=en"),
            "es": re.compile(r"/es/|/spanish/|lang=es"),
            "fr": re.compile(r"/fr/|/french/|lang=fr"),
            "de": re.compile(r"/de/|/german/|lang=de")
        }

    def apply(self, url: str) -> bool:
        # Default to English if no language indicators
        if not any(pattern.search(url) for pattern in self.lang_patterns.values()):
            result = "en" in self.allowed_languages
            self._update_stats(result)
            return result

        # Check for allowed languages
        for lang in self.allowed_languages:
            if lang in self.lang_patterns:
                if self.lang_patterns[lang].search(url):
                    self._update_stats(True)
                    return True

        self._update_stats(False)
        return False

# Custom scorer implementation
from crawl4ai.deep_crawling.scorers import URLScorer

class CustomComplexityScorer(URLScorer):
    """Score URLs by content complexity indicators"""

    def __init__(self, weight=1.0):
        super().__init__(weight)
        self.complexity_indicators = {
            "tutorial": 0.9,
            "guide": 0.8,
            "example": 0.7,
            "reference": 0.6,
            "api": 0.5
        }

    def _calculate_score(self, url: str) -> float:
        url_lower = url.lower()
        max_score = 0.0

        for indicator, score in self.complexity_indicators.items():
            if indicator in url_lower:
                max_score = max(max_score, score)

        return max_score

# Use custom filters and scorers
custom_filter = CustomLanguageFilter(allowed_languages=["en", "es"])
custom_scorer = CustomComplexityScorer(weight=1.2)

url = "https://example.com/en/tutorial/advanced-guide.html"
passes_filter = custom_filter.apply(url)
complexity_score = custom_scorer.score(url)

print(f"Passes language filter: {passes_filter}")
print(f"Complexity score: {complexity_score}")
```

### Integration with Deep Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import DeepCrawlStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain, DomainFilter, URLPatternFilter, ContentTypeFilter, SEOFilter
)
from crawl4ai.deep_crawling.scorers import (
    CompositeScorer, KeywordRelevanceScorer, FreshnessScorer, PathDepthScorer
)

async def deep_crawl_with_filtering():
    # Create comprehensive filter chain
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["python.org"]),
        URLPatternFilter(["*/tutorial/*", "*/guide/*", "*/docs/*"]),
        ContentTypeFilter(["text/html"]),
        SEOFilter(threshold=0.6, keywords=["python", "programming"])
    ])

    # Create composite scorer
    scorer = CompositeScorer([
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        FreshnessScorer(weight=0.8),
        PathDepthScorer(optimal_depth=3, weight=1.0)
    ], normalize=True)

    # Configure deep crawl strategy with filters and scorers
    deep_strategy = DeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        url_filter=filter_chain,
        url_scorer=scorer,
        score_threshold=0.6  # Only crawl URLs scoring above 0.6
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=deep_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://python.org",
            config=config
        )

        print(f"Deep crawl completed: {result.success}")
        if hasattr(result, 'deep_crawl_results'):
            print(f"Pages crawled: {len(result.deep_crawl_results)}")

# Run the deep crawl
await deep_crawl_with_filtering()
```

**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Custom Filter Development](https://docs.crawl4ai.com/advanced/custom-filters/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/)

docs/md_v2/assets/llm.txt/txt/deep_crawling.txt

## Deep Crawling

Multi-level website exploration with intelligent filtering, scoring, and prioritization strategies.

### Basic Deep Crawl Setup

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Basic breadth-first deep crawling
async def basic_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Initial page + 2 levels
            include_external=False   # Stay within same domain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)

        # Group results by depth
        pages_by_depth = {}
        for result in results:
            depth = result.metadata.get("depth", 0)
            if depth not in pages_by_depth:
                pages_by_depth[depth] = []
            pages_by_depth[depth].append(result.url)

        print(f"Crawled {len(results)} pages total")
        for depth, urls in sorted(pages_by_depth.items()):
            print(f"Depth {depth}: {len(urls)} pages")
```

### Deep Crawl Strategies

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Breadth-First Search - explores all links at one depth before going deeper
bfs_strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,        # Limit total pages
    score_threshold=0.3  # Minimum score for URLs
)

# Depth-First Search - explores as deep as possible before backtracking
dfs_strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=30,
    score_threshold=0.5
)

# Best-First - prioritizes highest scoring pages (recommended)
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

best_first_strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=keyword_scorer,
    max_pages=25  # No score_threshold needed - naturally prioritizes
)

# Usage
config = CrawlerRunConfig(
    deep_crawl_strategy=best_first_strategy,  # Choose your strategy
    scraping_strategy=LXMLWebScrapingStrategy()
)
```

### Streaming vs Batch Processing

```python
# Batch mode - wait for all results
async def batch_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=False  # Default - collect all results first
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        # Process all results at once
        for result in results:
            print(f"Batch processed: {result.url}")

# Streaming mode - process results as they arrive
async def streaming_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True  # Process results immediately
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            depth = result.metadata.get("depth", 0)
            print(f"Stream processed depth {depth}: {result.url}")
```

### Filtering with Filter Chains

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter,
    SEOFilter,
    ContentRelevanceFilter
)

# Single URL pattern filter
url_filter = URLPatternFilter(patterns=["*core*", "*guide*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

# Multiple filters in chain
advanced_filter_chain = FilterChain([
    # Domain filtering
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com", "staging.example.com"]
    ),

    # URL pattern matching
    URLPatternFilter(patterns=["*tutorial*", "*guide*", "*blog*"]),

    # Content type filtering
    ContentTypeFilter(allowed_types=["text/html"]),

    # SEO quality filter
    SEOFilter(
        threshold=0.5,
        keywords=["tutorial", "guide", "documentation"]
    ),

    # Content relevance filter
    ContentRelevanceFilter(
        query="Web crawling and data extraction with Python",
        threshold=0.7
    )
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=advanced_filter_chain
    )
)
```

### Intelligent Crawling with Scorers

```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Keyword relevance scoring
async def scored_deep_crawl():
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["browser", "crawler", "web", "automation"],
        weight=1.0
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            url_scorer=keyword_scorer
        ),
        stream=True,  # Recommended with BestFirst
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
```

### Limiting Crawl Size

```python
# Max pages limitation across strategies
async def limited_crawls():
    # BFS with page limit
    bfs_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,  # Only crawl 5 pages total
            url_scorer=KeywordRelevanceScorer(keywords=["browser", "crawler"], weight=1.0)
        )
    )

    # DFS with score threshold
    dfs_config = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=2,
            score_threshold=0.7,  # Only URLs with scores above 0.7
            max_pages=10,
            url_scorer=KeywordRelevanceScorer(keywords=["web", "automation"], weight=1.0)
        )
    )

    # Best-First with both constraints
    bf_config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=7,  # Automatically gets highest scored pages
            url_scorer=KeywordRelevanceScorer(keywords=["crawl", "example"], weight=1.0)
        ),
        stream=True
    )

    async with AsyncWebCrawler() as crawler:
        # Use any of the configs
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=bf_config):
            score = result.metadata.get("score", 0)
            print(f"Score: {score:.2f} | {result.url}")
```

### Complete Advanced Deep Crawler

```python
import time

async def comprehensive_deep_crawl():
    # Sophisticated filter chain
    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"]
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*", "*blog*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
        SEOFilter(threshold=0.4, keywords=["crawl", "tutorial", "guide"])
    ])

    # Multi-keyword scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration", "browser"],
        weight=0.8
    )

    # Complete configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer,
            max_pages=20
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True,
        cache_mode=CacheMode.BYPASS
    )

    # Execute and analyze
    results = []
    start_time = time.time()

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"→ Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Performance analysis
    duration = time.time() - start_time
    avg_score = sum(r.metadata.get('score', 0) for r in results) / max(len(results), 1)

    print(f"✅ Crawled {len(results)} pages in {duration:.2f}s")
    print(f"✅ Average relevance score: {avg_score:.2f}")

    # Depth distribution
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    for depth, count in sorted(depth_counts.items()):
        print(f"📊 Depth {depth}: {count} pages")
```

### Error Handling and Robustness

```python
async def robust_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=15,
            url_scorer=KeywordRelevanceScorer(keywords=["guide", "tutorial"])
        ),
        stream=True,
        page_timeout=30000  # 30 second timeout per page
    )

    successful_pages = []
    failed_pages = []

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            if result.success:
                successful_pages.append(result)
                depth = result.metadata.get("depth", 0)
                score = result.metadata.get("score", 0)
                print(f"✅ Depth {depth} | Score: {score:.2f} | {result.url}")
            else:
                failed_pages.append({
                    'url': result.url,
                    'error': result.error_message,
                    'depth': result.metadata.get("depth", 0)
                })
                print(f"❌ Failed: {result.url} - {result.error_message}")

    print(f"📊 Results: {len(successful_pages)} successful, {len(failed_pages)} failed")

    # Analyze failures by depth
    if failed_pages:
        failure_by_depth = {}
        for failure in failed_pages:
            depth = failure['depth']
            failure_by_depth[depth] = failure_by_depth.get(depth, 0) + 1

        print("❌ Failures by depth:")
        for depth, count in sorted(failure_by_depth.items()):
            print(f"   Depth {depth}: {count} failures")
```

**📖 Learn more:** [Deep Crawling Guide](https://docs.crawl4ai.com/core/deep-crawling/), [Filter Documentation](https://docs.crawl4ai.com/core/content-selection/), [Scoring Strategies](https://docs.crawl4ai.com/advanced/advanced-features/)

docs/md_v2/assets/llm.txt/txt/docker.txt

## Docker Deployment

Complete Docker deployment guide with pre-built images, API endpoints, configuration, and MCP integration.

### Quick Start with Pre-built Images

```bash
# Pull latest image
docker pull unclecode/crawl4ai:latest

# Set up LLM API keys
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
GROQ_API_KEY=your-groq-key
GEMINI_API_TOKEN=your-gemini-token
EOL

# Run with LLM support
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Basic run (no LLM)
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Check health
curl http://localhost:11235/health
```

### Docker Compose Deployment

```bash
# Clone and setup
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d

# Build locally
docker compose up --build -d

# Build with all features
INSTALL_TYPE=all docker compose up --build -d

# Build with GPU support
ENABLE_GPU=true docker compose up --build -d

# Stop service
docker compose down
```

### Manual Build with Multi-Architecture

```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build for current architecture
docker buildx build -t crawl4ai-local:latest --load .

# Build for multiple architectures
docker buildx build --platform linux/amd64,linux/arm64 \
  -t crawl4ai-local:latest --load .

# Build with specific features
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

# Run custom build
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

### Build Arguments

```bash
# Available build options:
#   INSTALL_TYPE    default|all|torch|transformer
#   ENABLE_GPU      true|false
#   APP_HOME        Install path
#   USE_LOCAL       Use local source
#   GITHUB_REPO     Git repo if USE_LOCAL=false
#   GITHUB_BRANCH   Git branch
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=true \
  --build-arg APP_HOME=/app \
  --build-arg USE_LOCAL=true \
  --build-arg GITHUB_REPO=url \
  --build-arg GITHUB_BRANCH=main \
  -t crawl4ai-custom:latest --load .
```

### Core API Endpoints

```python
# Main crawling endpoints
import requests
import json

# Basic crawl
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
response = requests.post("http://localhost:11235/crawl", json=payload)

# Streaming crawl
payload["crawler_config"]["params"]["stream"] = True
response = requests.post("http://localhost:11235/crawl/stream", json=payload)

# Health check
response = requests.get("http://localhost:11235/health")

# API schema
response = requests.get("http://localhost:11235/schema")

# Metrics (Prometheus format)
response = requests.get("http://localhost:11235/metrics")
```

### Specialized Endpoints

```python
# HTML extraction (preprocessed for schema)
response = requests.post("http://localhost:11235/html",
                         json={"url": "https://example.com"})

# Screenshot capture
response = requests.post("http://localhost:11235/screenshot", json={
    "url": "https://example.com",
    "screenshot_wait_for": 2,
    "output_path": "/path/to/save/screenshot.png"
})

# PDF generation
response = requests.post("http://localhost:11235/pdf", json={
    "url": "https://example.com",
    "output_path": "/path/to/save/document.pdf"
})

# JavaScript execution
response = requests.post("http://localhost:11235/execute_js", json={
    "url": "https://example.com",
    "scripts": [
        "return document.title",
        "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
    ]
})

# Markdown generation
response = requests.post("http://localhost:11235/md", json={
    "url": "https://example.com",
    "f": "fit",                   # raw|fit|bm25|llm
    "q": "extract main content",  # query for filtering
    "c": "0"                      # cache: 0=bypass, 1=use
})

# LLM Q&A (let requests URL-encode the query string)
response = requests.get("http://localhost:11235/llm/https://example.com",
                        params={"q": "What is this page about?"})

# Library context (for AI assistants)
response = requests.get("http://localhost:11235/ask", params={
    "context_type": "all",  # code|doc|all
    "query": "how to use extraction strategies",
    "score_ratio": 0.5,
    "max_results": 20
})
```

### Python SDK Usage

```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        # Non-streaming crawl
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )

        for result in results:
            print(f"URL: {result.url}, Success: {result.success}")
            print(f"Content length: {len(result.markdown)}")

        # Streaming crawl
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        async for result in await client.crawl(
            ["https://example.com", "https://python.org"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=stream_config
        ):
            print(f"Streamed: {result.url} - {result.success}")

        # Get API schema
        schema = await client.get_schema()
        print(f"Schema available: {bool(schema)}")

asyncio.run(main())
```

### Advanced API Configuration

```python
# Complex extraction with LLM
payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": True,
            "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
        }
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "LLMExtractionStrategy",
                "params": {
                    "llm_config": {
                        "type": "LLMConfig",
                        "params": {
                            "provider": "openai/gpt-4o-mini",
                            "api_token": "env:OPENAI_API_KEY"
                        }
                    },
                    "schema": {
                        "type": "dict",
                        "value": {
                            "type": "object",
                            "properties": {
                                "title": {"type": "string"},
                                "content": {"type": "string"}
                            }
                        }
                    },
                    "instruction": "Extract title and main content"
                }
            },
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {"threshold": 0.6}
                    }
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
```

### CSS Extraction Strategy

```python
# CSS-based structured extraction
schema = {
    "name": "ProductList",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

payload = {
    "urls": ["https://example-shop.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {"type": "dict", "value": schema}
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
data = response.json()
extracted = json.loads(data["results"][0]["extracted_content"])
```

### MCP (Model Context Protocol) Integration

```bash
# Add Crawl4AI as MCP provider to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List MCP providers
claude mcp list

# Test MCP connection
python tests/mcp/test_mcp_socket.py

# Available MCP endpoints
# SSE:       http://localhost:11235/mcp/sse
# WebSocket: ws://localhost:11235/mcp/ws
# Schema:    http://localhost:11235/mcp/schema
```

Available MCP tools:

- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query Crawl4AI library context

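Because the MCP endpoint layout is uniform, a client can derive all three URLs from the server's base URL. A minimal sketch, assuming a default local deployment; the `mcp_endpoints` helper is illustrative and not part of the library, only the `/mcp/*` paths come from the listing above:

```python
# Sketch: derive the documented MCP endpoint URLs from a base server URL.
# Only the /mcp/* paths are documented; this helper is a hypothetical convenience.
def mcp_endpoints(base_url: str) -> dict:
    host_part = base_url.split("://", 1)[1]  # e.g. "localhost:11235"
    return {
        "sse": f"{base_url}/mcp/sse",        # Server-Sent Events transport
        "ws": f"ws://{host_part}/mcp/ws",    # WebSocket transport
        "schema": f"{base_url}/mcp/schema",  # Tool schema listing
    }

print(mcp_endpoints("http://localhost:11235"))
```

Point the builder at whatever host/port your container exposes; the relative paths stay the same.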
### Configuration Management

```yaml
# config.yml structure
app:
  title: "Crawl4AI API"
  version: "1.0.0"
  host: "0.0.0.0"
  port: 11235
  timeout_keep_alive: 300

llm:
  provider: "openai/gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

security:
  enabled: false
  jwt_enabled: false
  trusted_hosts: ["*"]

crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0]
  timeouts:
    stream_init: 30.0
    batch_process: 300.0
  pool:
    max_pages: 40
    idle_ttl_sec: 1800

rate_limiting:
  enabled: true
  default_limit: "1000/minute"
  storage_uri: "memory://"

logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```

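A deployment usually overrides only a few of these keys while keeping the rest of the defaults. A minimal sketch of such a recursive merge, assuming the key names shown above; the `deep_merge` helper and the inline dict literals are illustrative (in practice both sides would be loaded from YAML files), not part of crawl4ai:

```python
# Sketch: recursively merge a partial config override onto documented defaults.
# deep_merge and the dict values below are illustrative, not crawl4ai APIs.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value  # scalars and lists replace wholesale
    return merged

defaults = {
    "app": {"port": 11235, "timeout_keep_alive": 300},
    "crawler": {"pool": {"max_pages": 40, "idle_ttl_sec": 1800}},
}
override = {"crawler": {"pool": {"max_pages": 80}}}  # raise only the page pool limit

config = deep_merge(defaults, override)
print(config["crawler"]["pool"])
```

Merging this way keeps sibling keys such as `idle_ttl_sec` intact instead of replacing whole sections.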
### Custom Configuration Deployment

```bash
# Method 1: Mount custom config
docker run -d -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  -v $(pwd)/my-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest

# Method 2: Build with custom config
# Edit deploy/docker/config.yml then build
docker buildx build -t crawl4ai-custom:latest --load .
```

### Monitoring and Health Checks

```bash
# Health endpoint
curl http://localhost:11235/health

# Prometheus metrics
curl http://localhost:11235/metrics

# Configuration validation
curl -X POST http://localhost:11235/config/dump \
  -H "Content-Type: application/json" \
  -d '{"code": "CrawlerRunConfig(cache_mode=\"BYPASS\", screenshot=True)"}'
```

### Playground Interface

Access the interactive playground at `http://localhost:11235/playground` for:

- Testing configurations with a visual interface
- Generating JSON payloads for the REST API
- Converting Python config to JSON format
- Testing crawl operations directly in the browser

### Async Job Processing

```python
# Submit job for async processing
import time

# Submit crawl job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()["task_id"]

# Poll for completion
while True:
    result = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
    status = result.json()

    if status["status"] in ["COMPLETED", "FAILED"]:
        break
    time.sleep(1.5)

print("Final result:", status)
```

### Production Deployment

```bash
# Production-ready deployment
docker run -d \
  --name crawl4ai-prod \
  --restart unless-stopped \
  -p 11235:11235 \
  --env-file .llm.env \
  --shm-size=2g \
  --memory=8g \
  --cpus=4 \
  -v /path/to/custom-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest
```

With Docker Compose for production:

```yaml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    ports:
      - "11235:11235"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./config.yml:/app/config.yml
    shm_size: 2g
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    restart: unless-stopped
```

### Configuration Validation and JSON Structure

```python
# Method 1: Create config objects and dump to see expected JSON structure
from crawl4ai import BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy, LLMExtractionStrategy
import json

# Create browser config and see JSON structure
browser_config = BrowserConfig(
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080"
)

# Get JSON structure
browser_json = browser_config.dump()
print("BrowserConfig JSON structure:")
print(json.dumps(browser_json, indent=2))

# Create crawler config with extraction strategy
schema = {
    "name": "Articles",
    "baseSelector": ".article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "content", "selector": ".content", "type": "html"}
    ]
}

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.loaded"
)

crawler_json = crawler_config.dump()
print("\nCrawlerRunConfig JSON structure:")
print(json.dumps(crawler_json, indent=2))
```

### Reverse Validation - JSON to Objects

```python
# Method 2: Load JSON back to config objects for validation
from crawl4ai.async_configs import from_serializable_dict

# Test JSON structure by converting back to objects
test_browser_json = {
    "type": "BrowserConfig",
    "params": {
        "headless": True,
        "viewport_width": 1280,
        "proxy": "http://user:pass@proxy:8080"
    }
}

try:
    # Convert JSON back to object
    restored_browser = from_serializable_dict(test_browser_json)
    print(f"✅ Valid BrowserConfig: {type(restored_browser)}")
    print(f"Headless: {restored_browser.headless}")
    print(f"Proxy: {restored_browser.proxy}")
except Exception as e:
    print(f"❌ Invalid BrowserConfig JSON: {e}")

# Test complex crawler config JSON
test_crawler_json = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",
        "screenshot": True,
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",
                    "value": {
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h3", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

try:
    restored_crawler = from_serializable_dict(test_crawler_json)
    print(f"✅ Valid CrawlerRunConfig: {type(restored_crawler)}")
    print(f"Cache mode: {restored_crawler.cache_mode}")
    print(f"Has extraction strategy: {restored_crawler.extraction_strategy is not None}")
except Exception as e:
    print(f"❌ Invalid CrawlerRunConfig JSON: {e}")
```

### Using Server's /config/dump Endpoint for Validation

```python
import requests

# Method 3: Use server endpoint to validate configuration syntax
def validate_config_with_server(config_code: str) -> dict:
    """Validate configuration using server's /config/dump endpoint"""
    response = requests.post(
        "http://localhost:11235/config/dump",
        json={"code": config_code}
    )

    if response.status_code == 200:
        print("✅ Valid configuration syntax")
        return response.json()
    else:
        print(f"❌ Invalid configuration: {response.status_code}")
        print(response.json())
        return None

# Test valid configuration
valid_config = """
CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.content-loaded"
)
"""

result = validate_config_with_server(valid_config)
if result:
    print("Generated JSON structure:")
    print(json.dumps(result, indent=2))

# Test invalid configuration (should fail)
invalid_config = """
CrawlerRunConfig(
    cache_mode="invalid_mode",
    screenshot=True,
    js_code=some_function()  # This will fail
)
"""

validate_config_with_server(invalid_config)
```

### Configuration Builder Helper

```python
def build_and_validate_request(urls, browser_params=None, crawler_params=None):
    """Helper to build and validate complete request payload"""

    # Create configurations
    browser_config = BrowserConfig(**(browser_params or {}))
    crawler_config = CrawlerRunConfig(**(crawler_params or {}))

    # Build complete request payload
    payload = {
        "urls": urls if isinstance(urls, list) else [urls],
        "browser_config": browser_config.dump(),
        "crawler_config": crawler_config.dump()
    }

    print("✅ Complete request payload:")
    print(json.dumps(payload, indent=2))

    # Validate by attempting to reconstruct
    try:
        test_browser = from_serializable_dict(payload["browser_config"])
        test_crawler = from_serializable_dict(payload["crawler_config"])
        print("✅ Payload validation successful")
        return payload
    except Exception as e:
        print(f"❌ Payload validation failed: {e}")
        return None

# Example usage
payload = build_and_validate_request(
    urls=["https://example.com"],
    browser_params={"headless": True, "viewport_width": 1280},
    crawler_params={
        "cache_mode": CacheMode.BYPASS,
        "screenshot": True,
        "word_count_threshold": 10
    }
)

if payload:
    # Send to server
    response = requests.post("http://localhost:11235/crawl", json=payload)
    print(f"Server response: {response.status_code}")
```

### Common JSON Structure Patterns

```python
# Pattern 1: Simple primitive values
simple_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",  # String enum value
        "screenshot": True,      # Boolean
        "page_timeout": 60000    # Integer
    }
}

# Pattern 2: Nested objects
nested_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "LLMExtractionStrategy",
            "params": {
                "llm_config": {
                    "type": "LLMConfig",
                    "params": {
                        "provider": "openai/gpt-4o-mini",
                        "api_token": "env:OPENAI_API_KEY"
                    }
                },
                "instruction": "Extract main content"
            }
        }
    }
}

# Pattern 3: Dictionary values (must use type: dict wrapper)
dict_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",  # Required wrapper
                    "value": {       # Actual dictionary content
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h2", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

# Pattern 4: Lists and arrays
list_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "js_code": [  # Lists are handled directly
            "window.scrollTo(0, document.body.scrollHeight);",
            "document.querySelector('.load-more')?.click();"
        ],
        "excluded_tags": ["script", "style", "nav"]
    }
}
```

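Because every config object uses the same `{"type": ..., "params": ...}` envelope, the patterns above can be produced mechanically instead of typed by hand. A minimal sketch; the `wrap` and `wrap_dict` helpers are illustrative and not part of the Crawl4AI API:

```python
def wrap(type_name: str, **params) -> dict:
    """Build the {"type": ..., "params": ...} envelope the server expects."""
    return {"type": type_name, "params": params}

def wrap_dict(value: dict) -> dict:
    """Plain dictionaries must be wrapped as {"type": "dict", "value": ...}."""
    return {"type": "dict", "value": value}

# Rebuilds Pattern 3 from above programmatically
config = wrap(
    "CrawlerRunConfig",
    cache_mode="bypass",
    extraction_strategy=wrap(
        "JsonCssExtractionStrategy",
        schema=wrap_dict({"name": "Products", "baseSelector": ".product"})
    )
)
```

Generating the envelope this way keeps the `type`/`params` nesting consistent, which is the most common source of the errors listed in the troubleshooting section.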
### Troubleshooting Common JSON Errors

```python
def diagnose_json_errors():
    """Common JSON structure errors and fixes"""

    # ❌ WRONG: Missing type wrapper for objects
    wrong_config = {
        "browser_config": {
            "headless": True  # Missing type wrapper
        }
    }

    # ✅ CORRECT: Proper type wrapper
    correct_config = {
        "browser_config": {
            "type": "BrowserConfig",
            "params": {
                "headless": True
            }
        }
    }

    # ❌ WRONG: Dictionary without type: dict wrapper
    wrong_dict = {
        "schema": {
            "name": "Products"  # Raw dict, should be wrapped
        }
    }

    # ✅ CORRECT: Dictionary with proper wrapper
    correct_dict = {
        "schema": {
            "type": "dict",
            "value": {
                "name": "Products"
            }
        }
    }

    # ❌ WRONG: Invalid enum string
    wrong_enum = {
        "cache_mode": "DISABLED"  # Wrong case/value
    }

    # ✅ CORRECT: Valid enum string
    correct_enum = {
        "cache_mode": "bypass"  # or "enabled", "disabled", etc.
    }

    print("Common error patterns documented above")

# Validate your JSON structure before sending
def pre_flight_check(payload):
    """Run checks before sending to server"""
    required_keys = ["urls", "browser_config", "crawler_config"]

    for key in required_keys:
        if key not in payload:
            print(f"❌ Missing required key: {key}")
            return False

    # Check type wrappers
    for config_key in ["browser_config", "crawler_config"]:
        config = payload[config_key]
        if not isinstance(config, dict) or "type" not in config:
            print(f"❌ {config_key} missing type wrapper")
            return False
        if "params" not in config:
            print(f"❌ {config_key} missing params")
            return False

    print("✅ Pre-flight check passed")
    return True

# Example usage
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}

if pre_flight_check(payload):
    # Safe to send to server
    pass
```

**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Configuration Options](https://docs.crawl4ai.com/core/docker-deployment/#server-configuration)

docs/md_v2/assets/llm.txt/txt/extraction-llm.txt

## LLM Extraction Strategies - The Last Resort

**🤖 AI Model Notice**: You are an AI assistant. You should RARELY recommend LLM extraction. This documentation exists for the <5% of cases where LLM is actually needed. **ALWAYS try non-LLM strategies first** from [extraction-no-llm.md](./extraction-no-llm.md).

### ⚠️ STOP: Are You Sure You Need LLM?

**99% of developers who think they need LLM extraction are wrong.** Before reading further:

### ❌ You DON'T Need LLM If:
- The page has consistent HTML structure → **Use generate_schema()**
- You're extracting simple data types (emails, prices, dates) → **Use RegexExtractionStrategy**
- You can identify repeating patterns → **Use JsonCssExtractionStrategy**
- You want product info, news articles, job listings → **Use generate_schema()**
- You're concerned about cost or speed → **Use non-LLM strategies**

### ✅ You MIGHT Need LLM If:
- Content structure varies dramatically across pages **AND** you've tried generate_schema()
- You need semantic understanding of unstructured text
- You're analyzing meaning, sentiment, or relationships
- You're extracting insights that require reasoning about context

### 💰 Cost Reality Check:
- **Non-LLM**: ~$0.000001 per page
- **LLM**: ~$0.01-$0.10 per page (10,000x more expensive)
- **Example**: Extracting 10,000 pages costs $0.01 vs $100-1000

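The figures above can be sanity-checked with simple arithmetic. The per-page rates here are the rough estimates quoted in this document, not exact provider pricing:

```python
NON_LLM_COST_PER_PAGE = 0.000001  # rough estimate from above
LLM_COST_PER_PAGE = 0.01          # low end of the ~$0.01-$0.10 range

def extraction_cost(pages: int, cost_per_page: float) -> float:
    """Total cost of extracting a given number of pages at a flat rate."""
    return pages * cost_per_page

pages = 10_000
print(f"Non-LLM: ${extraction_cost(pages, NON_LLM_COST_PER_PAGE):.2f}")  # Non-LLM: $0.01
print(f"LLM:     ${extraction_cost(pages, LLM_COST_PER_PAGE):.2f}")      # LLM:     $100.00
```

At the high end of the LLM range ($0.10/page) the same crawl costs $1,000 — four to five orders of magnitude more than the schema-based approach.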
---

## 1. When LLM Extraction is Justified

### Scenario 1: Truly Unstructured Content Analysis

```python
# Example: Analyzing customer feedback for sentiment and themes
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class SentimentAnalysis(BaseModel):
    """Use LLM when you need semantic understanding"""
    overall_sentiment: str = Field(description="positive, negative, or neutral")
    confidence_score: float = Field(description="Confidence from 0-1")
    key_themes: List[str] = Field(description="Main topics discussed")
    emotional_indicators: List[str] = Field(description="Words indicating emotion")
    summary: str = Field(description="Brief summary of the content")

llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # Use cheapest model
    api_token="env:OPENAI_API_KEY",
    temperature=0.1,  # Low temperature for consistency
    max_tokens=1000
)

sentiment_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=SentimentAnalysis.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze the emotional content and themes in this text.
    Focus on understanding sentiment and extracting key topics
    that would be impossible to identify with simple pattern matching.
    """,
    apply_chunking=True,
    chunk_token_threshold=1500
)

async def analyze_sentiment():
    config = CrawlerRunConfig(
        extraction_strategy=sentiment_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/customer-reviews",
            config=config
        )

        if result.success:
            analysis = json.loads(result.extracted_content)
            print(f"Sentiment: {analysis['overall_sentiment']}")
            print(f"Themes: {analysis['key_themes']}")

asyncio.run(analyze_sentiment())
```

### Scenario 2: Complex Knowledge Extraction

```python
# Example: Building knowledge graphs from unstructured content
class Entity(BaseModel):
    name: str = Field(description="Entity name")
    type: str = Field(description="person, organization, location, concept")
    description: str = Field(description="Brief description")

class Relationship(BaseModel):
    source: str = Field(description="Source entity")
    target: str = Field(description="Target entity")
    relationship: str = Field(description="Type of relationship")
    confidence: float = Field(description="Confidence score 0-1")

class KnowledgeGraph(BaseModel):
    entities: List[Entity] = Field(description="All entities found")
    relationships: List[Relationship] = Field(description="Relationships between entities")
    main_topic: str = Field(description="Primary topic of the content")

knowledge_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",  # Better for complex reasoning
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Extract entities and their relationships from the content.
    Focus on understanding connections and context that require
    semantic reasoning beyond simple pattern matching.
    """,
    input_format="html",  # Preserve structure
    apply_chunking=True
)
```

### Scenario 3: Content Summarization and Insights

```python
# Example: Research paper analysis
class ResearchInsights(BaseModel):
    title: str = Field(description="Paper title")
    abstract_summary: str = Field(description="Summary of abstract")
    key_findings: List[str] = Field(description="Main research findings")
    methodology: str = Field(description="Research methodology used")
    limitations: List[str] = Field(description="Study limitations")
    practical_applications: List[str] = Field(description="Real-world applications")
    citations_count: int = Field(description="Number of citations", default=0)

research_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",  # Use powerful model for complex analysis
        api_token="env:OPENAI_API_KEY",
        temperature=0.2,
        max_tokens=2000
    ),
    schema=ResearchInsights.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Analyze this research paper and extract key insights.
    Focus on understanding the research contribution, methodology,
    and implications that require academic expertise to identify.
    """,
    apply_chunking=True,
    chunk_token_threshold=2000,
    overlap_rate=0.15  # More overlap for academic content
)
```

---

## 2. LLM Configuration Best Practices

### Cost Optimization

```python
# Use cheapest models when possible
cheap_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # 60x cheaper than GPT-4
    api_token="env:OPENAI_API_KEY",
    temperature=0.0,  # Deterministic output
    max_tokens=800    # Limit output length
)

# Use local models for development
local_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # No API costs
    base_url="http://localhost:11434",
    temperature=0.1
)

# Use powerful models only when necessary
powerful_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000,
    temperature=0.1
)
```

### Provider Selection Guide

```python
providers_guide = {
    "openai/gpt-4o-mini": {
        "best_for": "Simple extraction, cost-sensitive projects",
        "cost": "Very low",
        "speed": "Fast",
        "accuracy": "Good"
    },
    "openai/gpt-4o": {
        "best_for": "Complex reasoning, high accuracy needs",
        "cost": "High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "anthropic/claude-3-5-sonnet": {
        "best_for": "Complex analysis, long documents",
        "cost": "Medium-High",
        "speed": "Medium",
        "accuracy": "Excellent"
    },
    "ollama/llama3.3": {
        "best_for": "Development, no API costs",
        "cost": "Free (self-hosted)",
        "speed": "Variable",
        "accuracy": "Good"
    },
    "groq/llama3-70b-8192": {
        "best_for": "Fast inference, open source",
        "cost": "Low",
        "speed": "Very fast",
        "accuracy": "Good"
    }
}

def choose_provider(complexity, budget, speed_requirement):
    """Choose optimal provider based on requirements"""
    if budget == "minimal":
        return "ollama/llama3.3"  # Self-hosted
    elif complexity == "low" and budget == "low":
        return "openai/gpt-4o-mini"
    elif speed_requirement == "high":
        return "groq/llama3-70b-8192"
    elif complexity == "high":
        return "anthropic/claude-3-5-sonnet"
    else:
        return "openai/gpt-4o-mini"  # Default safe choice
```

---

## 3. Advanced LLM Extraction Patterns

### Block-Based Extraction (Unstructured Content)

```python
# When structure is too varied for schemas
block_strategy = LLMExtractionStrategy(
    llm_config=cheap_config,
    extraction_type="block",  # Extract free-form content blocks
    instruction="""
    Extract meaningful content blocks from this page.
    Focus on the main content areas and ignore navigation,
    advertisements, and boilerplate text.
    """,
    apply_chunking=True,
    chunk_token_threshold=1200,
    input_format="fit_markdown"  # Use cleaned content
)

async def extract_content_blocks():
    config = CrawlerRunConfig(
        extraction_strategy=block_strategy,
        word_count_threshold=50,  # Filter short content
        excluded_tags=['nav', 'footer', 'aside', 'advertisement']
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        if result.success:
            blocks = json.loads(result.extracted_content)
            for block in blocks:
                print(f"Block: {block['content'][:100]}...")
```

### Chunked Processing for Large Content

```python
# Handle large documents efficiently
large_content_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="env:OPENAI_API_KEY"
    ),
    schema=YourModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract structured data from this content section...",

    # Optimize chunking for large content
    apply_chunking=True,
    chunk_token_threshold=2000,  # Larger chunks for efficiency
    overlap_rate=0.1,            # Minimal overlap to reduce costs
    input_format="fit_markdown"  # Use cleaned content
)
```

### Multi-Model Validation

```python
# Use multiple models for critical extractions
async def multi_model_extraction():
    """Use multiple LLMs for validation of critical data"""

    models = [
        LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        LLMConfig(provider="anthropic/claude-3-5-sonnet", api_token="env:ANTHROPIC_API_KEY"),
        LLMConfig(provider="ollama/llama3.3", api_token=None)
    ]

    results = []

    for i, llm_config in enumerate(models):
        strategy = LLMExtractionStrategy(
            llm_config=llm_config,
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data consistently..."
        )

        config = CrawlerRunConfig(extraction_strategy=strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)
            if result.success:
                data = json.loads(result.extracted_content)
                results.append(data)
                print(f"Model {i+1} extracted {len(data)} items")

    # Compare results for consistency
    if len(set(str(r) for r in results)) == 1:
        print("✅ All models agree")
        return results[0]
    else:
        print("⚠️ Models disagree - manual review needed")
        return results

# Use for critical business data only
critical_result = await multi_model_extraction()
```

---

## 4. Hybrid Approaches - Best of Both Worlds

### Fast Pre-filtering + LLM Analysis

```python
async def hybrid_extraction():
    """
    1. Use fast non-LLM strategies for basic extraction
    2. Use LLM only for complex analysis of filtered content
    """

    # Step 1: Fast extraction of structured data
    basic_schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h1, h2", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"},
            {"name": "author", "selector": ".author", "type": "text"}
        ]
    }

    basic_strategy = JsonCssExtractionStrategy(basic_schema)
    basic_config = CrawlerRunConfig(extraction_strategy=basic_strategy)

    # Step 2: LLM analysis only on filtered content
    analysis_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze sentiment and extract key topics from this article"
    )

    async with AsyncWebCrawler() as crawler:
        # Fast extraction first
        basic_result = await crawler.arun(
            url="https://example.com/articles",
            config=basic_config
        )

        articles = json.loads(basic_result.extracted_content)

        # LLM analysis only on important articles
        analyzed_articles = []
        for article in articles[:5]:  # Limit to reduce costs
            if len(article.get('content', '')) > 500:  # Only analyze substantial content
                analysis_config = CrawlerRunConfig(extraction_strategy=analysis_strategy)

                # Analyze individual article content
                raw_url = f"raw://{article['content']}"
                analysis_result = await crawler.arun(url=raw_url, config=analysis_config)

                if analysis_result.success:
                    analysis = json.loads(analysis_result.extracted_content)
                    article.update(analysis)

            analyzed_articles.append(article)

    return analyzed_articles

# Hybrid approach: fast + smart
result = await hybrid_extraction()
```

### Schema Generation + LLM Fallback

```python
from pathlib import Path

async def smart_fallback_extraction():
    """
    1. Try generate_schema() first (one-time LLM cost)
    2. Use generated schema for fast extraction
    3. Use LLM only if schema extraction fails
    """

    cache_file = Path("./schemas/fallback_schema.json")

    # Try cached schema first
    if cache_file.exists():
        schema = json.load(cache_file.open())
        schema_strategy = JsonCssExtractionStrategy(schema)

        config = CrawlerRunConfig(extraction_strategy=schema_strategy)

        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com", config=config)

            if result.success and result.extracted_content:
                data = json.loads(result.extracted_content)
                if data:  # Schema worked
                    print("✅ Schema extraction successful (fast & cheap)")
                    return data

    # Fallback to LLM if schema failed
    print("⚠️ Schema failed, falling back to LLM (slow & expensive)")

    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all meaningful data from this page"
    )

    fallback_config = CrawlerRunConfig(extraction_strategy=llm_strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=fallback_config)

        if result.success:
            print("✅ LLM extraction successful")
            return json.loads(result.extracted_content)

# Intelligent fallback system
result = await smart_fallback_extraction()
```

---

## 5. Cost Management and Monitoring

### Token Usage Tracking

```python
class ExtractionCostTracker:
    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.extractions = 0

    def track_llm_extraction(self, strategy, result):
        """Track costs from LLM extraction"""
        if hasattr(strategy, 'usage_tracker') and strategy.usage_tracker:
            usage = strategy.usage_tracker

            # Estimate costs (approximate rates per 1K tokens)
            cost_per_1k_tokens = {
                "gpt-4o-mini": 0.0015,
                "gpt-4o": 0.03,
                "claude-3-5-sonnet": 0.015,
                "ollama": 0.0  # Self-hosted
            }

            provider = strategy.llm_config.provider.split('/')[1]
            rate = cost_per_1k_tokens.get(provider, 0.01)

            tokens = usage.total_tokens
            cost = (tokens / 1000) * rate

            self.total_cost += cost
            self.total_tokens += tokens
            self.extractions += 1

            print(f"💰 Extraction cost: ${cost:.4f} ({tokens} tokens)")
            print(f"📊 Total cost: ${self.total_cost:.4f} ({self.extractions} extractions)")

    def get_summary(self):
        avg_cost = self.total_cost / max(self.extractions, 1)
        return {
            "total_cost": self.total_cost,
            "total_tokens": self.total_tokens,
            "extractions": self.extractions,
            "avg_cost_per_extraction": avg_cost
        }

# Usage
tracker = ExtractionCostTracker()

async def cost_aware_extraction():
    strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        schema=YourModel.model_json_schema(),
        extraction_type="schema",
        instruction="Extract data...",
        verbose=True  # Enable usage tracking
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        # Track costs
        tracker.track_llm_extraction(strategy, result)

        return result

# Monitor costs across multiple extractions
for url in urls:
    await cost_aware_extraction()

print(f"Final summary: {tracker.get_summary()}")
```

### Budget Controls

```python
class BudgetController:
    def __init__(self, daily_budget=10.0):
        self.daily_budget = daily_budget
        self.current_spend = 0.0
        self.extraction_count = 0

    def can_extract(self, estimated_cost=0.01):
        """Check if extraction is within budget"""
        if self.current_spend + estimated_cost > self.daily_budget:
            print(f"❌ Budget exceeded: ${self.current_spend:.2f} + ${estimated_cost:.2f} > ${self.daily_budget}")
            return False
        return True

    def record_extraction(self, actual_cost):
        """Record actual extraction cost"""
        self.current_spend += actual_cost
        self.extraction_count += 1

        remaining = self.daily_budget - self.current_spend
        print(f"💰 Budget remaining: ${remaining:.2f}")

budget = BudgetController(daily_budget=5.0)  # $5 daily limit

async def budget_controlled_extraction(url):
    if not budget.can_extract():
        print("⏸️ Extraction paused due to budget limit")
        return None

    # Proceed with extraction...
    strategy = LLMExtractionStrategy(llm_config=cheap_config, ...)
    result = await extract_with_strategy(url, strategy)

    # Record actual cost
    actual_cost = calculate_cost(strategy.usage_tracker)
    budget.record_extraction(actual_cost)

    return result

# Safe extraction with budget controls
results = []
for url in urls:
    result = await budget_controlled_extraction(url)
    if result:
        results.append(result)
```

---

## 6. Performance Optimization for LLM Extraction

### Batch Processing

```python
async def batch_llm_extraction():
    """Process multiple pages efficiently"""

    # Collect content first (fast)
    urls = ["https://example.com/page1", "https://example.com/page2"]
    contents = []

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url)
            if result.success:
                contents.append({
                    "url": url,
                    "content": result.fit_markdown[:2000]  # Limit content
                })

        # Process in batches (reduce LLM calls)
        batch_content = "\n\n---PAGE SEPARATOR---\n\n".join([
            f"URL: {c['url']}\n{c['content']}" for c in contents
        ])

        strategy = LLMExtractionStrategy(
            llm_config=cheap_config,
            extraction_type="block",
            instruction="""
            Extract data from multiple pages separated by '---PAGE SEPARATOR---'.
            Return results for each page in order.
            """,
            apply_chunking=True
        )

        # Single LLM call for multiple pages (reuse the same crawler session)
        raw_url = f"raw://{batch_content}"
        result = await crawler.arun(url=raw_url, config=CrawlerRunConfig(extraction_strategy=strategy))

        return json.loads(result.extracted_content)

# Batch processing reduces LLM calls
batch_results = await batch_llm_extraction()
```

### Caching LLM Results
|
||||
|
||||
```python
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
|
||||
class LLMResultCache:
|
||||
def __init__(self, cache_dir="./llm_cache"):
|
||||
self.cache_dir = Path(cache_dir)
|
||||
self.cache_dir.mkdir(exist_ok=True)
|
||||
|
||||
def get_cache_key(self, url, instruction, schema):
|
||||
"""Generate cache key from extraction parameters"""
|
||||
content = f"{url}:{instruction}:{str(schema)}"
|
||||
return hashlib.md5(content.encode()).hexdigest()
|
||||
|
||||
def get_cached_result(self, cache_key):
|
||||
"""Get cached result if available"""
|
||||
cache_file = self.cache_dir / f"{cache_key}.json"
|
||||
if cache_file.exists():
|
||||
return json.load(cache_file.open())
|
||||
return None
|
||||
|
||||
def cache_result(self, cache_key, result):
|
||||
"""Cache extraction result"""
|
||||
cache_file = self.cache_dir / f"{cache_key}.json"
|
||||
json.dump(result, cache_file.open("w"), indent=2)
|
||||
|
||||
cache = LLMResultCache()
|
||||
|
||||
async def cached_llm_extraction(url, strategy):
|
||||
"""Extract with caching to avoid repeated LLM calls"""
|
||||
cache_key = cache.get_cache_key(
|
||||
url,
|
||||
strategy.instruction,
|
||||
str(strategy.schema)
|
||||
)
|
||||
|
||||
# Check cache first
|
||||
cached_result = cache.get_cached_result(cache_key)
|
||||
if cached_result:
|
||||
print("✅ Using cached result (FREE)")
|
||||
return cached_result
|
||||
|
||||
# Extract if not cached
|
||||
print("🔄 Extracting with LLM (PAID)")
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=url, config=config)
|
||||
|
||||
if result.success:
|
||||
data = json.loads(result.extracted_content)
|
||||
cache.cache_result(cache_key, data)
|
||||
return data
|
||||
|
||||
# Cached extraction avoids repeated costs
|
||||
result = await cached_llm_extraction(url, strategy)
|
||||
```

---

## 7. Error Handling and Quality Control

### Validation and Retry Logic

```python
async def robust_llm_extraction():
    """Implement validation and retry for LLM extraction"""
    max_retries = 3
    strategies = [
        # Try the cheap model first
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data accurately..."
        ),
        # Fall back to the stronger model
        LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token="env:OPENAI_API_KEY"),
            schema=YourModel.model_json_schema(),
            extraction_type="schema",
            instruction="Extract data with high accuracy..."
        )
    ]

    for strategy_idx, strategy in enumerate(strategies):
        for attempt in range(max_retries):
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)

                async with AsyncWebCrawler() as crawler:
                    result = await crawler.arun(url="https://example.com", config=config)

                    if result.success and result.extracted_content:
                        data = json.loads(result.extracted_content)

                        # Validate result quality
                        if validate_extraction_quality(data):
                            print(f"✅ Success with strategy {strategy_idx + 1}, attempt {attempt + 1}")
                            return data
                        else:
                            print("⚠️ Poor quality result, retrying...")
                            continue

            except Exception as e:
                print(f"❌ Attempt {attempt + 1} failed: {e}")
                if attempt == max_retries - 1:
                    print(f"❌ Strategy {strategy_idx + 1} failed completely")

    print("❌ All strategies and retries failed")
    return None

def validate_extraction_quality(data):
    """Validate that an LLM extraction meets basic quality standards"""
    if not data or not isinstance(data, (list, dict)):
        return False

    # Check for common LLM extraction issues
    if isinstance(data, list):
        if len(data) == 0:
            return False

        # Check that all items have the required fields
        for item in data:
            if not isinstance(item, dict) or len(item) < 2:
                return False

    return True

# Robust extraction with validation
result = await robust_llm_extraction()
```

---

## 8. Migration from LLM to Non-LLM

### Pattern Analysis for Schema Generation

```python
async def analyze_llm_results_for_schema():
    """
    Analyze LLM extraction results to create non-LLM schemas.
    Use this to transition from expensive LLM to cheap schema extraction.
    """
    # Step 1: Use the LLM on sample pages to understand the structure
    llm_strategy = LLMExtractionStrategy(
        llm_config=cheap_config,
        extraction_type="block",
        instruction="Extract all structured data from this page"
    )

    sample_urls = ["https://example.com/page1", "https://example.com/page2"]
    llm_results = []

    async with AsyncWebCrawler() as crawler:
        for url in sample_urls:
            config = CrawlerRunConfig(extraction_strategy=llm_strategy)
            result = await crawler.arun(url=url, config=config)

            if result.success:
                llm_results.append({
                    "url": url,
                    "html": result.cleaned_html,
                    "extracted": json.loads(result.extracted_content)
                })

    # Step 2: Analyze patterns in the LLM results
    print("🔍 Analyzing LLM extraction patterns...")

    # Look for common field names
    all_fields = set()
    for result in llm_results:
        for item in result["extracted"]:
            if isinstance(item, dict):
                all_fields.update(item.keys())

    print(f"Common fields found: {all_fields}")

    # Step 3: Generate a schema based on the observed patterns
    if llm_results:
        schema = JsonCssExtractionStrategy.generate_schema(
            html=llm_results[0]["html"],
            target_json_example=json.dumps(llm_results[0]["extracted"][0], indent=2),
            llm_config=cheap_config
        )

        # Save the schema for future use
        with open("generated_schema.json", "w") as f:
            json.dump(schema, f, indent=2)

        print("✅ Schema generated from LLM analysis")
        return schema

# Generate a schema from LLM patterns, then use the schema for all future extractions
schema = await analyze_llm_results_for_schema()
fast_strategy = JsonCssExtractionStrategy(schema)
```

---

## 9. Summary: When LLM is Actually Needed

### ✅ Valid LLM Use Cases (Rare):
1. **Sentiment analysis** and emotional understanding
2. **Knowledge graph extraction** requiring semantic reasoning
3. **Content summarization** and insight generation
4. **Unstructured text analysis** where patterns vary dramatically
5. **Research paper analysis** requiring domain expertise
6. **Complex relationship extraction** between entities

### ❌ Invalid LLM Use Cases (Common Mistakes):
1. **Structured data extraction** from consistent HTML
2. **Simple pattern matching** (emails, prices, dates)
3. **Product information** from e-commerce sites
4. **News article extraction** with consistent structure
5. **Contact information** and basic entity extraction
6. **Table data** and form information
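The "simple pattern matching" cases above are exactly where plain regular expressions beat an LLM on cost, speed, and determinism. A minimal sketch using Python's standard `re` module (the sample text and patterns are illustrative, not production-grade validators):

```python
import re

# Hypothetical sample text covering the common "simple" extraction targets
text = "Contact sales@example.com by 2024-06-30. Price: $49.99 (was $59.99)."

# Emails, prices, and dates need no LLM call at all
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)  # ['sales@example.com']
print(prices)  # ['$49.99', '$59.99']
print(dates)   # ['2024-06-30']
```

Each extraction here costs nothing and runs in microseconds, while an LLM call for the same task would add latency and per-token cost for no accuracy gain.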

### 💡 Decision Framework:
```python
def should_use_llm(extraction_task):
    # Ask these questions in order:
    questions = [
        "Can I identify repeating HTML patterns?",  # No → Consider LLM
        "Am I extracting simple data types?",       # Yes → Use Regex
        "Does the structure vary dramatically?",    # No → Use CSS/XPath
        "Do I need semantic understanding?",        # Yes → Maybe LLM
        "Have I tried generate_schema()?"           # No → Try that first
    ]

    # Only use LLM if all of the following hold:
    return (
        task_requires_semantic_reasoning(extraction_task) and
        structure_varies_dramatically(extraction_task) and
        generate_schema_failed(extraction_task)
    )
```

### 🎯 Best Practice Summary:
1. **Always start** with [extraction-no-llm.md](./extraction-no-llm.md) strategies
2. **Try `generate_schema()`** before manual schema creation
3. **Use LLM sparingly** and only for semantic understanding
4. **Monitor costs** and implement budget controls
5. **Cache results** to avoid repeated LLM calls
6. **Validate quality** of LLM extractions
7. **Plan migration** from LLM to schema-based extraction
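Point 4 above, budget controls, can start as a simple guard that refuses further LLM calls once an estimated spend is reached. A minimal sketch (the per-call cost is an assumed placeholder in integer cents; substitute your provider's real pricing and wire `record_call()` in front of your actual extraction call):

```python
class LLMBudgetGuard:
    """Track estimated spend in integer cents and stop LLM calls at the budget."""

    def __init__(self, budget_cents: int, est_cost_per_call_cents: int = 1):
        # est_cost_per_call_cents is a placeholder; use your provider's pricing
        self.budget_cents = budget_cents
        self.est_cost_per_call_cents = est_cost_per_call_cents
        self.spent_cents = 0

    def can_call(self) -> bool:
        # True while one more call still fits within the budget
        return self.spent_cents + self.est_cost_per_call_cents <= self.budget_cents

    def record_call(self) -> None:
        if not self.can_call():
            raise RuntimeError(f"LLM budget exhausted ({self.spent_cents} cents spent)")
        self.spent_cents += self.est_cost_per_call_cents

guard = LLMBudgetGuard(budget_cents=5)  # allows five 1-cent calls
calls_made = 0
while guard.can_call():
    guard.record_call()  # place your actual LLM extraction call here
    calls_made += 1

print(calls_made)  # → 5
```

Using integer cents rather than floats avoids rounding drift when many small costs accumulate.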

Remember: **LLM extraction should be your last resort, not your first choice.**

---

**📖 Recommended Reading Order:**
1. [extraction-no-llm.md](./extraction-no-llm.md) - Start here for 99% of use cases
2. This document - Only when non-LLM strategies are insufficient