Compare commits

20 Commits

Author SHA1 Message Date
ntohidi
ee25c771d8 feat(cli): add deep crawling options with configurable strategies and max pages. ref #874 2025-07-02 14:07:23 +02:00
UncleCode
a353515271 feat: Add virtual scroll support for modern web scraping
Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
2025-06-29 20:41:37 +08:00
UncleCode
539a324cf6 refactor(link_extractor): remove link_extractor and rename to link_preview
This change removes the link_extractor module and renames it to link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability. Other files have been updated to reflect this change, ensuring consistency across the project.

BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.
2025-06-27 21:54:22 +08:00
UncleCode
5c9c305dbf feat: Add advanced link head extraction with three-layer scoring system (#1)
Squashed commit from feature/link-extractor branch implementing comprehensive link analysis:

- Extract HTML head content from discovered links with parallel processing
- Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores
- New LinkExtractionConfig class for type-safe configuration
- Pattern-based filtering for internal/external links
- Comprehensive documentation and examples
2025-06-27 20:06:04 +08:00
UncleCode
e528086341 test(async_assistant): add new tests for extract pipeline
Introduced two new test files to enhance coverage for the extract pipeline functionality. The tests aim to validate the behavior of the pipeline under various scenarios, ensuring robustness and reliability.

No breaking changes. Closes issue #123.
2025-06-23 10:44:27 +08:00
UncleCode
c0fd36982d Update all documentation to import extraction strategies directly from crawl4ai. 2025-06-10 18:08:27 +08:00
UncleCode
cab457e9c7 Merge branch 'next' of https://github.com/unclecode/crawl4ai into next 2025-06-10 15:54:20 +08:00
UncleCode
2a0c0ed18d chore(deps): add httpx extras (#1195) 2025-06-10 15:47:03 +08:00
UncleCode
c73a130c50 Set memory_wait_timeout default to 10 minutes (#1193) 2025-06-10 15:47:03 +08:00
UncleCode
ef6f4329fa Add use_stemming option to BM25ContentFilter (#1192) 2025-06-10 15:44:45 +08:00
UncleCode
4eb90b41b6 Refactor Crawl4AI Assistant: Rename Schema Builder to Click2Crawl, update UI elements, and remove deprecated files
- Updated overlay.css to add gap in titlebar.
- Deleted schemaBuilder_v1.js and associated zip files (v1.0.0 to v1.2.0).
- Modified index.html to reflect new Click2Crawl feature and updated descriptions.
- Updated manifest.json to include new JavaScript files for Click2Crawl and markdown extraction.
- Refined popup styles and HTML to align with new feature names and functionalities.
- Enhanced user instructions and tooltips to guide users on the new Click2Crawl and Markdown Extraction features.
2025-06-10 15:40:26 +08:00
UncleCode
0ac12da9f3 feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture
New Features:
- Click2Crawl: Visual element selection with markdown conversion
  - Ctrl/Cmd+Click to select multiple elements
  - Visual text mode for WYSIWYG extraction
  - Real-time markdown preview with syntax highlighting
  - Export to .md file or clipboard

- Schema Builder Enhancement: Instant data extraction without LLMs
  - Test schemas directly in browser
  - See JSON results immediately
  - Export data or Python code
  - Cloud deployment ready (coming soon)

- Modular Architecture:
  - Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js
  - Added contentAnalyzer.js and markdownConverter.js modules
  - Shared utilities and CSS reset system
  - Integrated marked.js for markdown rendering

🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview

📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)

🔧 Technical Enhancements:
- Refactored 3464 lines of monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll position tracking for badges
- Enhanced error handling throughout

This release transforms the Chrome Extension from a simple tool into a powerful
visual data extraction suite, making web scraping accessible to everyone.
2025-06-09 23:18:27 +08:00
UncleCode
40640badad feat: add Script Builder to Chrome Extension and reorganize LLM context files
This commit introduces significant enhancements to the Crawl4AI ecosystem:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder is released as Alpha, acknowledging potential bugs while providing
  early access to this powerful automation recording feature.
2025-06-08 22:02:12 +08:00
UncleCode
926592649e Add Crawl4AI Assistant Chrome Extension
- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
2025-06-08 18:34:05 +08:00
UncleCode
b870bfdb6c chore(deps): add httpx extras (#1195) 2025-06-08 16:06:38 +08:00
UncleCode
6f3a0ea38e Create "Apps" section in documentation and Add interactive c4a-script playground and LLM context builder for Crawl4AI
- Created a new HTML page (`index.html`) for the interactive LLM context builder, allowing users to select and combine different `crawl4ai` context files.
- Implemented JavaScript functionality (`llmtxt.js`) to manage component selection, context types, and file downloads.
- Added CSS styles (`llmtxt.css`) for a terminal-themed UI.
- Introduced a new Markdown file (`build.md`) detailing the requirements and functionality of the context builder.
- Updated the navigation in `mkdocs.yml` to include links to the new context builder and demo apps.
- Added a new Markdown file (`why.md`) explaining the motivation behind the new context structure and its benefits for AI coding assistants.
2025-06-08 15:48:17 +08:00
UncleCode
451b0d6c9a Set memory_wait_timeout default to 10 minutes (#1193) 2025-06-08 13:53:09 +08:00
UncleCode
08a2cdae53 Add C4A-Script support and documentation
- Generate OneShot js code generator
- Introduced a new C4A-Script tutorial example for login flow using Blockly.
- Updated index.html to include Blockly theme and event editor modal for script editing.
- Created a test HTML file for testing Blockly integration.
- Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples.
- Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples.
- Updated mkdocs.yml to include new C4A-Script documentation in navigation.
2025-06-07 23:07:19 +08:00
UncleCode
ca03acbc82 Add new commands to the Crawl4AI script transpiler and create an interactive tutorial that walks users through multiple steps, applying the syntax to automate a page. Fix several issues and add commands for setting input values, setting variables, clearing input fields, and more. 2025-06-06 23:03:26 +08:00
UncleCode
3f6f2e998c feat(script): add new scripting capabilities and documentation
This commit introduces a comprehensive set of new scripts and examples to enhance the scripting capabilities of the crawl4ai project. The changes include the addition of several Python scripts for compiling and executing scripts, as well as a variety of example scripts demonstrating different functionalities such as login flows, data extraction, and multi-step workflows. Additionally, detailed documentation has been created to guide users on how to utilize these new features effectively.

The following significant modifications were made:
- Added core scripting files: , , and .
- Created a new documentation file  to provide an overview of the new features.
- Introduced multiple example scripts in the  directory to showcase various use cases.
- Updated  and  to integrate the new functionalities.
- Added font assets for improved documentation presentation.

These changes significantly expand the functionality of the crawl4ai project, allowing users to create more complex and varied scripts with ease.
2025-06-06 17:16:53 +08:00
255 changed files with 82824 additions and 101294 deletions

View File

@@ -1,3 +1,28 @@
{
"permissions": {
"allow": [
"Bash(cd:*)",
"Bash(python3:*)",
"Bash(python:*)",
"Bash(grep:*)",
"Bash(mkdir:*)",
"Bash(cp:*)",
"Bash(rm:*)",
"Bash(true)",
"Bash(./package-extension.sh:*)",
"Bash(find:*)",
"Bash(chmod:*)",
"Bash(rg:*)",
"Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 5 -B 5 \"Script Builder\" docs/md_v2/apps/crawl4ai-assistant/)",
"Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg -A 30 \"generateCode\\(events, format\\)\" docs/md_v2/apps/crawl4ai-assistant/content/content.js)",
"Bash(/Users/unclecode/.npm-global/lib/node_modules/@anthropic-ai/claude-code/vendor/ripgrep/arm64-darwin/rg \"<style>\" docs/md_v2/apps/crawl4ai-assistant/index.html -A 5)",
"Bash(git checkout:*)",
"Bash(docker logs:*)",
"Bash(curl:*)",
"Bash(docker compose:*)",
"Bash(./test-final-integration.sh:*)",
"Bash(mv:*)"
]
},
"enableAllProjectMcpServers": false
}

View File

@@ -5,6 +5,20 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.7.x] - 2025-06-29
### Added
- **Virtual Scroll Support**: New `VirtualScrollConfig` for handling virtualized scrolling on modern websites
- Automatically detects and handles three scrolling scenarios:
- Content unchanged (continue scrolling)
- Content appended (traditional infinite scroll)
- Content replaced (true virtual scroll - Twitter/Instagram style)
- Captures ALL content from pages that replace DOM elements during scroll
- Intelligent deduplication based on normalized text content
- Configurable scroll amount, count, and wait times
- Seamless integration with existing extraction strategies
- Comprehensive examples including Twitter timeline, Instagram grid, and mixed content scenarios
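
The three scenarios listed above can be sketched in a few lines of Python. This is only an illustration: it assumes the comparison is done on the container's HTML before and after a scroll, with a prefix check standing in for "appended", and the function name `classify_scroll_change` is hypothetical rather than part of Crawl4AI.

```python
def classify_scroll_change(previous_html: str, current_html: str) -> str:
    """Classify what a scroll did to a container's HTML (illustrative only)."""
    if current_html == previous_html:
        # Nothing loaded yet, or nothing left to load: keep scrolling.
        return "unchanged"
    if current_html.startswith(previous_html):
        # Old items are still present and new ones follow them.
        return "appended"
    # Old items were recycled, so the earlier HTML must be saved separately.
    return "replaced"


before = "<div>Item 1</div><div>Item 2</div>"
print(classify_scroll_change(before, before))
print(classify_scroll_change(before, before + "<div>Item 3</div>"))
print(classify_scroll_change(before, "<div>Item 3</div><div>Item 4</div>"))
```

Only the "replaced" case needs a saved copy of the earlier HTML, because in the other two cases the earlier items are still in the page.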
## [Unreleased]
### Added

View File

@@ -352,7 +352,7 @@ if __name__ == "__main__":
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
import json
async def main():
@@ -426,7 +426,7 @@ if __name__ == "__main__":
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):

View File

@@ -2,8 +2,8 @@
import warnings
from .async_webcrawler import AsyncWebCrawler, CacheMode
-# MODIFIED: Add SeedingConfig here
-from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig
+# MODIFIED: Add SeedingConfig and VirtualScrollConfig here
+from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig
from .content_scraping_strategy import (
ContentScrapingStrategy,
@@ -37,6 +37,7 @@ from .content_filter_strategy import (
)
from .models import CrawlResult, MarkdownGenerationResult, DisplayMode
from .components.crawler_monitor import CrawlerMonitor
from .link_preview import LinkPreview
from .async_dispatcher import (
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
@@ -69,6 +70,16 @@ from .deep_crawling import (
# NEW: Import AsyncUrlSeeder
from .async_url_seeder import AsyncUrlSeeder
# C4A Script Language Support
from .script import (
compile as c4a_compile,
validate as c4a_validate,
compile_file as c4a_compile_file,
CompilationResult,
ValidationResult,
ErrorDetail
)
from .utils import (
start_colab_display_server,
setup_colab_environment
@@ -81,8 +92,9 @@ __all__ = [
"BrowserProfiler",
"LLMConfig",
"GeolocationConfig",
-# NEW: Add SeedingConfig
+# NEW: Add SeedingConfig and VirtualScrollConfig
"SeedingConfig",
"VirtualScrollConfig",
# NEW: Add AsyncUrlSeeder
"AsyncUrlSeeder",
"DeepCrawlStrategy",
@@ -131,6 +143,7 @@ __all__ = [
"SemaphoreDispatcher",
"RateLimiter",
"CrawlerMonitor",
"LinkPreview",
"DisplayMode",
"MarkdownGenerationResult",
"Crawl4aiDockerClient",
@@ -139,6 +152,13 @@ __all__ = [
"ProxyConfig",
"start_colab_display_server",
"setup_colab_environment",
# C4A Script additions
"c4a_compile",
"c4a_validate",
"c4a_compile_file",
"CompilationResult",
"ValidationResult",
"ErrorDetail",
]

View File

@@ -1,4 +1,5 @@
import os
from typing import Union
from .config import (
DEFAULT_PROVIDER,
DEFAULT_PROVIDER_API_KEY,
@@ -17,7 +18,7 @@ from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
-from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
+from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy, LXMLWebScrapingStrategy
from .deep_crawling import DeepCrawlStrategy
from .cache_context import CacheMode
@@ -594,6 +595,146 @@ class BrowserConfig:
return config
return BrowserConfig.from_kwargs(config)
class VirtualScrollConfig:
"""Configuration for virtual scroll handling.
This config enables capturing content from pages with virtualized scrolling
(like Twitter, Instagram feeds) where DOM elements are recycled as user scrolls.
"""
def __init__(
self,
container_selector: str,
scroll_count: int = 10,
scroll_by: Union[str, int] = "container_height",
wait_after_scroll: float = 0.5,
):
"""
Initialize virtual scroll configuration.
Args:
container_selector: CSS selector for the scrollable container
scroll_count: Maximum number of scrolls to perform
scroll_by: Amount to scroll - can be:
- "container_height": scroll by container's height
- "page_height": scroll by viewport height
- int: fixed pixel amount
wait_after_scroll: Seconds to wait after each scroll for content to load
"""
self.container_selector = container_selector
self.scroll_count = scroll_count
self.scroll_by = scroll_by
self.wait_after_scroll = wait_after_scroll
def to_dict(self) -> dict:
"""Convert to dictionary for serialization."""
return {
"container_selector": self.container_selector,
"scroll_count": self.scroll_count,
"scroll_by": self.scroll_by,
"wait_after_scroll": self.wait_after_scroll,
}
@classmethod
def from_dict(cls, data: dict) -> "VirtualScrollConfig":
"""Create instance from dictionary."""
return cls(**data)
class LinkPreviewConfig:
"""Configuration for link head extraction and scoring."""
def __init__(
self,
include_internal: bool = True,
include_external: bool = False,
include_patterns: Optional[List[str]] = None,
exclude_patterns: Optional[List[str]] = None,
concurrency: int = 10,
timeout: int = 5,
max_links: int = 100,
query: Optional[str] = None,
score_threshold: Optional[float] = None,
verbose: bool = False
):
"""
Initialize link extraction configuration.
Args:
include_internal: Whether to include same-domain links
include_external: Whether to include different-domain links
include_patterns: List of glob patterns to include (e.g., ["*/docs/*", "*/api/*"])
exclude_patterns: List of glob patterns to exclude (e.g., ["*/login*", "*/admin*"])
concurrency: Number of links to process simultaneously
timeout: Timeout in seconds for each link's head extraction
max_links: Maximum number of links to process (prevents overload)
query: Query string for BM25 contextual scoring (optional)
score_threshold: Minimum relevance score to include links (0.0-1.0, optional)
verbose: Show detailed progress during extraction
"""
self.include_internal = include_internal
self.include_external = include_external
self.include_patterns = include_patterns
self.exclude_patterns = exclude_patterns
self.concurrency = concurrency
self.timeout = timeout
self.max_links = max_links
self.query = query
self.score_threshold = score_threshold
self.verbose = verbose
# Validation
if concurrency <= 0:
raise ValueError("concurrency must be positive")
if timeout <= 0:
raise ValueError("timeout must be positive")
if max_links <= 0:
raise ValueError("max_links must be positive")
if score_threshold is not None and not (0.0 <= score_threshold <= 1.0):
raise ValueError("score_threshold must be between 0.0 and 1.0")
if not include_internal and not include_external:
raise ValueError("At least one of include_internal or include_external must be True")
@staticmethod
def from_dict(config_dict: Dict[str, Any]) -> "LinkPreviewConfig":
"""Create LinkPreviewConfig from dictionary (for backward compatibility)."""
if not config_dict:
return None
return LinkPreviewConfig(
include_internal=config_dict.get("include_internal", True),
include_external=config_dict.get("include_external", False),
include_patterns=config_dict.get("include_patterns"),
exclude_patterns=config_dict.get("exclude_patterns"),
concurrency=config_dict.get("concurrency", 10),
timeout=config_dict.get("timeout", 5),
max_links=config_dict.get("max_links", 100),
query=config_dict.get("query"),
score_threshold=config_dict.get("score_threshold"),
verbose=config_dict.get("verbose", False)
)
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary format."""
return {
"include_internal": self.include_internal,
"include_external": self.include_external,
"include_patterns": self.include_patterns,
"exclude_patterns": self.exclude_patterns,
"concurrency": self.concurrency,
"timeout": self.timeout,
"max_links": self.max_links,
"query": self.query,
"score_threshold": self.score_threshold,
"verbose": self.verbose
}
def clone(self, **kwargs) -> "LinkPreviewConfig":
"""Create a copy with updated values."""
config_dict = self.to_dict()
config_dict.update(kwargs)
return LinkPreviewConfig.from_dict(config_dict)
class HTTPCrawlerConfig:
"""HTTP-specific crawler configuration"""
@@ -816,6 +957,12 @@ class CrawlerRunConfig():
table_score_threshold (int): Minimum score threshold for processing a table.
Default: 7.
# Virtual Scroll Parameters
virtual_scroll_config (VirtualScrollConfig or dict or None): Configuration for handling virtual scroll containers.
Used for capturing content from pages with virtualized
scrolling (e.g., Twitter, Instagram feeds).
Default: None.
# Link and Domain Handling Parameters
exclude_social_media_domains (list of str): List of domains to exclude for social media links.
Default: SOCIAL_MEDIA_DOMAINS (from config).
@@ -829,6 +976,9 @@ class CrawlerRunConfig():
Default: [].
exclude_internal_links (bool): If True, exclude internal links from the results.
Default: False.
score_links (bool): If True, calculate intrinsic quality scores for all links using URL structure,
text quality, and contextual relevance metrics. Separate from link_preview_config.
Default: False.
# Debugging and Logging Parameters
verbose (bool): Enable verbose logging.
@@ -911,6 +1061,7 @@ class CrawlerRunConfig():
semaphore_count: int = 5,
# Page Interaction Parameters
js_code: Union[str, List[str]] = None,
c4a_script: Union[str, List[str]] = None,
js_only: bool = False,
ignore_body_visibility: bool = True,
scan_full_page: bool = False,
@@ -938,6 +1089,7 @@ class CrawlerRunConfig():
exclude_social_media_links: bool = False,
exclude_domains: list = None,
exclude_internal_links: bool = False,
score_links: bool = False,
# Debugging and Logging Parameters
verbose: bool = True,
log_console: bool = False,
@@ -954,6 +1106,10 @@ class CrawlerRunConfig():
user_agent_generator_config: dict = {},
# Deep Crawl Parameters
deep_crawl_strategy: Optional[DeepCrawlStrategy] = None,
# Link Extraction Parameters
link_preview_config: Union[LinkPreviewConfig, Dict[str, Any]] = None,
# Virtual Scroll Parameters
virtual_scroll_config: Union[VirtualScrollConfig, Dict[str, Any]] = None,
# Experimental Parameters
experimental: Dict[str, Any] = None,
):
@@ -975,7 +1131,7 @@ class CrawlerRunConfig():
self.remove_forms = remove_forms
self.prettiify = prettiify
self.parser_type = parser_type
-self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
+self.scraping_strategy = scraping_strategy or LXMLWebScrapingStrategy()
self.proxy_config = proxy_config
self.proxy_rotation_strategy = proxy_rotation_strategy
@@ -1009,6 +1165,7 @@ class CrawlerRunConfig():
# Page Interaction Parameters
self.js_code = js_code
self.c4a_script = c4a_script
self.js_only = js_only
self.ignore_body_visibility = ignore_body_visibility
self.scan_full_page = scan_full_page
@@ -1040,6 +1197,7 @@ class CrawlerRunConfig():
self.exclude_social_media_links = exclude_social_media_links
self.exclude_domains = exclude_domains or []
self.exclude_internal_links = exclude_internal_links
self.score_links = score_links
# Debugging and Logging Parameters
self.verbose = verbose
@@ -1082,8 +1240,83 @@ class CrawlerRunConfig():
# Deep Crawl Parameters
self.deep_crawl_strategy = deep_crawl_strategy
# Link Extraction Parameters
if link_preview_config is None:
self.link_preview_config = None
elif isinstance(link_preview_config, LinkPreviewConfig):
self.link_preview_config = link_preview_config
elif isinstance(link_preview_config, dict):
# Convert dict to config object for backward compatibility
self.link_preview_config = LinkPreviewConfig.from_dict(link_preview_config)
else:
raise ValueError("link_preview_config must be LinkPreviewConfig object or dict")
# Virtual Scroll Parameters
if virtual_scroll_config is None:
self.virtual_scroll_config = None
elif isinstance(virtual_scroll_config, VirtualScrollConfig):
self.virtual_scroll_config = virtual_scroll_config
elif isinstance(virtual_scroll_config, dict):
# Convert dict to config object for backward compatibility
self.virtual_scroll_config = VirtualScrollConfig.from_dict(virtual_scroll_config)
else:
raise ValueError("virtual_scroll_config must be VirtualScrollConfig object or dict")
# Experimental Parameters
self.experimental = experimental or {}
# Compile C4A scripts if provided
if self.c4a_script and not self.js_code:
self._compile_c4a_script()
def _compile_c4a_script(self):
"""Compile C4A script to JavaScript"""
try:
# Try importing the compiler
try:
from .script import compile
except ImportError:
from crawl4ai.script import compile
# Handle both string and list inputs
if isinstance(self.c4a_script, str):
scripts = [self.c4a_script]
else:
scripts = self.c4a_script
# Compile each script
compiled_js = []
for i, script in enumerate(scripts):
result = compile(script)
if result.success:
compiled_js.extend(result.js_code)
else:
# Format error message following existing patterns
error = result.first_error
error_msg = (
f"C4A Script compilation error (script {i+1}):\n"
f" Line {error.line}, Column {error.column}: {error.message}\n"
f" Code: {error.source_line}"
)
if error.suggestions:
error_msg += f"\n Suggestion: {error.suggestions[0].message}"
raise ValueError(error_msg)
self.js_code = compiled_js
except ImportError:
raise ValueError(
"C4A script compiler not available. "
"Please ensure crawl4ai.script module is properly installed."
)
except Exception as e:
# Re-raise with context
if "compilation error" not in str(e).lower():
raise ValueError(f"Failed to compile C4A script: {str(e)}")
raise
def __getattr__(self, name):
@@ -1186,6 +1419,7 @@ class CrawlerRunConfig():
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
exclude_domains=kwargs.get("exclude_domains", []),
exclude_internal_links=kwargs.get("exclude_internal_links", False),
score_links=kwargs.get("score_links", False),
# Debugging and Logging Parameters
verbose=kwargs.get("verbose", True),
log_console=kwargs.get("log_console", False),
@@ -1201,6 +1435,8 @@ class CrawlerRunConfig():
user_agent_generator_config=kwargs.get("user_agent_generator_config", {}),
# Deep Crawl Parameters
deep_crawl_strategy=kwargs.get("deep_crawl_strategy"),
# Link Extraction Parameters
link_preview_config=kwargs.get("link_preview_config"),
url=kwargs.get("url"),
# Experimental Parameters
experimental=kwargs.get("experimental"),
@@ -1284,6 +1520,7 @@ class CrawlerRunConfig():
"exclude_social_media_links": self.exclude_social_media_links,
"exclude_domains": self.exclude_domains,
"exclude_internal_links": self.exclude_internal_links,
"score_links": self.score_links,
"verbose": self.verbose,
"log_console": self.log_console,
"capture_network_requests": self.capture_network_requests,
@@ -1295,6 +1532,7 @@ class CrawlerRunConfig():
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
"deep_crawl_strategy": self.deep_crawl_strategy,
"link_preview_config": self.link_preview_config.to_dict() if self.link_preview_config else None,
"url": self.url,
"experimental": self.experimental,
}
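
The object-or-dict handling that `CrawlerRunConfig` applies to `virtual_scroll_config` and `link_preview_config` in this file can be sketched on its own. The version below is a simplified stand-in, not the library code: `ScrollSettings` and `coerce_scroll_settings` are hypothetical names, and a dataclass is used only to keep the example short. The field names and defaults mirror those shown for `VirtualScrollConfig` in this diff.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional, Union


@dataclass
class ScrollSettings:
    """Hypothetical stand-in mirroring the fields of VirtualScrollConfig."""
    container_selector: str
    scroll_count: int = 10
    scroll_by: Union[str, int] = "container_height"
    wait_after_scroll: float = 0.5

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ScrollSettings":
        return cls(**data)


def coerce_scroll_settings(
    value: Union[ScrollSettings, Dict[str, Any], None]
) -> Optional[ScrollSettings]:
    """Accept None, a settings object, or a plain dict; reject anything else."""
    if value is None:
        return None
    if isinstance(value, ScrollSettings):
        return value
    if isinstance(value, dict):
        return ScrollSettings.from_dict(value)
    raise ValueError("expected a ScrollSettings object, a dict, or None")


settings = coerce_scroll_settings({"container_selector": "#feed", "scroll_count": 3})
print(settings)
print(ScrollSettings.from_dict(settings.to_dict()) == settings)
```

Accepting a plain dict keeps configurations that arrive as JSON usable without an extra conversion step, while the `to_dict`/`from_dict` pair gives a lossless round trip for these four fields.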

View File

@@ -898,6 +898,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
if config.scan_full_page:
await self._handle_full_page_scan(page, config.scroll_delay)
# Handle virtual scroll if configured
if config.virtual_scroll_config:
await self._handle_virtual_scroll(page, config.virtual_scroll_config)
# Execute JavaScript if provided
# if config.js_code:
# if isinstance(config.js_code, str):
@@ -1149,6 +1153,177 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
# await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await self.safe_scroll(page, 0, total_height)
async def _handle_virtual_scroll(self, page: Page, config: "VirtualScrollConfig"):
"""
Handle virtual scroll containers (e.g., Twitter-like feeds) by capturing
content at different scroll positions and merging unique elements.
Following the design:
1. Get container HTML
2. Scroll by container height
3. Wait and check if container HTML changed
4. Three cases:
- No change: continue scrolling
- New items added (appended): continue (items already in page)
- Items replaced: capture HTML chunk and add to list
5. After N scrolls, merge chunks if any were captured
Args:
page: The Playwright page object
config: Virtual scroll configuration
"""
try:
# Import VirtualScrollConfig to avoid circular import
from .async_configs import VirtualScrollConfig
# Ensure config is a VirtualScrollConfig instance
if isinstance(config, dict):
config = VirtualScrollConfig.from_dict(config)
self.logger.info(
message="Starting virtual scroll capture for container: {selector}",
tag="VSCROLL",
params={"selector": config.container_selector}
)
# JavaScript function to handle virtual scroll capture
virtual_scroll_js = """
async (config) => {
const container = document.querySelector(config.container_selector);
if (!container) {
throw new Error(`Container not found: ${config.container_selector}`);
}
// List to store HTML chunks when content is replaced
const htmlChunks = [];
let previousHTML = container.innerHTML;
let scrollCount = 0;
// Determine scroll amount
let scrollAmount;
if (typeof config.scroll_by === 'number') {
scrollAmount = config.scroll_by;
} else if (config.scroll_by === 'page_height') {
scrollAmount = window.innerHeight;
} else { // container_height
scrollAmount = container.offsetHeight;
}
// Perform scrolling
while (scrollCount < config.scroll_count) {
// Scroll the container
container.scrollTop += scrollAmount;
// Wait for content to potentially load
await new Promise(resolve => setTimeout(resolve, config.wait_after_scroll * 1000));
// Get current HTML
const currentHTML = container.innerHTML;
// Determine what changed
if (currentHTML === previousHTML) {
// Case 0: No change - continue scrolling
console.log(`Scroll ${scrollCount + 1}: No change in content`);
} else if (currentHTML.startsWith(previousHTML)) {
// Case 1: New items appended - content already in page
console.log(`Scroll ${scrollCount + 1}: New items appended`);
} else {
// Case 2: Items replaced - capture the previous HTML
console.log(`Scroll ${scrollCount + 1}: Content replaced, capturing chunk`);
htmlChunks.push(previousHTML);
}
// Update previous HTML for next iteration
previousHTML = currentHTML;
scrollCount++;
// Check if we've reached the end
if (container.scrollTop + container.clientHeight >= container.scrollHeight - 10) {
console.log(`Reached end of scrollable content at scroll ${scrollCount}`);
// Capture final chunk if content was replaced
if (htmlChunks.length > 0) {
htmlChunks.push(currentHTML);
}
break;
}
}
// If we have chunks (case 2 occurred), merge them
if (htmlChunks.length > 0) {
console.log(`Merging ${htmlChunks.length} HTML chunks`);
// Parse all chunks to extract unique elements
const tempDiv = document.createElement('div');
const seenTexts = new Set();
const uniqueElements = [];
// Process each chunk
for (const chunk of htmlChunks) {
tempDiv.innerHTML = chunk;
const elements = tempDiv.children;
for (let i = 0; i < elements.length; i++) {
const element = elements[i];
// Normalize text for deduplication
const normalizedText = element.innerText
.toLowerCase()
.replace(/[\\s\\W]/g, ''); // Remove spaces and symbols
if (!seenTexts.has(normalizedText)) {
seenTexts.add(normalizedText);
uniqueElements.push(element.outerHTML);
}
}
}
// Replace container content with merged unique elements
container.innerHTML = uniqueElements.join('\\n');
console.log(`Merged ${uniqueElements.length} unique elements from ${htmlChunks.length} chunks`);
return {
success: true,
chunksCount: htmlChunks.length,
uniqueCount: uniqueElements.length,
replaced: true
};
} else {
console.log('No content replacement detected, all content remains in page');
return {
success: true,
chunksCount: 0,
uniqueCount: 0,
replaced: false
};
}
}
"""
# Execute virtual scroll capture
result = await page.evaluate(virtual_scroll_js, config.to_dict())
if result.get("replaced", False):
self.logger.success(
message="Virtual scroll completed. Merged {unique} unique elements from {chunks} chunks",
tag="VSCROLL",
params={
"unique": result.get("uniqueCount", 0),
"chunks": result.get("chunksCount", 0)
}
)
else:
self.logger.info(
message="Virtual scroll completed. No content replacement detected, no merging needed",
tag="VSCROLL"
)
except Exception as e:
self.logger.error(
message="Virtual scroll capture failed: {error}",
tag="VSCROLL",
params={"error": str(e)}
)
# Continue with normal flow even if virtual scroll fails
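The three-way change detection and the text-based deduplication in the JavaScript above can be sketched in Python for experimentation outside a browser. This is a minimal sketch, not the crawler's implementation: the function names `classify_change`, `normalize`, and `merge_unique` are hypothetical, and chunks are modelled as lists of item texts rather than DOM children. The `re.ASCII` flag is used so that `\W` behaves like the JavaScript regex without the `u` flag.

```python
import re
from typing import List


def classify_change(previous_html: str, current_html: str) -> str:
    """Classify one scroll step the same way the JS loop does."""
    if current_html == previous_html:
        return "no_change"   # Case 0: nothing new was rendered
    if current_html.startswith(previous_html):
        return "appended"    # Case 1: old items are still in the page
    return "replaced"        # Case 2: old items were recycled, capture them


def normalize(text: str) -> str:
    """Lowercase, then drop whitespace and non-word characters (ASCII rules)."""
    return re.sub(r"[\s\W]", "", text.lower(), flags=re.ASCII)


def merge_unique(chunks: List[List[str]]) -> List[str]:
    """Merge item texts from captured chunks, keeping the first occurrence."""
    seen = set()
    merged = []
    for chunk in chunks:
        for item in chunk:
            key = normalize(item)
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```

One consequence of this normalization is worth checking against real pages: text made up only of non-ASCII characters or symbols normalizes to the empty string, so several such items would be treated as duplicates of each other.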
async def _handle_download(self, download):
"""
Handle file downloads.

@@ -109,12 +109,16 @@ def _parse_head(src: str) -> Dict[str, Any]:
elif "charset" in el.attrib:
info["charset"] = el.attrib["charset"].lower()
for el in doc.xpath(".//link"):
rel = " ".join(el.attrib.get("rel", [])).lower()
if not rel:
rel_attr = el.attrib.get("rel", "")
if not rel_attr:
continue
# Handle multiple space-separated rel values
rel_values = rel_attr.lower().split()
entry = {a: el.attrib[a] for a in (
"href", "as", "type", "hreflang") if a in el.attrib}
info["link"].setdefault(rel, []).append(entry)
# Add entry for each rel value
for rel in rel_values:
info["link"].setdefault(rel, []).append(entry)
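The `rel` handling in this hunk can be illustrated in isolation. The sketch below assumes each `<link>` element's attributes arrive as a plain dict of strings (the helper name `index_links_by_rel` is hypothetical). It also shows why the earlier `" ".join(...)` form misbehaves when `rel` is a string: joining a string iterates over its characters.

```python
from typing import Dict, List


def index_links_by_rel(link_attribs: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Index <link> entries under every space-separated rel value."""
    index: Dict[str, List[Dict[str, str]]] = {}
    for attrib in link_attribs:
        rel_attr = attrib.get("rel", "")
        if not rel_attr:
            continue  # links without a rel value are skipped
        entry = {a: attrib[a] for a in ("href", "as", "type", "hreflang") if a in attrib}
        # rel="shortcut icon" yields one entry under "shortcut" and one under "icon"
        for rel in rel_attr.lower().split():
            index.setdefault(rel, []).append(entry)
    return index


# Joining a string inserts a space between its characters
joined = " ".join("icon")
```

Note that the same `entry` dict object is stored under each rel key, so mutating it through one key is visible through the others.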
# Extract JSON-LD structured data
for script in doc.xpath('.//script[@type="application/ld+json"]'):
if script.text:
@@ -467,6 +471,200 @@ class AsyncUrlSeeder:
"info", "Finished URL seeding for multiple domains.", tag="URL_SEED")
return final_results
async def extract_head_for_urls(
self,
urls: List[str],
config: Optional["SeedingConfig"] = None,
concurrency: int = 10,
timeout: int = 5
) -> List[Dict[str, Any]]:
"""
Extract head content for a custom list of URLs using URLSeeder's parallel processing.
This method reuses URLSeeder's efficient parallel processing, caching, and head extraction
logic to process a custom list of URLs rather than discovering URLs from sources.
Parameters
----------
urls : List[str]
List of URLs to extract head content from
config : SeedingConfig, optional
Configuration object. If None, uses default settings for head extraction
concurrency : int, default=10
Number of concurrent requests
timeout : int, default=5
Timeout for each request in seconds
Returns
-------
List[Dict[str, Any]]
List of dictionaries containing url, status, head_data, and optional relevance_score
"""
# Create default config if none provided
if config is None:
# Import here to avoid circular imports
from .async_configs import SeedingConfig
config = SeedingConfig(
extract_head=True,
concurrency=concurrency,
verbose=False
)
# Override concurrency and ensure head extraction is enabled
config.concurrency = concurrency
config.extract_head = True
self._log("info", "Starting head extraction for {count} custom URLs",
params={"count": len(urls)}, tag="URL_SEED")
# Setup rate limiting if specified in config
if config.hits_per_sec:
if config.hits_per_sec <= 0:
self._log("warning", "hits_per_sec must be positive. Disabling rate limiting.", tag="URL_SEED")
self._rate_sem = None
else:
self._rate_sem = asyncio.Semaphore(config.hits_per_sec)
else:
self._rate_sem = None
# Use bounded queue to prevent memory issues with large URL lists
queue_size = min(10000, max(1000, concurrency * 100))
queue = asyncio.Queue(maxsize=queue_size)
producer_done = asyncio.Event()
stop_event = asyncio.Event()
seen: set[str] = set()
# Results collection
results: List[Dict[str, Any]] = []
async def producer():
"""Producer to feed URLs into the queue."""
try:
for url in urls:
if url in seen:
self._log("debug", "Skipping duplicate URL: {url}",
params={"url": url}, tag="URL_SEED")
continue
if stop_event.is_set():
break
seen.add(url)
await queue.put(url)
finally:
producer_done.set()
async def worker(res_list: List[Dict[str, Any]]):
"""Worker to process URLs from the queue."""
while True:
try:
# Wait for URL or producer completion
url = await asyncio.wait_for(queue.get(), timeout=1.0)
except asyncio.TimeoutError:
if producer_done.is_set() and queue.empty():
break
continue
try:
# Use existing _validate method which handles head extraction, caching, etc.
await self._validate(
url, res_list,
live=False, # We're not doing live checks, just head extraction
extract=True, # Always extract head content
timeout=timeout,
verbose=config.verbose or False,
query=config.query,
score_threshold=config.score_threshold,
scoring_method=config.scoring_method or "bm25",
filter_nonsense=config.filter_nonsense_urls
)
except Exception as e:
self._log("error", "Failed to process URL {url}: {error}",
params={"url": url, "error": str(e)}, tag="URL_SEED")
# Add failed entry to results
res_list.append({
"url": url,
"status": "failed",
"head_data": {},
"error": str(e)
})
finally:
queue.task_done()
# Start producer
producer_task = asyncio.create_task(producer())
# Start workers
worker_tasks = []
for _ in range(concurrency):
worker_task = asyncio.create_task(worker(results))
worker_tasks.append(worker_task)
# Wait for producer to finish
await producer_task
# Wait for all items to be processed
await queue.join()
# Cancel workers
for task in worker_tasks:
task.cancel()
# Wait for workers to finish canceling
await asyncio.gather(*worker_tasks, return_exceptions=True)
# Apply BM25 scoring if query is provided
if config.query and config.scoring_method == "bm25":
results = await self._apply_bm25_scoring(results, config)
# Apply score threshold filtering
if config.score_threshold is not None:
results = [r for r in results if r.get("relevance_score", 0) >= config.score_threshold]
# Sort by relevance score if available
if any("relevance_score" in r for r in results):
results.sort(key=lambda x: x.get("relevance_score", 0), reverse=True)
self._log("info", "Completed head extraction for {count} URLs, {success} successful",
params={
"count": len(urls),
"success": len([r for r in results if r.get("status") == "valid"])
}, tag="URL_SEED")
return results
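The bounded-queue producer/worker pattern used by `extract_head_for_urls` can be sketched on its own. This is a simplified illustration under stated assumptions: `process_all` and the `handler` callback are hypothetical names, and the workers block on `queue.get()` and are cancelled after `queue.join()`, instead of polling with a one-second timeout as the method above does.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable, List


async def process_all(
    items: Iterable[str],
    handler: Callable[[str], Awaitable[Dict[str, Any]]],
    concurrency: int = 3,
    queue_size: int = 10,
) -> List[Dict[str, Any]]:
    """Feed unique items through a bounded queue to a fixed pool of workers."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    results: List[Dict[str, Any]] = []

    async def producer() -> None:
        seen = set()
        for item in items:
            if item in seen:
                continue  # skip duplicates, as the seeder does
            seen.add(item)
            await queue.put(item)  # blocks while the queue is full

    async def worker() -> None:
        while True:
            item = await queue.get()
            try:
                results.append(await handler(item))
            except Exception as exc:
                # Record the failure instead of losing the item
                results.append({"url": item, "status": "failed", "error": str(exc)})
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    await producer()
    await queue.join()          # every queued item has been handled
    for task in workers:
        task.cancel()           # workers are idle in queue.get() at this point
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

The bounded queue keeps memory use flat for large inputs: the producer pauses in `queue.put` whenever the workers fall behind.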
async def _apply_bm25_scoring(self, results: List[Dict[str, Any]], config: "SeedingConfig") -> List[Dict[str, Any]]:
"""Apply BM25 scoring to results that have head_data."""
if not HAS_BM25:
self._log("warning", "BM25 scoring requested but rank_bm25 not available", tag="URL_SEED")
return results
# Extract text contexts from head data
text_contexts = []
valid_results = []
for result in results:
if result.get("status") == "valid" and result.get("head_data"):
text_context = self._extract_text_context(result["head_data"])
if text_context:
text_contexts.append(text_context)
valid_results.append(result)
else:
# Use URL-based scoring as fallback
score = self._calculate_url_relevance_score(config.query, result["url"])
result["relevance_score"] = float(score)
elif result.get("status") == "valid":
# No head data but valid URL - use URL-based scoring
score = self._calculate_url_relevance_score(config.query, result["url"])
result["relevance_score"] = float(score)
# Calculate BM25 scores for results with text context
if text_contexts and valid_results:
scores = await asyncio.to_thread(self._calculate_bm25_score, config.query, text_contexts)
for i, result in enumerate(valid_results):
if i < len(scores):
result["relevance_score"] = float(scores[i])
return results
async def _resolve_head(self, url: str) -> Optional[str]:
"""
HEAD-probe a URL.

@@ -502,9 +502,12 @@ class AsyncWebCrawler:
metadata = result.get("metadata", {})
else:
cleaned_html = sanitize_input_encode(result.cleaned_html)
media = result.media.model_dump()
tables = media.pop("tables", [])
links = result.links.model_dump()
# media = result.media.model_dump()
# tables = media.pop("tables", [])
# links = result.links.model_dump()
media = result.media.model_dump() if hasattr(result.media, 'model_dump') else result.media
tables = media.pop("tables", []) if isinstance(media, dict) else []
links = result.links.model_dump() if hasattr(result.links, 'model_dump') else result.links
metadata = result.metadata
fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)
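The defensive conversion in this hunk accepts either a model object exposing `model_dump()` or a value that is already plain data. A minimal sketch of that behaviour, using a hypothetical stand-in class (`FakeMedia`) and hypothetical helper names (`to_plain`, `split_media`):

```python
from typing import Any, List, Tuple


class FakeMedia:
    """Hypothetical stand-in for a model object that exposes model_dump()."""

    def model_dump(self) -> dict:
        return {"images": [], "tables": [{"rows": 1}]}


def to_plain(value: Any) -> Any:
    """Return model_dump() output when available, else the value unchanged."""
    return value.model_dump() if hasattr(value, "model_dump") else value


def split_media(media_obj: Any) -> Tuple[Any, List[Any]]:
    """Separate the 'tables' entry from media, tolerating non-dict input."""
    media = to_plain(media_obj)
    tables = media.pop("tables", []) if isinstance(media, dict) else []
    return media, tables
```

When the input is already a dict, `pop` mutates the caller's object, which is the same side effect the hunk's code has on `result.media` in that case.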

@@ -27,7 +27,10 @@ from crawl4ai import (
PruningContentFilter,
BrowserProfiler,
DefaultMarkdownGenerator,
LLMConfig
LLMConfig,
BFSDeepCrawlStrategy,
DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy,
)
from crawl4ai.config import USER_SETTINGS
from litellm import completion
@@ -1010,13 +1013,15 @@ def cdp_cmd(user_data_dir: Optional[str], port: int, browser_type: str, headless
@click.option("--crawler", "-c", type=str, callback=parse_key_values, help="Crawler parameters as key1=value1,key2=value2")
@click.option("--output", "-o", type=click.Choice(["all", "json", "markdown", "md", "markdown-fit", "md-fit"]), default="all")
@click.option("--output-file", "-O", type=click.Path(), help="Output file path (default: stdout)")
@click.option("--bypass-cache", "-b", is_flag=True, default=True, help="Bypass cache when crawling")
@click.option("--bypass-cache", "-bc", is_flag=True, default=True, help="Bypass cache when crawling")
@click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy (bfs, dfs, or best-first)")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str):
output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl a website and extract content
Simple Usage:
@@ -1156,6 +1161,27 @@ Always return valid, properly formatted JSON."""
crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()
# Handle deep crawling configuration
if deep_crawl:
if deep_crawl == "bfs":
crawler_cfg.deep_crawl_strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "dfs":
crawler_cfg.deep_crawl_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "best-first":
crawler_cfg.deep_crawl_strategy = BestFirstCrawlingStrategy(
max_depth=3,
max_pages=max_pages
)
if verbose:
console.print(f"[green]Deep crawling enabled:[/green] {deep_crawl} strategy, max {max_pages} pages")
config = get_global_config()
browser_cfg.verbose = config.get("VERBOSE", False)
@@ -1170,39 +1196,60 @@ Always return valid, properly formatted JSON."""
verbose
)
# Handle deep crawl results (list) vs single result
if isinstance(result, list):
if len(result) == 0:
click.echo("No results found during deep crawling")
return
# Use the first result for question answering and output
main_result = result[0]
all_results = result
else:
# Single result from regular crawling
main_result = result
all_results = [result]
# Handle question
if question:
provider, token = setup_llm_config()
markdown = result.markdown.raw_markdown
markdown = main_result.markdown.raw_markdown
anyio.run(stream_llm_response, url, markdown, question, provider, token)
return
# Handle output
if not output_file:
if output == "all":
click.echo(json.dumps(result.model_dump(), indent=2))
if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
click.echo(json.dumps(output_data, indent=2))
else:
click.echo(json.dumps(main_result.model_dump(), indent=2))
elif output == "json":
print(result.extracted_content)
extracted_items = json.loads(result.extracted_content)
print(main_result.extracted_content)
extracted_items = json.loads(main_result.extracted_content)
click.echo(json.dumps(extracted_items, indent=2))
elif output in ["markdown", "md"]:
click.echo(result.markdown.raw_markdown)
click.echo(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]:
click.echo(result.markdown.fit_markdown)
click.echo(main_result.markdown.fit_markdown)
else:
if output == "all":
with open(output_file, "w") as f:
f.write(json.dumps(result.model_dump(), indent=2))
if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
f.write(json.dumps(output_data, indent=2))
else:
f.write(json.dumps(main_result.model_dump(), indent=2))
elif output == "json":
with open(output_file, "w") as f:
f.write(result.extracted_content)
f.write(main_result.extracted_content)
elif output in ["markdown", "md"]:
with open(output_file, "w") as f:
f.write(result.markdown.raw_markdown)
f.write(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]:
with open(output_file, "w") as f:
f.write(result.markdown.fit_markdown)
f.write(main_result.markdown.fit_markdown)
except Exception as e:
raise click.ClickException(str(e))
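The deep-crawl additions to `crawl_cmd` do two things: pick a strategy from the `--deep-crawl` choice, and normalize the crawl result so the rest of the command can treat a list of results and a single result alike. The sketch below models both steps without importing crawl4ai. `StubStrategy`, `build_strategy`, and `normalize_results` are hypothetical names for illustration only; the fixed `max_depth=3` mirrors the value hard-coded in the diff.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class StubStrategy:
    """Hypothetical stand-in for the three deep-crawl strategy classes."""
    name: str
    max_depth: int
    max_pages: int


def build_strategy(deep_crawl: Optional[str], max_pages: int = 10) -> Optional[StubStrategy]:
    """Map the CLI choice to a strategy; None means regular single-page crawling."""
    if deep_crawl is None:
        return None
    if deep_crawl not in ("bfs", "dfs", "best-first"):
        raise ValueError(f"Unknown deep crawl strategy: {deep_crawl}")
    return StubStrategy(name=deep_crawl, max_depth=3, max_pages=max_pages)


def normalize_results(result: Any) -> Tuple[Optional[Any], List[Any]]:
    """Return (main_result, all_results) for list and single results alike."""
    if isinstance(result, list):
        return (result[0] if result else None), result
    return result, [result]
```

As in the diff, only the first result feeds question answering and the markdown and JSON outputs; the `all` output is the only mode that serializes every crawled page.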
@@ -1354,9 +1401,11 @@ def profiles_cmd():
@click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def default(url: str, example: bool, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, bypass_cache: bool, question: str, verbose: bool, profile: str):
output: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl4AI CLI - Web content extraction tool
Simple Usage:
@@ -1406,7 +1455,9 @@ def default(url: str, example: bool, browser_config: str, crawler_config: str, f
bypass_cache=bypass_cache,
question=question,
verbose=verbose,
profile=profile
profile=profile,
deep_crawl=deep_crawl,
max_pages=max_pages
)
def main():

@@ -23,6 +23,8 @@ from .utils import (
is_external_url,
get_base_domain,
extract_metadata_using_lxml,
extract_page_context,
calculate_link_intrinsic_score,
)
from lxml import etree
from lxml import html as lhtml
@@ -944,6 +946,72 @@ class WebScrapingStrategy(ContentScrapingStrategy):
# Update the links dictionary with unique links
links["internal"] = list(internal_links_dict.values())
links["external"] = list(external_links_dict.values())
# Extract head content for links if configured
link_preview_config = kwargs.get("link_preview_config")
if link_preview_config is not None:
try:
import asyncio
from .link_preview import LinkPreview
from .models import Links, Link
verbose = link_preview_config.verbose
if verbose:
self._log("info", "Starting link head extraction for {internal} internal and {external} external links",
params={"internal": len(links["internal"]), "external": len(links["external"])}, tag="LINK_EXTRACT")
# Convert dict links to Link objects
internal_links = [Link(**link_data) for link_data in links["internal"]]
external_links = [Link(**link_data) for link_data in links["external"]]
links_obj = Links(internal=internal_links, external=external_links)
# Create a config object for LinkPreview
class TempCrawlerRunConfig:
def __init__(self, link_config, score_links):
self.link_preview_config = link_config
self.score_links = score_links
config = TempCrawlerRunConfig(link_preview_config, kwargs.get("score_links", False))
# Extract head content (run async operation in sync context)
async def extract_links():
async with LinkPreview(self.logger) as extractor:
return await extractor.extract_link_heads(links_obj, config)
# Run the async operation
try:
# Check if we're already in an async context
loop = asyncio.get_running_loop()
# If we're in an async context, we need to run in a thread
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(asyncio.run, extract_links())
updated_links = future.result()
except RuntimeError:
# No running loop, we can use asyncio.run directly
updated_links = asyncio.run(extract_links())
# Convert back to dict format
links["internal"] = [link.dict() for link in updated_links.internal]
links["external"] = [link.dict() for link in updated_links.external]
if verbose:
successful_internal = len([l for l in updated_links.internal if l.head_extraction_status == "valid"])
successful_external = len([l for l in updated_links.external if l.head_extraction_status == "valid"])
self._log("info", "Link head extraction completed: {internal_success}/{internal_total} internal, {external_success}/{external_total} external",
params={
"internal_success": successful_internal,
"internal_total": len(updated_links.internal),
"external_success": successful_external,
"external_total": len(updated_links.external)
}, tag="LINK_EXTRACT")
else:
self._log("info", "Link head extraction completed successfully", tag="LINK_EXTRACT")
except Exception as e:
self._log("error", f"Link head extraction failed: {str(e)}", tag="LINK_EXTRACT")
# Continue with original links if extraction fails
# # Process images using ThreadPoolExecutor
imgs = body.find_all("img")
@@ -1037,6 +1105,7 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
media: Dict[str, List],
internal_links_dict: Dict[str, Any],
external_links_dict: Dict[str, Any],
page_context: dict = None,
**kwargs,
) -> bool:
base_domain = kwargs.get("base_domain", get_base_domain(url))
@@ -1056,6 +1125,25 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
"title": link.get("title", "").strip(),
"base_domain": base_domain,
}
# Add intrinsic scoring if enabled
if kwargs.get("score_links", False) and page_context is not None:
try:
intrinsic_score = calculate_link_intrinsic_score(
link_text=link_data["text"],
url=normalized_href,
title_attr=link_data["title"],
class_attr=link.get("class", ""),
rel_attr=link.get("rel", ""),
page_context=page_context
)
link_data["intrinsic_score"] = intrinsic_score
except Exception:
# Fail gracefully - assign default score
link_data["intrinsic_score"] = float('inf')
else:
# No scoring enabled - assign infinity (all links equal priority)
link_data["intrinsic_score"] = float('inf')
is_external = is_external_url(normalized_href, base_domain)
if is_external:
@@ -1491,6 +1579,33 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
base_domain = get_base_domain(url)
# Extract page context for link scoring (if enabled) - do this BEFORE any removals
page_context = None
if kwargs.get("score_links", False):
try:
# Extract title
title_elements = doc.xpath('//title')
page_title = title_elements[0].text_content() if title_elements else ""
# Extract headlines
headlines = []
for tag in ['h1', 'h2', 'h3']:
elements = doc.xpath(f'//{tag}')
for el in elements:
text = el.text_content().strip()
if text:
headlines.append(text)
headlines_text = ' '.join(headlines)
# Extract meta description
meta_desc_elements = doc.xpath('//meta[@name="description"]/@content')
meta_description = meta_desc_elements[0] if meta_desc_elements else ""
# Create page context
page_context = extract_page_context(page_title, headlines_text, meta_description, url)
except Exception:
page_context = {} # Fail gracefully
# Early removal of all images if exclude_all_images is set
# This is more efficient in lxml as we remove elements before any processing
if kwargs.get("exclude_all_images", False):
@@ -1579,6 +1694,7 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
media,
internal_links_dict,
external_links_dict,
page_context=page_context,
base_domain=base_domain,
**kwargs,
)
@@ -1623,14 +1739,84 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
method="html",
with_tail=False,
).strip()
# Create links dictionary in the format expected by LinkPreview
links = {
"internal": list(internal_links_dict.values()),
"external": list(external_links_dict.values()),
}
# Extract head content for links if configured
link_preview_config = kwargs.get("link_preview_config")
if link_preview_config is not None:
try:
import asyncio
from .link_preview import LinkPreview
from .models import Links, Link
verbose = link_preview_config.verbose
if verbose:
self._log("info", "Starting link head extraction for {internal} internal and {external} external links",
params={"internal": len(links["internal"]), "external": len(links["external"])}, tag="LINK_EXTRACT")
# Convert dict links to Link objects
internal_links = [Link(**link_data) for link_data in links["internal"]]
external_links = [Link(**link_data) for link_data in links["external"]]
links_obj = Links(internal=internal_links, external=external_links)
# Create a config object for LinkPreview
class TempCrawlerRunConfig:
def __init__(self, link_config, score_links):
self.link_preview_config = link_config
self.score_links = score_links
config = TempCrawlerRunConfig(link_preview_config, kwargs.get("score_links", False))
# Extract head content (run async operation in sync context)
async def extract_links():
async with LinkPreview(self.logger) as extractor:
return await extractor.extract_link_heads(links_obj, config)
# Run the async operation
try:
# Check if we're already in an async context
loop = asyncio.get_running_loop()
# If we're in an async context, we need to run in a thread
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(asyncio.run, extract_links())
updated_links = future.result()
except RuntimeError:
# No running loop, we can use asyncio.run directly
updated_links = asyncio.run(extract_links())
# Convert back to dict format
links["internal"] = [link.dict() for link in updated_links.internal]
links["external"] = [link.dict() for link in updated_links.external]
if verbose:
successful_internal = len([l for l in updated_links.internal if l.head_extraction_status == "valid"])
successful_external = len([l for l in updated_links.external if l.head_extraction_status == "valid"])
self._log("info", "Link head extraction completed: {internal_success}/{internal_total} internal, {external_success}/{external_total} external",
params={
"internal_success": successful_internal,
"internal_total": len(updated_links.internal),
"external_success": successful_external,
"external_total": len(updated_links.external)
}, tag="LINK_EXTRACT")
else:
self._log("info", "Link head extraction completed successfully", tag="LINK_EXTRACT")
except Exception as e:
self._log("error", f"Error during link head extraction: {str(e)}", tag="LINK_EXTRACT")
# Continue with original links if head extraction fails
return {
"cleaned_html": cleaned_html,
"success": success,
"media": media,
"links": {
"internal": list(internal_links_dict.values()),
"external": list(external_links_dict.values()),
},
"links": links,
"metadata": meta,
}
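Both scraping strategies run the async link-head extraction from synchronous code by checking for a running event loop and, if one exists, delegating `asyncio.run` to a worker thread. A minimal sketch of that technique follows. The helper name `run_coro_blocking` is hypothetical, and unlike the diff it takes a coroutine function so the coroutine object is created inside the thread that runs it.

```python
import asyncio
import concurrent.futures
from typing import Any, Awaitable, Callable


def run_coro_blocking(coro_factory: Callable[[], Awaitable[Any]]) -> Any:
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe here
        return asyncio.run(coro_factory())

    # A loop is already running in this thread; asyncio.run would raise,
    # so run the coroutine on a fresh loop in a separate thread instead.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(lambda: asyncio.run(coro_factory()))
        return future.result()
```

The trade-off is that `future.result()` blocks the calling thread, so when this is invoked from inside a running event loop, that loop makes no progress until the extraction finishes.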

@@ -1,7 +1,7 @@
from crawl4ai import BrowserConfig, AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.hub import BaseCrawler
from crawl4ai.utils import optimize_html, get_home_folder, preprocess_html_for_schema
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
from pathlib import Path
import json
import os

crawl4ai/link_preview.py (new file, 395 lines added)

@@ -0,0 +1,395 @@
"""
Link Preview for Crawl4AI
Extracts head content from links discovered during crawling using URLSeeder's
efficient parallel processing and caching infrastructure.
"""
import asyncio
import fnmatch
from typing import Dict, List, Optional, Any
from .async_logger import AsyncLogger
from .async_url_seeder import AsyncUrlSeeder
from .async_configs import SeedingConfig, CrawlerRunConfig
from .models import Links, Link
from .utils import calculate_total_score
class LinkPreview:
"""
Extracts head content from links using URLSeeder's parallel processing infrastructure.
This class provides intelligent link filtering and head content extraction with:
- Pattern-based inclusion/exclusion filtering
- Parallel processing with configurable concurrency
- Caching for performance
- BM25 relevance scoring
- Memory-safe processing for large link sets
"""
def __init__(self, logger: Optional[AsyncLogger] = None):
"""
Initialize the LinkPreview.
Args:
logger: Optional logger instance for recording events
"""
self.logger = logger
self.seeder: Optional[AsyncUrlSeeder] = None
self._owns_seeder = False
async def __aenter__(self):
"""Async context manager entry."""
await self.start()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
await self.close()
async def start(self):
"""Initialize the URLSeeder instance."""
if not self.seeder:
self.seeder = AsyncUrlSeeder(logger=self.logger)
await self.seeder.__aenter__()
self._owns_seeder = True
async def close(self):
"""Clean up resources."""
if self.seeder and self._owns_seeder:
await self.seeder.__aexit__(None, None, None)
self.seeder = None
self._owns_seeder = False
def _log(self, level: str, message: str, tag: str = "LINK_EXTRACT", **kwargs):
"""Helper method to safely log messages."""
if self.logger:
log_method = getattr(self.logger, level, None)
if log_method:
log_method(message=message, tag=tag, params=kwargs.get('params', {}))
async def extract_link_heads(
self,
links: Links,
config: CrawlerRunConfig
) -> Links:
"""
Extract head content for filtered links and attach to Link objects.
Args:
links: Links object containing internal and external links
config: CrawlerRunConfig with link_preview_config settings
Returns:
Links object with head_data attached to filtered Link objects
"""
link_config = config.link_preview_config
# Ensure seeder is initialized
await self.start()
# Filter links based on configuration
filtered_urls = self._filter_links(links, link_config)
if not filtered_urls:
self._log("info", "No links matched filtering criteria")
return links
self._log("info", "Extracting head content for {count} filtered links",
params={"count": len(filtered_urls)})
# Extract head content using URLSeeder
head_results = await self._extract_heads_parallel(filtered_urls, link_config)
# Merge results back into Link objects
updated_links = self._merge_head_data(links, head_results, config)
self._log("info", "Completed head extraction for links, {success} successful",
params={"success": len([r for r in head_results if r.get("status") == "valid"])})
return updated_links
def _filter_links(self, links: Links, link_config: Dict[str, Any]) -> List[str]:
"""
Filter links based on configuration parameters.
Args:
links: Links object containing internal and external links
link_config: Link preview configuration object; settings are read as attributes (e.g. include_internal, max_links)
Returns:
List of filtered URL strings
"""
filtered_urls = []
# Include internal links if configured
if link_config.include_internal:
filtered_urls.extend([link.href for link in links.internal if link.href])
self._log("debug", "Added {count} internal links",
params={"count": len(links.internal)})
# Include external links if configured
if link_config.include_external:
filtered_urls.extend([link.href for link in links.external if link.href])
self._log("debug", "Added {count} external links",
params={"count": len(links.external)})
# Apply include patterns
include_patterns = link_config.include_patterns
if include_patterns:
filtered_urls = [
url for url in filtered_urls
if any(fnmatch.fnmatch(url, pattern) for pattern in include_patterns)
]
self._log("debug", "After include patterns: {count} links remain",
params={"count": len(filtered_urls)})
# Apply exclude patterns
exclude_patterns = link_config.exclude_patterns
if exclude_patterns:
filtered_urls = [
url for url in filtered_urls
if not any(fnmatch.fnmatch(url, pattern) for pattern in exclude_patterns)
]
self._log("debug", "After exclude patterns: {count} links remain",
params={"count": len(filtered_urls)})
# Limit number of links
max_links = link_config.max_links
if max_links > 0 and len(filtered_urls) > max_links:
filtered_urls = filtered_urls[:max_links]
self._log("debug", "Limited to {max_links} links",
params={"max_links": max_links})
# Remove duplicates while preserving order
seen = set()
unique_urls = []
for url in filtered_urls:
if url not in seen:
seen.add(url)
unique_urls.append(url)
self._log("debug", "Final filtered URLs: {count} unique links",
params={"count": len(unique_urls)})
return unique_urls
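The filtering order in `_filter_links` (include patterns, exclude patterns, limit, then deduplication) can be reproduced with the standard library alone. This is a sketch under assumptions: `filter_urls` is a hypothetical helper that takes plain URL strings, and the example.com URLs in the usage note are placeholders.

```python
import fnmatch
from typing import List, Sequence


def filter_urls(
    urls: Sequence[str],
    include_patterns: Sequence[str] = (),
    exclude_patterns: Sequence[str] = (),
    max_links: int = 0,
) -> List[str]:
    """Apply glob filters, then the limit, then order-preserving deduplication."""
    kept = list(urls)
    if include_patterns:
        kept = [u for u in kept if any(fnmatch.fnmatch(u, p) for p in include_patterns)]
    if exclude_patterns:
        kept = [u for u in kept if not any(fnmatch.fnmatch(u, p) for p in exclude_patterns)]
    if max_links > 0:
        kept = kept[:max_links]          # limit is applied before deduplication
    return list(dict.fromkeys(kept))     # dicts preserve insertion order
```

Because the limit runs before deduplication, duplicated links count against `max_links`: filtering `["https://example.com/docs/a", "https://example.com/docs/a", "https://example.com/docs/c"]` with `max_links=2` returns a single URL. In `fnmatch` patterns, `*` also matches `/`, so `*/docs/*` matches at any path depth.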
async def _extract_heads_parallel(
self,
urls: List[str],
link_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""
Extract head content for URLs using URLSeeder's parallel processing.
Args:
urls: List of URLs to process
link_config: Link preview configuration object; settings are read as attributes (e.g. concurrency, timeout, query)
Returns:
List of dictionaries with url, status, head_data, and optional relevance_score
"""
verbose = link_config.verbose
concurrency = link_config.concurrency
if verbose:
self._log("info", "Starting batch processing: {total} links with {concurrency} concurrent workers",
params={"total": len(urls), "concurrency": concurrency})
# Create SeedingConfig for URLSeeder
seeding_config = SeedingConfig(
extract_head=True,
concurrency=concurrency,
hits_per_sec=getattr(link_config, 'hits_per_sec', None),
query=link_config.query,
score_threshold=link_config.score_threshold,
scoring_method="bm25" if link_config.query else None,
verbose=verbose
)
# Use URLSeeder's extract_head_for_urls method with progress tracking
if verbose:
# Create a wrapper to track progress
results = await self._extract_with_progress(urls, seeding_config, link_config)
else:
results = await self.seeder.extract_head_for_urls(
urls=urls,
config=seeding_config,
concurrency=concurrency,
timeout=link_config.timeout
)
return results
async def _extract_with_progress(
self,
urls: List[str],
seeding_config: SeedingConfig,
link_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
"""Extract head content with progress reporting."""
total_urls = len(urls)
concurrency = link_config.concurrency
batch_size = max(1, total_urls // 10) # Report progress every 10%
# Process URLs and track progress
completed = 0
successful = 0
failed = 0
# URLSeeder does not accept progress callbacks yet, so this method runs the
# regular extraction and reports totals once it has finished.
self._log("info", "Processing links in batches...")
# Use existing method
results = await self.seeder.extract_head_for_urls(
urls=urls,
config=seeding_config,
concurrency=concurrency,
timeout=link_config.timeout
)
# Count results
for result in results:
completed += 1
if result.get("status") == "valid":
successful += 1
else:
failed += 1
# Final progress report
self._log("info", "Batch processing completed: {completed}/{total} processed, {successful} successful, {failed} failed",
params={
"completed": completed,
"total": total_urls,
"successful": successful,
"failed": failed
})
return results

    def _merge_head_data(
        self,
        original_links: Links,
        head_results: List[Dict[str, Any]],
        config: CrawlerRunConfig
    ) -> Links:
        """
        Merge head extraction results back into Link objects.

        Args:
            original_links: Original Links object
            head_results: Results from head extraction
            config: Crawler run config; supplies score_links and link_preview_config.query

        Returns:
            Links object with head_data attached to matching links
        """
        # Create URL to head_data mapping
        url_to_head_data = {}
        for result in head_results:
            url = result.get("url")
            if url:
                url_to_head_data[url] = {
                    "head_data": result.get("head_data", {}),
                    "status": result.get("status", "unknown"),
                    "error": result.get("error"),
                    "relevance_score": result.get("relevance_score")
                }
        # Update internal links
        updated_internal = []
        for link in original_links.internal:
            if link.href in url_to_head_data:
                head_info = url_to_head_data[link.href]
                # Create new Link object with head data and scoring
                contextual_score = head_info.get("relevance_score")
                updated_link = Link(
                    href=link.href,
                    text=link.text,
                    title=link.title,
                    base_domain=link.base_domain,
                    head_data=head_info["head_data"],
                    head_extraction_status=head_info["status"],
                    head_extraction_error=head_info.get("error"),
                    intrinsic_score=getattr(link, 'intrinsic_score', None),
                    contextual_score=contextual_score
                )
                # Add relevance score to head_data for backward compatibility
                if contextual_score is not None:
                    updated_link.head_data = updated_link.head_data or {}
                    updated_link.head_data["relevance_score"] = contextual_score
                # Calculate total score combining intrinsic and contextual scores
                updated_link.total_score = calculate_total_score(
                    intrinsic_score=updated_link.intrinsic_score,
                    contextual_score=updated_link.contextual_score,
                    score_links_enabled=getattr(config, 'score_links', False),
                    query_provided=bool(config.link_preview_config.query)
                )
                updated_internal.append(updated_link)
            else:
                # Keep original link unchanged
                updated_internal.append(link)
        # Update external links
        updated_external = []
        for link in original_links.external:
            if link.href in url_to_head_data:
                head_info = url_to_head_data[link.href]
                # Create new Link object with head data and scoring
                contextual_score = head_info.get("relevance_score")
                updated_link = Link(
                    href=link.href,
                    text=link.text,
                    title=link.title,
                    base_domain=link.base_domain,
                    head_data=head_info["head_data"],
                    head_extraction_status=head_info["status"],
                    head_extraction_error=head_info.get("error"),
                    intrinsic_score=getattr(link, 'intrinsic_score', None),
                    contextual_score=contextual_score
                )
                # Add relevance score to head_data for backward compatibility
                if contextual_score is not None:
                    updated_link.head_data = updated_link.head_data or {}
                    updated_link.head_data["relevance_score"] = contextual_score
                # Calculate total score combining intrinsic and contextual scores
                updated_link.total_score = calculate_total_score(
                    intrinsic_score=updated_link.intrinsic_score,
                    contextual_score=updated_link.contextual_score,
                    score_links_enabled=getattr(config, 'score_links', False),
                    query_provided=bool(config.link_preview_config.query)
                )
                updated_external.append(updated_link)
            else:
                # Keep original link unchanged
                updated_external.append(link)
        # Sort links by relevance score if available
        if any(hasattr(link, 'head_data') and link.head_data and 'relevance_score' in link.head_data
               for link in updated_internal + updated_external):

            def get_relevance_score(link):
                if hasattr(link, 'head_data') and link.head_data and 'relevance_score' in link.head_data:
                    return link.head_data['relevance_score']
                return 0.0

            updated_internal.sort(key=get_relevance_score, reverse=True)
            updated_external.sort(key=get_relevance_score, reverse=True)
        return Links(
            internal=updated_internal,
            external=updated_external
        )
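
The internal and external loops in `_merge_head_data` perform the same merge, so the logic can be read as one helper applied twice. The sketch below illustrates that reading under stated assumptions: `SketchLink` is a hypothetical stand-in for the `Link` model with only the fields needed here, `merge_links` is an illustrative name, and the total-score step is left out because `calculate_total_score` is not shown in this diff.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List, Optional


@dataclass
class SketchLink:
    """Hypothetical stand-in for the Link model; only the fields used here."""
    href: str
    head_data: Optional[Dict[str, Any]] = None
    head_extraction_status: Optional[str] = None
    contextual_score: Optional[float] = None


def merge_links(links: List[SketchLink], head_results: List[Dict[str, Any]]) -> List[SketchLink]:
    """Attach head results to matching links, then sort by relevance (highest first)."""
    by_url = {r["url"]: r for r in head_results if r.get("url")}
    merged: List[SketchLink] = []
    for link in links:
        info = by_url.get(link.href)
        if info is None:
            merged.append(link)  # no head result: keep the original link
            continue
        score = info.get("relevance_score")
        head_data = dict(info.get("head_data") or {})
        if score is not None:
            # Mirror the score into head_data, as the diff does for compatibility.
            head_data["relevance_score"] = score
        merged.append(
            replace(
                link,
                head_data=head_data,
                head_extraction_status=info.get("status", "unknown"),
                contextual_score=score,
            )
        )
    # Links without a score sort as 0.0, matching the fallback in the diff.
    merged.sort(key=lambda l: (l.head_data or {}).get("relevance_score", 0.0), reverse=True)
    return merged
```

Calling the helper once for the internal list and once for the external list would give the same ordering as the duplicated loops, since Python's sort is stable when no link carries a score.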


@@ -345,6 +345,12 @@ class Link(BaseModel):
    text: Optional[str] = ""
    title: Optional[str] = ""
    base_domain: Optional[str] = ""
    head_data: Optional[Dict[str, Any]] = None  # Head metadata extracted from link target
    head_extraction_status: Optional[str] = None  # "success", "failed", "skipped"
    head_extraction_error: Optional[str] = None  # Error message if extraction failed
    intrinsic_score: Optional[float] = None  # Quality score based on URL structure, text, and context
    contextual_score: Optional[float] = None  # BM25 relevance score based on query and head content
    total_score: Optional[float] = None  # Combined score from intrinsic and contextual scores

class Media(BaseModel):


@@ -1054,4 +1054,525 @@ Your output must:
5. Include all required fields
6. Use valid XPath selectors
</output_requirements>
"""
"""
GENERATE_SCRIPT_PROMPT = """You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction.
Your scripts run **before the crawl** to handle dynamic content, user interactions, and other obstacles. You are a master of two tools: raw **JavaScript** and the high-level **Crawl4ai Script (c4a)**.
────────────────────────────────────────────────────────
## Your Core Philosophy: "Efficiency, Robustness, Simplicity"
This is your mantra. Every line of code you write must adhere to it.
1. **Efficiency (Shortest Path):** Generate the absolute minimum number of steps to achieve the goal. Do not include redundant actions. If a `CLICK` on one button achieves the goal, don't also scroll and wait unnecessarily.
2. **Robustness (Will Not Break):** Prioritize selectors and methods that are resistant to cosmetic site changes. `data-*` attributes are gold. Dynamic, auto-generated class names (`.class-a8B_x3`) are poison. Always prefer waiting for a state change (`WAIT \`#results\``) over a blind delay (`WAIT 5`).
3. **Simplicity (Right Tool for the Job):** Use the simplest tool that works. Prefer a direct `c4a` command over `EVAL` with JavaScript. Only use `EVAL` when the task is impossible with standard commands (e.g., accessing Shadow DOM, complex array filtering).
────────────────────────────────────────────────────────
## Output Mode Selection Logic
Your choice of output mode is a critical strategic decision.
* **Use `crawl4ai_script` for:**
* Standard, sequential browser actions: login forms, clicking "next page," simple "load more" buttons, accepting cookie banners.
* When the user's goal maps clearly to the available `c4a` commands.
* When you need to define reusable macros with `PROC`.
* **Use `javascript` for:**
* Complex DOM manipulation that has no `c4a` equivalent (e.g., transforming data, complex filtering).
* Interacting with web components inside **Shadow DOM** or **iFrames**.
* Implementing sophisticated logic like custom scrolling patterns or handling non-standard events.
* When the goal is a fine-grained DOM tweak, not a full user journey.
**If the user specifies a mode, you MUST respect it.** If not, you must choose the mode that best embodies your core philosophy.
────────────────────────────────────────────────────────
## Available Crawl4ai Commands
| Command | Arguments / Notes |
|------------------------|--------------------------------------------------------------|
| GO `<url>` | Navigate to absolute URL |
| RELOAD | Hard refresh |
| BACK / FORWARD | Browser history nav |
| WAIT `<seconds>` | **Avoid!** Passive delay. Use only as a last resort. |
| WAIT \`<css>\` `<t>` | **Preferred wait.** Poll selector until found, timeout in seconds. |
| WAIT "<text>" `<t>` | Poll page text until found, timeout in seconds. |
| CLICK \`<css>\` | Single click on element |
| CLICK `<x>` `<y>` | Viewport click |
| DOUBLE_CLICK … | Two rapid clicks |
| RIGHT_CLICK … | Context-menu click |
| MOVE `<x>` `<y>` | Mouse move |
| DRAG `<x1>` `<y1>` `<x2>` `<y2>` | Click-drag gesture |
| SCROLL UP|DOWN|LEFT|RIGHT `[px]` | Viewport scroll |
| TYPE "<text>" | Type into focused element |
| CLEAR \`<css>\` | Empty input |
| SET \`<css>\` "<val>" | Set element value and dispatch events |
| PRESS `<Key>` | Keydown + keyup |
| KEY_DOWN `<Key>` / KEY_UP `<Key>` | Separate key events |
| EVAL \`<js>\` | **Your fallback.** Run JS when no direct command exists. |
| SETVAR $name = <val> | Store constant for reuse |
| PROC name … ENDPROC | Define macro |
| IF / ELSE / REPEAT | Flow control |
| USE "<file.c4a>" | Include another script, avoid circular includes |
────────────────────────────────────────────────────────
## Strategic Principles & Anti-Patterns
These are your commandments. Do not deviate.
1. **Selector Quality is Paramount:**
* **GOOD:** `[data-testid="submit-button"]`, `#main-content`, `[aria-label="Close dialog"]`
* **BAD:** `div > span:nth-child(3)`, `.button-gR3xY_s`, `//div[contains(@class, 'button')]`
2. **Wait for State, Not for Time:**
* **DO:** `CLICK \`#load-more\`` followed by `WAIT \`div.new-item\` 10`. This waits for the *result* of the action.
* **DON'T:** `CLICK \`#load-more\`` followed by `WAIT 5`. This is a guess and it will fail.
3. **Target the Action, Not the Artifact:** If you need to reveal content, click the button that reveals it. Don't try to manually change CSS `display` properties, as this can break the page's internal state.
4. **DOM-Awareness is Non-Negotiable:**
* **Shadow DOM:** `c4a` commands CANNOT pierce the Shadow DOM. If you see a `#shadow-root (open)` in the HTML, you MUST use `EVAL` and `element.shadowRoot.querySelector(...)`.
* **iFrames:** Likewise, you MUST use `EVAL` and `iframe.contentDocument.querySelector(...)` to interact with elements inside an iframe.
5. **Be Idempotent:** Your script must be harmless if run multiple times. Use `IF EXISTS` to check for states before acting (e.g., don't try to log in if already logged in).
6. **Forbidden Techniques:** Never use `document.write()`. It is destructive. Avoid overly complex JS in `EVAL` that could be simplified into a few `c4a` commands.
────────────────────────────────────────────────────────
## From Vague Goals to Robust Scripts: Your Duty to Infer and Ensure Reliability
This is your most important responsibility. Users are not automation experts. They will provide incomplete or vague instructions. Your job is to be the expert—to infer their true goal and build a script that is reliable by default. You must add the "invisible scaffolding" of checks and waits to ensure the page is stable and ready for the crawler. **A vague user prompt must still result in a robust, complete script.**
Study these examples. No matter which query is given, your output must be the single, robust solution.
### 1. Scenario: Basic Search Query
* **High Detail Query:** "Find the search box and search button. Wait for the search box to be visible, click it, clear it, type 'r2d2', click the search button, and then wait for the search results to appear."
* **Medium Detail Query:** "Find the search box, search for 'r2d2', and click the search button to get a list of items."
* **Low Detail Query:** "Search for r2d2."
**THE CORRECT, ROBUST OUTPUT (for all three queries):**
```
WAIT `input[type="search"]` 10
SET `input[type="search"]` "r2d2"
CLICK `button[aria-label="Search"]`
WAIT `div.search-results-container` 15
```
**Rationale:** You correctly infer the need to `WAIT` for the input first. You use the more efficient `SET` command. Most importantly, you **infer the crucial final step**: waiting for a results container to appear, confirming the search action was successful.
### 2. Scenario: Clicking a "Load More" Button
* **High Detail Query:** "Click the button with the text 'Load More'. Afterward, wait for a new item with the class '.product-tile' to show up on the page."
* **Medium Detail Query:** "Click the load more button to see more products."
* **Low Detail Query:** "Load more items."
**THE CORRECT, ROBUST OUTPUT:**
```
IF EXISTS `button.load-more` THEN
  CLICK `button.load-more`
  WAIT `div.new-item-indicator` 8
ENDIF
```
**Rationale:** You wrap the action in `IF EXISTS` to prevent errors if the button is not present (e.g., on the last page). You correctly infer the need to wait for the *consequence* of the click—a new item appearing—rather than a blind `WAIT 2`.
### 3. Scenario: Applying a Filter from a Dropdown
* **High Detail Query:** "First, click the dropdown with id 'color-filter'. Then, from the open menu, click the option that says 'Blue'. Finally, wait for the product grid to update."
* **Medium Detail Query:** "Filter the products by the color Blue."
* **Low Detail Query:** "Show blue products."
**THE CORRECT, ROBUST OUTPUT:**
```
CLICK `#color-filter`
WAIT `[data-value="blue"]` 3
CLICK `[data-value="blue"]`
WAIT `#product-grid[data-status="updated"]` 10
```
**Rationale:** You infer the need for two waits: one for the menu options to appear after the first click, and a second for the main content to update after the filter is applied. This prevents race conditions.
### 4. Scenario: User Authentication (Login)
* **High Detail Query:** "Fill username with 'USER_EMAIL', fill password with 'USER_PASS', click login, and wait for the dashboard to appear."
* **Medium Detail Query:** "Log in as USER_EMAIL with password USER_PASS."
* **Low Detail Query:** "Log in."
**THE CORRECT, ROBUST OUTPUT:**
```
IF EXISTS `[data-testid="logout-button"]` THEN
  EVAL `console.log("Already logged in.")`
ELSE
  WAIT `input[name="username"]` 10
  SET `input[name="username"]` "USER_EMAIL"
  SET `input[name="password"]` "USER_PASS"
  CLICK `button[type="submit"]`
  WAIT `[data-testid="user-dashboard"]` 15
ENDIF
```
**Rationale:** You build an **idempotent** script. You first check if the user is *already* logged in. If not, you proceed with the login and then, critically, `WAIT` for a post-login element to confirm success. You use placeholders when credentials are not provided in low-detail queries.
### 5. Scenario: Dismissing an Interstitial Modal
* **High Detail Query:** "Check if a popup with id '#promo-modal' exists. If it does, click the close button inside it with class '.close-x'."
* **Medium Detail Query:** "Close the promotional popup."
* **Low Detail Query:** "Get rid of the popup."
**THE CORRECT, ROBUST OUTPUT:**
```
IF EXISTS `div#promo-modal` THEN
  CLICK `div#promo-modal button.close-x`
ENDIF
```
**Rationale:** You correctly identify this as a conditional action. The script must not fail if the popup doesn't appear. The `IF EXISTS` block is the perfect, robust way to handle this optional interaction.
────────────────────────────────────────────────────────
## Advanced Scenarios & Master-Level Examples
Study these solutions. Understand the *why* behind each choice.
### Scenario: Interacting with a Web Component (Shadow DOM)
**Goal:** Click a button inside a custom element `<user-card>`.
**HTML Snippet:** `<user-card><#shadow-root (open)><button>Details</button></#shadow-root></user-card>`
**Correct Mode:** `javascript` (or `c4a` with `EVAL`)
**Rationale:** Standard selectors can't cross the shadow boundary. JavaScript is mandatory.
```javascript
// Solution in pure JS mode
const card = document.querySelector('user-card');
if (card && card.shadowRoot) {
    const button = card.shadowRoot.querySelector('button');
    if (button) button.click();
}
```
```
# Solution in c4a mode (using EVAL as the weapon of choice)
EVAL `
  const card = document.querySelector('user-card');
  if (card && card.shadowRoot) {
    const button = card.shadowRoot.querySelector('button');
    if (button) button.click();
  }
`
```
### Scenario: Handling a Cookie Banner
**Goal:** Accept the cookies to dismiss the modal.
**HTML Snippet:** `<div id="cookie-consent-modal"><button id="accept-cookies">Accept All</button></div>`
**Correct Mode:** `crawl4ai_script`
**Rationale:** A simple, direct action. `c4a` is cleaner and more declarative.
```
# The most efficient solution
IF EXISTS `#cookie-consent-modal` THEN
  CLICK `#accept-cookies`
  WAIT `div.content-loaded` 5
ENDIF
```
### Scenario: Infinite Scroll Page
**Goal:** Scroll down 5 times to load more content.
**HTML Snippet:** `(A page with a long body and no "load more" button)`
**Correct Mode:** `crawl4ai_script`
**Rationale:** `REPEAT` is designed for exactly this. It's more readable than a JS loop for this simple task.
```
REPEAT (
  SCROLL DOWN 1000,
  5
)
WAIT 2
```
### Scenario: Hover-to-Reveal Menu
**Goal:** Hover over "Products" to open the menu, then click "Laptops".
**HTML Snippet:** `<a href="/products" id="products-menu">Products</a> <div class="menu-dropdown"><a href="/laptops">Laptops</a></div>`
**Correct Mode:** `crawl4ai_script` (with `EVAL`)
**Rationale:** `c4a` has no `HOVER` command. `EVAL` is the perfect tool to dispatch the `mouseover` event.
```
EVAL `document.querySelector('#products-menu').dispatchEvent(new MouseEvent('mouseover', { bubbles: true }))`
WAIT `div.menu-dropdown a[href="/laptops"]` 3
CLICK `div.menu-dropdown a[href="/laptops"]`
```
### Scenario: Login Form
**Goal:** Fill and submit a login form.
**HTML Snippet:** `<form><input name="email"><input name="password" type="password"><button type="submit"></button></form>`
**Correct Mode:** `crawl4ai_script`
**Rationale:** This is the canonical use case for `c4a`. The commands map 1:1 to the user journey.
```
WAIT `form` 10
SET `input[name="email"]` "USER_EMAIL"
SET `input[name="password"]` "USER_PASS"
CLICK `button[type="submit"]`
WAIT `[data-testid="user-dashboard"]` 12
```
────────────────────────────────────────────────────────
## Final Output Mandate
1. **CODE ONLY.** Your entire response must be the script body.
2. **NO CHAT.** Do not say "Here is the script" or "This should work."
3. **NO MARKDOWN.** Do not wrap your code in ` ``` ` fences.
4. **NO COMMENTS.** Do not add comments to the final code output.
5. **SYNTACTICALLY PERFECT.** The script must be immediately executable.
6. **UTF-8, STANDARD QUOTES.** Use `"` for string literals, not `“` or `”`.
You are an engine of automation. Now, receive the user's request and produce the optimal script."""
GENERATE_JS_SCRIPT_PROMPT = """# The World-Class JavaScript Automation Scripter
You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** pure JavaScript script possible to prepare a web page for data extraction.
Your scripts will be executed directly in the browser (e.g., via Playwright's `page.evaluate()`) to handle dynamic content, user interactions, and other obstacles before the page is crawled. You are a master of browser-native JavaScript APIs.
────────────────────────────────────────────────────────
## Your Core Philosophy: "Efficiency, Robustness, Simplicity"
This is your mantra. Every line of JavaScript you write must adhere to it.
1. **Efficiency (Shortest Path):** Generate the absolute minimum number of steps to achieve the goal. Do not include redundant actions. Your code should be concise and direct.
2. **Robustness (Will Not Break):** Prioritize selectors that are resistant to cosmetic site changes. `data-*` attributes are gold. Dynamic, auto-generated class names (`.class-a8B_x3`) are poison. Always prefer waiting for a state change over a blind `setTimeout`.
3. **Simplicity (Right Tool for the Job):** Use simple, direct DOM methods (`.querySelector`, `.click()`) whenever possible. Avoid overly complex or fragile logic when a simpler approach exists.
────────────────────────────────────────────────────────
## Essential JavaScript Automation Patterns & Toolkit
All code should be wrapped in an `async` Immediately Invoked Function Expression `(async () => { ... })();` to allow for top-level `await` and to avoid polluting the global scope.
| Task | Best-Practice JavaScript Implementation |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Wait for Element** | Create and use a robust `waitForElement` helper function. This is your most important tool. <br> `const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); });` |
| **Click Element** | `const el = await waitForElement('selector'); if (el) el.click();` |
| **Set Input Value** | `const input = await waitForElement('selector'); if (input) { input.value = 'new value'; input.dispatchEvent(new Event('input', { bubbles: true })); input.dispatchEvent(new Event('change', { bubbles: true })); }` <br> *Crucially, always dispatch `input` and `change` events to trigger framework reactivity.* |
| **Check Existence** | `const el = document.querySelector('selector'); if (el) { /* ... it exists */ }` |
| **Scroll** | `window.scrollBy(0, window.innerHeight);` |
| **Deal with Time** | Use `await new Promise(r => setTimeout(r, 500));` for short, unavoidable pauses after an action. **Avoid long, blind waits.** |
REMEMBER: Generate highly deterministic CSS selectors. If you refer to a specific button, target exactly that button; a loose selector may capture elements you do not need. Be very specific about the element you want to interact with.
────────────────────────────────────────────────────────
## The Art of High-Specificity Selectors: Your Defense Against Ambiguity
This is your most critical skill for ensuring robustness. **You must assume the provided HTML is only a small fragment of the entire page.** A selector that looks unique in the fragment could be disastrously generic on the full page. Your primary defense is to **anchor your selectors to the most specific, stable parent element available in the given HTML context.**
Think of it as creating a "sandbox" for your selectors.
**Your Guiding Principle:** Start from a unique parent, then find the child.
### Scenario: Selecting a Submit Button within a Login Form
**HTML Snippet Provided:**
```html
<div class="user-auth-module" id="login-widget">
  <h2>Member Login</h2>
  <form action="/login">
    <input name="email" type="email">
    <input name="password" type="password">
    <button type="submit">Sign In</button>
  </form>
</div>
```
* **TERRIBLE (High Risk):** `button[type="submit"]`
* **Why it's bad:** There could be dozens of other forms on the full page (e.g., a newsletter signup, a search bar in the header). This selector is a shot in the dark.
* **BETTER (Lower Risk):** `#login-widget button[type="submit"]`
* **Why it's better:** It's anchored to a unique ID (`#login-widget`). This dramatically reduces the chance of ambiguity.
* **EXCELLENT (Minimal Risk):** `div[id="login-widget"] form button[type="submit"]`
* **Why it's best:** This is a highly specific, descriptive path. It says, "Find the login widget, then the form inside it, and then the submit button inside *that* form." It is virtually guaranteed to be unique and is resilient to minor layout changes within the form.
### Scenario: Selecting an "Add to Cart" Button
**HTML Snippet Provided:**
```html
<section data-testid="product-details-main">
  <h1>Awesome T-Shirt</h1>
  <div class="product-actions">
    <button class="add-to-cart-btn">Add to Cart</button>
  </div>
</section>
```
* **TERRIBLE (High Risk):** `.add-to-cart-btn`
* **Why it's bad:** A "related products" section outside this snippet might also use the same class name.
* **EXCELLENT (Minimal Risk):** `[data-testid="product-details-main"] .add-to-cart-btn`
* **Why it's best:** It uses the stable `data-testid` attribute of the parent section as an anchor. This is the most robust pattern.
**Your Mandate:** Always examine the provided HTML for a stable, unique parent (like an element with an `id`, a `data-testid`, or a highly specific combination of classes) and use it as the root of your selectors. **NEVER generate a generic, un-anchored selector if a better, more specific parent is available in the context.**
────────────────────────────────────────────────────────
## Strategic Principles & Anti-Patterns
These are your commandments. Do not deviate.
1. **Selector Quality is Paramount:**
* **GOOD:** `[data-testid="submit-button"]`, `#main-content`, `[aria-label="Close dialog"]`
* **BAD:** `div > span:nth-child(3)`, `.button-gR3xY_s`, `//div[contains(@class, 'button')]`
2. **Wait for State, Not for Time:**
* **DO:** `(await waitForElement('#load-more')).click(); await waitForElement('div.new-item');` This waits for the *result* of the action.
* **DON'T:** `document.querySelector('#load-more').click(); await new Promise(r => setTimeout(r, 5000));` This is a guess and it will fail.
3. **Target the Action, Not the Artifact:** If you need to reveal content, click the button that reveals it. Don't try to manually change CSS `display` properties, as this can break the page's internal state.
4. **DOM-Awareness is Non-Negotiable:**
* **Shadow DOM:** You MUST use `element.shadowRoot.querySelector(...)` to access elements inside a `#shadow-root (open)`.
* **iFrames:** You MUST use `iframe.contentDocument.querySelector(...)` to interact with elements inside an iframe.
5. **Be Idempotent:** Your script must be harmless if run multiple times. Use `if (document.querySelector(...))` checks to avoid re-doing actions unnecessarily.
6. **Forbidden Techniques:** Never use `document.write()`. It is destructive.
────────────────────────────────────────────────────────
## From Vague Goals to Robust Scripts: Your Duty to Infer and Ensure Reliability
This is your most important responsibility. Users are not automation experts. They will provide incomplete or vague instructions. Your job is to be the expert—to infer their true goal and build a script that is reliable by default. **A vague user prompt must still result in a robust, complete script.**
Study these examples. No matter which query is given, your output must be the single, robust solution.
### 1. Scenario: Basic Search Query
* **High Detail Query:** "Find the search box and search button. Wait for the search box to be visible, click it, clear it, type 'r2d2', click the search button, and then wait for the search results to appear."
* **Medium Detail Query:** "Find the search box and search for 'r2d2'."
* **Low Detail Query:** "Search for r2d2."
**THE CORRECT, ROBUST JAVASCRIPT OUTPUT (for all three queries):**
```javascript
(async () => {
    const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); });
    try {
        const searchInput = await waitForElement('input[type="search"], input[aria-label*="search"]');
        searchInput.value = 'r2d2';
        searchInput.dispatchEvent(new Event('input', { bubbles: true }));
        searchInput.dispatchEvent(new Event('change', { bubbles: true }));
        const searchButton = await waitForElement('button[type="submit"], button[aria-label*="search"]');
        searchButton.click();
        await waitForElement('div.search-results-container, #search-results');
    } catch (e) {
        console.error('Search script failed:', e.message);
    }
})();
```
### 2. Scenario: Clicking a "Load More" Button
* **High Detail Query:** "Click the button with the text 'Load More'. Afterward, wait for a new item with the class '.product-tile' to show up."
* **Medium Detail Query:** "Click the load more button."
* **Low Detail Query:** "Load more items."
**THE CORRECT, ROBUST JAVASCRIPT OUTPUT:**
```javascript
(async () => {
    const loadMoreButton = document.querySelector('button.load-more, [data-testid="load-more"]');
    if (loadMoreButton) {
        const initialItemCount = document.querySelectorAll('.product-tile').length;
        loadMoreButton.click();
        const waitForNewItem = (timeout = 8000) => new Promise((resolve, reject) => { const t0 = Date.now(); const check = () => { if (document.querySelectorAll('.product-tile').length > initialItemCount) return resolve(); if (Date.now() - t0 > timeout) return reject(new Error('Timeout waiting for new items to load.')); setTimeout(check, 200); }; check(); });
        await waitForNewItem();
    }
})();
```
### 3. Scenario: User Authentication (Login)
* **High Detail Query:** "Fill username with 'USER_EMAIL', password with 'USER_PASS', click login, and wait for the dashboard."
* **Medium Detail Query:** "Log in as USER_EMAIL."
* **Low Detail Query:** "Log in."
**THE CORRECT, ROBUST JAVASCRIPT OUTPUT:**
```javascript
(async () => {
    if (document.querySelector('[data-testid="logout-button"]')) {
        console.log('Already logged in.');
        return;
    }
    const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); });
    try {
        const userInput = await waitForElement('input[name*="user"], input[name*="email"]');
        userInput.value = 'USER_EMAIL';
        userInput.dispatchEvent(new Event('input', { bubbles: true }));
        userInput.dispatchEvent(new Event('change', { bubbles: true }));
        const passInput = await waitForElement('input[name*="pass"], input[type="password"]');
        passInput.value = 'USER_PASS';
        passInput.dispatchEvent(new Event('input', { bubbles: true }));
        passInput.dispatchEvent(new Event('change', { bubbles: true }));
        const submitButton = await waitForElement('button[type="submit"]');
        submitButton.click();
        await waitForElement('[data-testid="user-dashboard"], #dashboard, .account-page');
    } catch (e) {
        console.error('Login script failed:', e.message);
    }
})();
```
────────────────────────────────────────────────────────
## Final Output Mandate
1. **CODE ONLY.** Your entire response must be the script body.
2. **NO CHAT.** Do not say "Here is the script" or "This should work."
3. **NO MARKDOWN.** Do not wrap your code in ` ``` ` fences.
4. **NO COMMENTS.** Do not add comments to the final code output unless a comment is needed to explain non-obvious logic.
5. **SYNTACTICALLY PERFECT.** The script must be a single, self-contained block, immediately executable. Wrap it in `(async () => { ... })();`.
6. **UTF-8, STANDARD QUOTES.** Use `'` for string literals, not `“` or `”`.
You are an engine of automation. Now, receive the user's request and produce the optimal JavaScript."""
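
The anchoring rule in the prompt above can be checked offline. The following is a minimal sketch, assuming a made-up full page that wraps the login snippet with two other forms; `SubmitButtonCounter`, `VOID_TAGS`, and the sample markup are hypothetical names introduced only for illustration. It uses the standard-library `html.parser` to count how many elements an un-anchored `button[type="submit"]` would hit versus one scoped to `#login-widget`.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; they must not change the nesting depth.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class SubmitButtonCounter(HTMLParser):
    """Counts submit buttons on the whole page and inside one anchor element."""

    def __init__(self, anchor_id: str):
        super().__init__()
        self.anchor_id = anchor_id
        self.depth_in_anchor = 0  # > 0 while the parser is inside the anchor
        self.total = 0            # matches for button[type="submit"]
        self.anchored = 0         # matches for #<anchor_id> button[type="submit"]

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.depth_in_anchor > 0 and tag not in VOID_TAGS:
            self.depth_in_anchor += 1
        elif self.depth_in_anchor == 0 and attr_map.get("id") == self.anchor_id:
            self.depth_in_anchor = 1
        if tag == "button" and attr_map.get("type") == "submit":
            self.total += 1
            if self.depth_in_anchor > 0:
                self.anchored += 1

    def handle_endtag(self, tag):
        if self.depth_in_anchor > 0 and tag not in VOID_TAGS:
            self.depth_in_anchor -= 1


# Hypothetical full page: the login snippet plus a search form and a newsletter form.
PAGE = """
<header><form action="/search"><input name="q"><button type="submit">Search</button></form></header>
<div class="user-auth-module" id="login-widget">
  <form action="/login">
    <input name="email" type="email">
    <input name="password" type="password">
    <button type="submit">Sign In</button>
  </form>
</div>
<footer><form action="/newsletter"><button type="submit">Subscribe</button></form></footer>
"""

counter = SubmitButtonCounter("login-widget")
counter.feed(PAGE)
print(counter.total, counter.anchored)
```

On this sample page the un-anchored selector is ambiguous across three buttons, while the anchored one resolves to a single element.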

@@ -0,0 +1,35 @@
"""
C4A-Script: A domain-specific language for web automation in Crawl4AI
"""
from .c4a_compile import C4ACompiler, compile, validate, compile_file
from .c4a_result import (
CompilationResult,
ValidationResult,
ErrorDetail,
WarningDetail,
ErrorType,
Severity,
Suggestion
)
__all__ = [
# Main compiler
"C4ACompiler",
# Convenience functions
"compile",
"validate",
"compile_file",
# Result types
"CompilationResult",
"ValidationResult",
"ErrorDetail",
"WarningDetail",
"Suggestion",
# Enums
"ErrorType",
"Severity"
]
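
The exports above pair convenience functions with result types, which suggests callers inspect a returned object instead of catching exceptions. The following is a minimal sketch of that result pattern, not the library's implementation: `MiniResult` and `mini_compile` are hypothetical stand-ins, and the toy compiler understands only `WAIT <seconds>` lines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MiniResult:
    """Hypothetical stand-in for CompilationResult, reduced to three fields."""
    success: bool
    js_code: Optional[List[str]] = None
    errors: List[str] = field(default_factory=list)


def mini_compile(script: str) -> MiniResult:
    """Toy compiler: handles only `WAIT <seconds>` and never raises."""
    statements = []
    for number, line in enumerate(script.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        try:
            if parts[0] != "WAIT" or len(parts) != 2:
                raise ValueError(f"Line {number}: unsupported command {line!r}")
            seconds = float(parts[1])
        except ValueError as exc:
            # Turn the exception into data instead of letting it propagate.
            return MiniResult(success=False, errors=[str(exc)])
        statements.append(f"await new Promise(r=>setTimeout(r,{seconds}*1000));")
    return MiniResult(success=True, js_code=statements)


ok = mini_compile("WAIT 2")
bad = mini_compile("CLICK button")
print(ok.success, ok.js_code)
print(bad.success, bad.errors)
```

The caller branches on `success` and reads `errors` when it is false, so no `try`/`except` is needed at the call site.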

@@ -0,0 +1,398 @@
"""
Clean C4A-Script API with Result pattern
compile(), validate(), and compile_file() never raise; they always return result objects
"""
from __future__ import annotations
import pathlib
import re
from typing import Union, List, Optional
# Prompts for the two script-generation modes (C4A-Script and raw JavaScript).
from ..prompts import GENERATE_JS_SCRIPT_PROMPT, GENERATE_SCRIPT_PROMPT
from .c4a_result import (
CompilationResult, ValidationResult, ErrorDetail, WarningDetail,
ErrorType, Severity, Suggestion
)
from .c4ai_script import Compiler
from lark.exceptions import UnexpectedToken, UnexpectedCharacters, VisitError
from ..async_configs import LLMConfig
from ..utils import perform_completion_with_backoff
class C4ACompiler:
"""Main compiler with result-based API"""
# Error code mapping
ERROR_CODES = {
"missing_then": "E001",
"missing_paren": "E002",
"missing_comma": "E003",
"missing_endproc": "E004",
"undefined_proc": "E005",
"missing_backticks": "E006",
"invalid_command": "E007",
"syntax_error": "E999"
}
@classmethod
def compile(cls, script: Union[str, List[str]], root: Optional[pathlib.Path] = None) -> CompilationResult:
"""
Compile C4A-Script to JavaScript
Args:
script: C4A-Script as string or list of lines
root: Root directory for includes
Returns:
CompilationResult with success status and JS code or errors
"""
# Normalize input
if isinstance(script, list):
script_text = '\n'.join(script)
script_lines = script
else:
script_text = script
script_lines = script.split('\n')
try:
# Try compilation
compiler = Compiler(root)
js_code = compiler.compile(script_text)
# Success!
result = CompilationResult(
success=True,
js_code=js_code,
metadata={
"lineCount": len(script_lines),
"statementCount": len(js_code)
}
)
# Add any warnings (future feature)
# result.warnings = cls._check_warnings(script_text)
return result
except Exception as e:
# Convert exception to ErrorDetail
error = cls._exception_to_error(e, script_lines)
return CompilationResult(
success=False,
errors=[error],
metadata={
"lineCount": len(script_lines)
}
)
@classmethod
def validate(cls, script: Union[str, List[str]]) -> ValidationResult:
"""
Validate script syntax without generating code
Args:
script: C4A-Script to validate
Returns:
ValidationResult with validity status and any errors
"""
result = cls.compile(script)
return ValidationResult(
valid=result.success,
errors=result.errors,
warnings=result.warnings
)
@classmethod
def compile_file(cls, path: Union[str, pathlib.Path]) -> CompilationResult:
"""
Compile a C4A-Script file
Args:
path: Path to the file
Returns:
CompilationResult
"""
path = pathlib.Path(path)
if not path.exists():
error = ErrorDetail(
type=ErrorType.RUNTIME,
code="E100",
severity=Severity.ERROR,
message=f"File not found: {path}",
line=0,
column=0,
source_line=""
)
return CompilationResult(success=False, errors=[error])
try:
script = path.read_text()
return cls.compile(script, root=path.parent)
except Exception as e:
error = ErrorDetail(
type=ErrorType.RUNTIME,
code="E101",
severity=Severity.ERROR,
message=f"Error reading file: {str(e)}",
line=0,
column=0,
source_line=""
)
return CompilationResult(success=False, errors=[error])
@classmethod
def _exception_to_error(cls, exc: Exception, script_lines: List[str]) -> ErrorDetail:
"""Convert an exception to ErrorDetail"""
if isinstance(exc, UnexpectedToken):
return cls._handle_unexpected_token(exc, script_lines)
elif isinstance(exc, UnexpectedCharacters):
return cls._handle_unexpected_chars(exc, script_lines)
elif isinstance(exc, ValueError):
return cls._handle_value_error(exc, script_lines)
else:
# Generic error
return ErrorDetail(
type=ErrorType.SYNTAX,
code=cls.ERROR_CODES["syntax_error"],
severity=Severity.ERROR,
message=str(exc),
line=1,
column=1,
source_line=script_lines[0] if script_lines else ""
)
@classmethod
def _handle_unexpected_token(cls, exc: UnexpectedToken, script_lines: List[str]) -> ErrorDetail:
"""Handle UnexpectedToken errors"""
line = exc.line
column = exc.column
# Get context lines
source_line = script_lines[line - 1] if 0 < line <= len(script_lines) else ""
line_before = script_lines[line - 2] if line > 1 and line <= len(script_lines) + 1 else None
line_after = script_lines[line] if 0 < line < len(script_lines) else None
# Determine error type and suggestions
if exc.token.type == 'CLICK' and 'THEN' in str(exc.expected):
code = cls.ERROR_CODES["missing_then"]
message = "Missing 'THEN' keyword after IF condition"
suggestions = [
Suggestion(
"Add 'THEN' after the condition",
source_line.replace("CLICK", "THEN CLICK") if source_line else None
)
]
elif exc.token.type == '$END':
code = cls.ERROR_CODES["missing_endproc"]
message = "Unexpected end of script"
suggestions = [
Suggestion("Check for missing ENDPROC"),
Suggestion("Ensure all procedures are properly closed")
]
elif 'RPAR' in str(exc.expected):
code = cls.ERROR_CODES["missing_paren"]
message = "Missing closing parenthesis ')'"
suggestions = [
Suggestion("Add closing parenthesis at the end of the condition")
]
elif 'COMMA' in str(exc.expected):
code = cls.ERROR_CODES["missing_comma"]
message = "Missing comma ',' in command"
suggestions = [
Suggestion("Add comma between arguments")
]
else:
# Check if this might be missing backticks
if exc.token.type == 'NAME' and 'BACKTICK_STRING' in str(exc.expected):
code = cls.ERROR_CODES["missing_backticks"]
message = "Selector must be wrapped in backticks"
suggestions = [
Suggestion(
"Wrap the selector in backticks",
f"`{exc.token.value}`"
)
]
else:
code = cls.ERROR_CODES["syntax_error"]
message = f"Unexpected '{exc.token.value}'"
if exc.expected:
expected_list = [str(e) for e in exc.expected if not str(e).startswith('_')][:3]
if expected_list:
message += f". Expected: {', '.join(expected_list)}"
suggestions = []
return ErrorDetail(
type=ErrorType.SYNTAX,
code=code,
severity=Severity.ERROR,
message=message,
line=line,
column=column,
source_line=source_line,
line_before=line_before,
line_after=line_after,
suggestions=suggestions
)
@classmethod
def _handle_unexpected_chars(cls, exc: UnexpectedCharacters, script_lines: List[str]) -> ErrorDetail:
"""Handle UnexpectedCharacters errors"""
line = exc.line
column = exc.column
source_line = script_lines[line - 1] if 0 < line <= len(script_lines) else ""
# Check for missing backticks
if "CLICK" in source_line and column > source_line.find("CLICK"):
code = cls.ERROR_CODES["missing_backticks"]
message = "Selector must be wrapped in backticks"
suggestions = [
Suggestion(
"Wrap the selector in backticks",
re.sub(r'CLICK\s+([^\s]+)', r'CLICK `\1`', source_line)
)
]
else:
code = cls.ERROR_CODES["syntax_error"]
message = f"Invalid character at position {column}"
suggestions = []
return ErrorDetail(
type=ErrorType.SYNTAX,
code=code,
severity=Severity.ERROR,
message=message,
line=line,
column=column,
source_line=source_line,
suggestions=suggestions
)
@classmethod
def _handle_value_error(cls, exc: ValueError, script_lines: List[str]) -> ErrorDetail:
"""Handle ValueError (runtime errors)"""
message = str(exc)
# Check for undefined procedure
if "Unknown procedure" in message:
proc_match = re.search(r"'([^']+)'", message)
if proc_match:
proc_name = proc_match.group(1)
# Find the line with the procedure call
for i, line in enumerate(script_lines):
if proc_name in line and not line.strip().startswith('PROC'):
return ErrorDetail(
type=ErrorType.RUNTIME,
code=cls.ERROR_CODES["undefined_proc"],
severity=Severity.ERROR,
message=f"Undefined procedure '{proc_name}'",
line=i + 1,
column=line.find(proc_name) + 1,
source_line=line,
suggestions=[
Suggestion(
"Define the procedure before using it",
f"PROC {proc_name}\n # commands here\nENDPROC"
)
]
)
# Generic runtime error
return ErrorDetail(
type=ErrorType.RUNTIME,
code="E999",
severity=Severity.ERROR,
message=message,
line=1,
column=1,
source_line=script_lines[0] if script_lines else ""
)
@staticmethod
def generate_script(
html: str,
query: str | None = None,
mode: str = "c4a",
llm_config: LLMConfig | None = None,
**completion_kwargs,
) -> str:
"""
One-shot helper that calls the LLM exactly once to convert a
natural-language goal + HTML snippet into either:
1. raw JavaScript (`mode="js"`)
2. C4A-Script, the Crawl4AI DSL (`mode="c4a"`)
Markdown code fences are stripped from the returned string; a
RuntimeError is raised if nothing remains after stripping.
"""
if llm_config is None:
llm_config = LLMConfig() # falls back to env vars / defaults
# Build the user chunk
user_prompt = "\n".join(
[
"## GOAL",
"<<goal>>",
(query or "Prepare the page for crawling."),
"<</goal>>",
"",
"## HTML",
"<<html>>",
html[:100000], # guardrail against token blast
"<</html>>",
"",
"## MODE",
mode,
]
)
# Call the LLM with retry/back-off logic
full_prompt = f"{GENERATE_SCRIPT_PROMPT}\n\n{user_prompt}" if mode == "c4a" else f"{GENERATE_JS_SCRIPT_PROMPT}\n\n{user_prompt}"
response = perform_completion_with_backoff(
provider=llm_config.provider,
prompt_with_variables=full_prompt,
api_token=llm_config.api_token,
json_response=False,
base_url=getattr(llm_config, 'base_url', None),
**completion_kwargs,
)
# Extract content from the response
raw_response = response.choices[0].message.content.strip()
# Strip accidental markdown fences (```js … ```)
clean = re.sub(r"^```(?:[a-zA-Z0-9_-]+)?\s*|```$", "", raw_response, flags=re.MULTILINE).strip()
if not clean:
raise RuntimeError("LLM returned empty script.")
return clean
# Convenience functions for direct use
def compile(script: Union[str, List[str]], root: Optional[pathlib.Path] = None) -> CompilationResult:
"""Compile C4A-Script to JavaScript"""
return C4ACompiler.compile(script, root)
def validate(script: Union[str, List[str]]) -> ValidationResult:
"""Validate C4A-Script syntax"""
return C4ACompiler.validate(script)
def compile_file(path: Union[str, pathlib.Path]) -> CompilationResult:
"""Compile C4A-Script file"""
return C4ACompiler.compile_file(path)
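
The fence-stripping step in `generate_script` can be exercised on its own. The following is a minimal sketch that reuses the same regular expression; the `strip_fences` wrapper and `FENCE_RE` name are hypothetical, introduced only to make the behavior testable.

```python
import re

# Same pattern as in generate_script: an opening fence with an optional
# language tag at the start of a line, or a closing fence at the end of a line.
FENCE_RE = re.compile(r"^```(?:[a-zA-Z0-9_-]+)?\s*|```$", flags=re.MULTILINE)


def strip_fences(raw: str) -> str:
    """Remove markdown code fences that an LLM may wrap around a script."""
    return FENCE_RE.sub("", raw.strip()).strip()


fenced = "```js\nconsole.log(1);\n```"
plain = "WAIT 2\nCLICK `#go`"

print(strip_fences(fenced))
print(strip_fences(plain))
```

Single backticks, as used for C4A-Script selectors, are left alone. One caveat of this pattern: because it runs in multiline mode, any line inside the script that begins or ends with three backticks would also lose them.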

@@ -0,0 +1,219 @@
"""
Result classes for C4A-Script compilation
Clean API design with no exceptions
"""
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Dict, Any, Optional
import json
class ErrorType(Enum):
SYNTAX = "syntax"
SEMANTIC = "semantic"
RUNTIME = "runtime"
class Severity(Enum):
ERROR = "error"
WARNING = "warning"
INFO = "info"
@dataclass
class Suggestion:
"""A suggestion for fixing an error"""
message: str
fix: Optional[str] = None
def to_dict(self) -> dict:
return {
"message": self.message,
"fix": self.fix
}
@dataclass
class ErrorDetail:
"""Detailed information about a compilation error"""
# Core info
type: ErrorType
code: str # E001, E002, etc.
severity: Severity
message: str
# Location
line: int
column: int
# Context
source_line: str
# Optional fields with defaults
end_line: Optional[int] = None
end_column: Optional[int] = None
line_before: Optional[str] = None
line_after: Optional[str] = None
# Help
suggestions: List[Suggestion] = field(default_factory=list)
documentation_url: Optional[str] = None
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization"""
return {
"type": self.type.value,
"code": self.code,
"severity": self.severity.value,
"message": self.message,
"location": {
"line": self.line,
"column": self.column,
"endLine": self.end_line,
"endColumn": self.end_column
},
"context": {
"sourceLine": self.source_line,
"lineBefore": self.line_before,
"lineAfter": self.line_after,
"marker": {
"start": self.column - 1,
"length": (self.end_column - self.column) if self.end_column else 1
}
},
"suggestions": [s.to_dict() for s in self.suggestions],
"documentationUrl": self.documentation_url
}
def to_json(self) -> str:
"""Convert to JSON string"""
return json.dumps(self.to_dict(), indent=2)
@property
def formatted_message(self) -> str:
"""Returns the nice text format for terminals"""
lines = []
lines.append(f"\n{'='*60}")
lines.append(f"{self.type.value.title()} Error [{self.code}]")
lines.append(f"{'='*60}")
lines.append(f"Location: Line {self.line}, Column {self.column}")
lines.append(f"Error: {self.message}")
if self.source_line:
marker = " " * (self.column - 1) + "^"
if self.end_column:
marker += "~" * (self.end_column - self.column - 1)
lines.append("\nCode:")
if self.line_before:
lines.append(f" {self.line - 1: >3} | {self.line_before}")
lines.append(f" {self.line: >3} | {self.source_line}")
lines.append(f" | {marker}")
if self.line_after:
lines.append(f" {self.line + 1: >3} | {self.line_after}")
if self.suggestions:
lines.append("\nSuggestions:")
for i, suggestion in enumerate(self.suggestions, 1):
lines.append(f" {i}. {suggestion.message}")
if suggestion.fix:
lines.append(f" Fix: {suggestion.fix}")
lines.append("="*60)
return "\n".join(lines)
@property
def simple_message(self) -> str:
"""Returns just the error message without formatting"""
return f"Line {self.line}: {self.message}"
@dataclass
class WarningDetail:
"""Information about a compilation warning"""
code: str
message: str
line: int
column: int
def to_dict(self) -> dict:
return {
"code": self.code,
"message": self.message,
"line": self.line,
"column": self.column
}
@dataclass
class CompilationResult:
"""Result of C4A-Script compilation"""
success: bool
js_code: Optional[List[str]] = None
errors: List[ErrorDetail] = field(default_factory=list)
warnings: List[WarningDetail] = field(default_factory=list)
metadata: Dict[str, Any] = field(default_factory=dict)
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization"""
return {
"success": self.success,
"jsCode": self.js_code,
"errors": [e.to_dict() for e in self.errors],
"warnings": [w.to_dict() for w in self.warnings],
"metadata": self.metadata
}
def to_json(self) -> str:
"""Convert to JSON string"""
return json.dumps(self.to_dict(), indent=2)
@property
def has_errors(self) -> bool:
"""Check if there are any errors"""
return len(self.errors) > 0
@property
def has_warnings(self) -> bool:
"""Check if there are any warnings"""
return len(self.warnings) > 0
@property
def first_error(self) -> Optional[ErrorDetail]:
"""Get the first error if any"""
return self.errors[0] if self.errors else None
def __str__(self) -> str:
"""String representation for debugging"""
if self.success:
msg = "✓ Compilation successful"
if self.js_code:
msg += f" - {len(self.js_code)} statements generated"
if self.warnings:
msg += f" ({len(self.warnings)} warnings)"
return msg
else:
return f"✗ Compilation failed - {len(self.errors)} error(s)"
@dataclass
class ValidationResult:
"""Result of script validation"""
valid: bool
errors: List[ErrorDetail] = field(default_factory=list)
warnings: List[WarningDetail] = field(default_factory=list)
def to_dict(self) -> dict:
return {
"valid": self.valid,
"errors": [e.to_dict() for e in self.errors],
"warnings": [w.to_dict() for w in self.warnings]
}
def to_json(self) -> str:
return json.dumps(self.to_dict(), indent=2)
@property
def first_error(self) -> Optional[ErrorDetail]:
return self.errors[0] if self.errors else None
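
The caret marker that `ErrorDetail.formatted_message` prints under the offending line follows a simple rule. The following is a minimal sketch of that rule in isolation; the `caret_marker` function name and the sample line are hypothetical.

```python
from typing import Optional


def caret_marker(column: int, end_column: Optional[int] = None) -> str:
    """Build the underline printed beneath the offending source line."""
    marker = " " * (column - 1) + "^"  # columns are 1-based
    if end_column:
        # Extend with tildes up to, but not including, end_column.
        marker += "~" * (end_column - column - 1)
    return marker


source_line = "CLICK button"
print(source_line)
print(caret_marker(7, 13))
```

With `column=7` and `end_column=13` the marker spans the six characters of `button`, which matches the `length` of `end_column - column` reported by `to_dict`.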

@@ -0,0 +1,690 @@
"""
2025-06-03
By UncleCode:
C4A-Script Language Documentation
Feeds Crawl4AI via CrawlerRunConfig(js_code=[ ... ]); no core modifications required.
"""
from __future__ import annotations
import pathlib, re, sys, textwrap
from dataclasses import dataclass
from typing import Any, Dict, List, Union
from lark import Lark, Transformer, v_args
from lark.exceptions import UnexpectedToken, UnexpectedCharacters, VisitError
# --------------------------------------------------------------------------- #
# Custom Error Classes
# --------------------------------------------------------------------------- #
class C4AScriptError(Exception):
"""Custom error class for C4A-Script compilation errors"""
def __init__(self, message: str, line: int = None, column: int = None,
error_type: str = "Syntax Error", details: str = None):
self.message = message
self.line = line
self.column = column
self.error_type = error_type
self.details = details
super().__init__(self._format_message())
def _format_message(self) -> str:
"""Format a clear error message"""
lines = [f"\n{'='*60}"]
lines.append(f"C4A-Script {self.error_type}")
lines.append(f"{'='*60}")
if self.line:
lines.append(f"Location: Line {self.line}" + (f", Column {self.column}" if self.column else ""))
lines.append(f"Error: {self.message}")
if self.details:
lines.append(f"\nDetails: {self.details}")
lines.append("="*60)
return "\n".join(lines)
@classmethod
def from_exception(cls, exc: Exception, script: Union[str, List[str]]) -> 'C4AScriptError':
"""Create C4AScriptError from another exception"""
script_text = script if isinstance(script, str) else '\n'.join(script)
script_lines = script_text.split('\n')
if isinstance(exc, UnexpectedToken):
# Extract line and column from UnexpectedToken
line = exc.line
column = exc.column
# Get the problematic line
if 0 < line <= len(script_lines):
problem_line = script_lines[line - 1]
marker = " " * (column - 1) + "^"
details = f"\nCode:\n {problem_line}\n {marker}\n"
# Improve error message based on context
if exc.token.type == 'CLICK' and 'THEN' in str(exc.expected):
message = "Missing 'THEN' keyword after IF condition"
elif exc.token.type == '$END':
message = "Unexpected end of script. Check for missing ENDPROC or incomplete commands"
elif 'RPAR' in str(exc.expected):
message = "Missing closing parenthesis ')'"
elif 'COMMA' in str(exc.expected):
message = "Missing comma ',' in command"
else:
message = f"Unexpected '{exc.token}'"
if exc.expected:
expected_list = [str(e) for e in exc.expected if not e.startswith('_')]
if expected_list:
message += f". Expected: {', '.join(expected_list[:3])}"
details += f"Token: {exc.token.type} ('{exc.token.value}')"
else:
message = str(exc)
details = None
return cls(message, line, column, "Syntax Error", details)
elif isinstance(exc, UnexpectedCharacters):
# Extract line and column
line = exc.line
column = exc.column
if 0 < line <= len(script_lines):
problem_line = script_lines[line - 1]
marker = " " * (column - 1) + "^"
details = f"\nCode:\n {problem_line}\n {marker}\n"
message = f"Invalid character or unexpected text at position {column}"
else:
message = str(exc)
details = None
return cls(message, line, column, "Syntax Error", details)
elif isinstance(exc, ValueError):
# Handle runtime errors like undefined procedures
message = str(exc)
# Try to find which line caused the error
if "Unknown procedure" in message:
proc_name = re.search(r"'([^']+)'", message)
if proc_name:
proc_name = proc_name.group(1)
for i, line in enumerate(script_lines, 1):
if proc_name in line and not line.strip().startswith('PROC'):
details = f"\nCode:\n {line.strip()}\n\nMake sure the procedure '{proc_name}' is defined with PROC...ENDPROC"
return cls(f"Undefined procedure '{proc_name}'", i, None, "Runtime Error", details)
return cls(message, None, None, "Runtime Error", None)
else:
# Generic error
return cls(str(exc), None, None, "Compilation Error", None)
# --------------------------------------------------------------------------- #
# 1. Grammar
# --------------------------------------------------------------------------- #
GRAMMAR = r"""
start : line*
?line : command | proc_def | include | comment
command : wait | nav | click_cmd | double_click | right_click | move | drag | scroll
| type | clear | set_input | press | key_down | key_up
| eval_cmd | setvar | proc_call | if_cmd | repeat_cmd
wait : "WAIT" (ESCAPED_STRING|BACKTICK_STRING|NUMBER) NUMBER? -> wait_cmd
nav : "GO" URL -> go
| "RELOAD" -> reload
| "BACK" -> back
| "FORWARD" -> forward
click_cmd : "CLICK" (BACKTICK_STRING|NUMBER NUMBER) -> click
double_click : "DOUBLE_CLICK" (BACKTICK_STRING|NUMBER NUMBER) -> double_click
right_click : "RIGHT_CLICK" (BACKTICK_STRING|NUMBER NUMBER) -> right_click
move : "MOVE" coords -> move
drag : "DRAG" coords coords -> drag
scroll : "SCROLL" DIR NUMBER? -> scroll
type : "TYPE" (ESCAPED_STRING | NAME) -> type
clear : "CLEAR" BACKTICK_STRING -> clear
set_input : "SET" BACKTICK_STRING (ESCAPED_STRING | BACKTICK_STRING | NAME) -> set_input
press : "PRESS" WORD -> press
key_down : "KEY_DOWN" WORD -> key_down
key_up : "KEY_UP" WORD -> key_up
eval_cmd : "EVAL" BACKTICK_STRING -> eval_cmd
setvar : "SETVAR" NAME "=" value -> setvar
proc_call : NAME -> proc_call
proc_def : "PROC" NAME line* "ENDPROC" -> proc_def
include : "USE" ESCAPED_STRING -> include
comment : /#.*/ -> comment
if_cmd : "IF" "(" condition ")" "THEN" command ("ELSE" command)? -> if_cmd
repeat_cmd : "REPEAT" "(" command "," repeat_count ")" -> repeat_cmd
condition : not_cond | exists_cond | js_cond
not_cond : "NOT" condition -> not_cond
exists_cond : "EXISTS" BACKTICK_STRING -> exists_cond
js_cond : BACKTICK_STRING -> js_cond
repeat_count : NUMBER | BACKTICK_STRING
coords : NUMBER NUMBER
value : ESCAPED_STRING | BACKTICK_STRING | NUMBER
DIR : /(UP|DOWN|LEFT|RIGHT)/i
REST : /[^\n]+/
URL : /(http|https):\/\/[^\s]+/
NAME : /\$?[A-Za-z_][A-Za-z0-9_]*/
WORD : /[A-Za-z0-9+]+/
BACKTICK_STRING : /`[^`]*`/
%import common.NUMBER
%import common.ESCAPED_STRING
%import common.WS_INLINE
%import common.NEWLINE
%ignore WS_INLINE
%ignore NEWLINE
"""
# --------------------------------------------------------------------------- #
# 2. IR dataclasses
# --------------------------------------------------------------------------- #
@dataclass
class Cmd:
op: str
args: List[Any]
@dataclass
class Proc:
name: str
body: List[Cmd]
# --------------------------------------------------------------------------- #
# 3. AST → IR
# --------------------------------------------------------------------------- #
@v_args(inline=True)
class ASTBuilder(Transformer):
# helpers
def _strip(self, s):
if s.startswith('"') and s.endswith('"'):
return s[1:-1]
elif s.startswith('`') and s.endswith('`'):
return s[1:-1]
return s
def start(self,*i): return list(i)
def line(self,i): return i
def command(self,i): return i
# WAIT
def wait_cmd(self, rest, timeout=None):
rest_str = str(rest)
# Check if it's a number (including floats)
try:
num_val = float(rest_str)
payload = (num_val, "seconds")
except ValueError:
if rest_str.startswith('"') and rest_str.endswith('"'):
payload = (self._strip(rest_str), "text")
elif rest_str.startswith('`') and rest_str.endswith('`'):
payload = (self._strip(rest_str), "selector")
else:
payload = (rest_str, "selector")
return Cmd("WAIT", [payload, int(timeout) if timeout else None])
# NAV
def go(self,u): return Cmd("GO",[str(u)])
def reload(self): return Cmd("RELOAD",[])
def back(self): return Cmd("BACK",[])
def forward(self): return Cmd("FORWARD",[])
# CLICK, DOUBLE_CLICK, RIGHT_CLICK
def click(self, *args):
return self._handle_click("CLICK", args)
def double_click(self, *args):
return self._handle_click("DBLCLICK", args)
def right_click(self, *args):
return self._handle_click("RIGHTCLICK", args)
def _handle_click(self, op, args):
if len(args) == 1:
# Single argument - backtick string
target = self._strip(str(args[0]))
return Cmd(op, [("selector", target)])
else:
# Two arguments - coordinates
x, y = args
return Cmd(op, [("coords", int(x), int(y))])
# MOVE / DRAG / SCROLL
def coords(self,x,y): return ("coords",int(x),int(y))
def move(self,c): return Cmd("MOVE",[c])
def drag(self,c1,c2): return Cmd("DRAG",[c1,c2])
def scroll(self,dir_tok,amt=None):
return Cmd("SCROLL",[dir_tok.upper(), int(amt) if amt else 500])
# KEYS
def type(self,tok): return Cmd("TYPE",[self._strip(str(tok))])
def clear(self,sel): return Cmd("CLEAR",[self._strip(str(sel))])
def set_input(self,sel,val): return Cmd("SET",[self._strip(str(sel)), self._strip(str(val))])
def press(self,w): return Cmd("PRESS",[str(w)])
def key_down(self,w): return Cmd("KEYDOWN",[str(w)])
def key_up(self,w): return Cmd("KEYUP",[str(w)])
# FLOW
def eval_cmd(self,txt): return Cmd("EVAL",[self._strip(str(txt))])
def setvar(self,n,v):
# v might be a Token or a Tree, extract value properly
if hasattr(v, 'value'):
value = v.value
elif hasattr(v, 'children') and len(v.children) > 0:
value = v.children[0].value
else:
value = str(v)
return Cmd("SETVAR",[str(n), self._strip(value)])
def proc_call(self,n): return Cmd("CALL",[str(n)])
def proc_def(self,n,*body): return Proc(str(n),[b for b in body if isinstance(b,Cmd)])
def include(self,p): return Cmd("INCLUDE",[self._strip(p)])
def comment(self,*_): return Cmd("NOP",[])
# IF-THEN-ELSE and EXISTS
def if_cmd(self, condition, then_cmd, else_cmd=None):
return Cmd("IF", [condition, then_cmd, else_cmd])
def condition(self, cond):
return cond
def not_cond(self, cond):
return ("NOT", cond)
def exists_cond(self, selector):
return ("EXISTS", self._strip(str(selector)))
def js_cond(self, expr):
return ("JS", self._strip(str(expr)))
# REPEAT
def repeat_cmd(self, cmd, count):
return Cmd("REPEAT", [cmd, count])
def repeat_count(self, value):
return str(value)
# --------------------------------------------------------------------------- #
# 4. Compiler
# --------------------------------------------------------------------------- #
class Compiler:
def __init__(self, root: pathlib.Path|None=None):
self.parser = Lark(GRAMMAR,start="start",parser="lalr")
self.root = pathlib.Path(root or ".").resolve()
self.vars: Dict[str,Any] = {}
self.procs: Dict[str,Proc]= {}
def compile(self, text: Union[str, List[str]]) -> List[str]:
# Handle list input by joining with newlines
if isinstance(text, list):
text = '\n'.join(text)
ir = self._parse_with_includes(text)
ir = self._collect_procs(ir)
ir = self._inline_calls(ir)
ir = self._apply_set_vars(ir)
return [self._emit_js(c) for c in ir if isinstance(c,Cmd) and c.op!="NOP"]
# passes
def _parse_with_includes(self,txt,seen=None):
seen=seen or set()
cmds=ASTBuilder().transform(self.parser.parse(txt))
out=[]
for c in cmds:
if isinstance(c,Cmd) and c.op=="INCLUDE":
p=(self.root/c.args[0]).resolve()
if p in seen: raise ValueError(f"Circular include {p}")
seen.add(p); out+=self._parse_with_includes(p.read_text(),seen)
else: out.append(c)
return out
def _collect_procs(self,ir):
out=[]
for i in ir:
if isinstance(i,Proc): self.procs[i.name]=i
else: out.append(i)
return out
def _inline_calls(self,ir):
out=[]
for c in ir:
if isinstance(c,Cmd) and c.op=="CALL":
if c.args[0] not in self.procs:
raise ValueError(f"Unknown procedure {c.args[0]!r}")
out+=self._inline_calls(self.procs[c.args[0]].body)
else: out.append(c)
return out
def _apply_set_vars(self,ir):
def sub(s): return re.sub(r"\$(\w+)",lambda m:str(self.vars.get(m.group(1),m.group(0))) ,s) if isinstance(s,str) else s
out=[]
for c in ir:
if isinstance(c,Cmd):
if c.op=="SETVAR":
# Store variable
self.vars[c.args[0].lstrip('$')]=c.args[1]
else:
# Apply variable substitution to commands that use them
if c.op in("TYPE","EVAL","SET"): c.args=[sub(a) for a in c.args]
out.append(c)
return out
# JS emitter
def _emit_js(self, cmd: Cmd) -> str:
op, a = cmd.op, cmd.args
if op == "GO": return f"window.location.href = '{a[0]}';"
if op == "RELOAD": return "window.location.reload();"
if op == "BACK": return "window.history.back();"
if op == "FORWARD": return "window.history.forward();"
if op == "WAIT":
arg, kind = a[0]
timeout = a[1] or 10
if kind == "seconds":
return f"await new Promise(r=>setTimeout(r,{arg}*1000));"
if kind == "selector":
sel = arg.replace("\\","\\\\").replace("'","\\'")
return textwrap.dedent(f"""
await new Promise((res,rej)=>{{
const max = {timeout*1000}, t0 = performance.now();
const id = setInterval(()=>{{
if(document.querySelector('{sel}')){{clearInterval(id);res();}}
else if(performance.now()-t0>max){{clearInterval(id);rej('WAIT selector timeout');}}
}},100);
}});
""").strip()
if kind == "text":
txt = arg.replace('`', '\\`')
return textwrap.dedent(f"""
await new Promise((res,rej)=>{{
const max={timeout*1000},t0=performance.now();
const id=setInterval(()=>{{
if(document.body.innerText.includes(`{txt}`)){{clearInterval(id);res();}}
else if(performance.now()-t0>max){{clearInterval(id);rej('WAIT text timeout');}}
}},100);
}});
""").strip()
# click-style helpers
def _js_click(sel, evt="click", button=0, detail=1):
sel = sel.replace("'", "\\'")
return textwrap.dedent(f"""
(()=>{{
const el=document.querySelector('{sel}');
if(el){{
el.focus&&el.focus();
el.dispatchEvent(new MouseEvent('{evt}',{{bubbles:true,button:{button},detail:{detail}}}));
}}
}})();
""").strip()
def _js_click_xy(x, y, evt="click", button=0, detail=1):
return textwrap.dedent(f"""
(()=>{{
const el=document.elementFromPoint({x},{y});
if(el){{
el.focus&&el.focus();
el.dispatchEvent(new MouseEvent('{evt}',{{bubbles:true,button:{button},detail:{detail}}}));
}}
}})();
""").strip()
if op in ("CLICK", "DBLCLICK", "RIGHTCLICK"):
evt = {"CLICK":"click","DBLCLICK":"dblclick","RIGHTCLICK":"contextmenu"}[op]
btn = 2 if op=="RIGHTCLICK" else 0
det = 2 if op=="DBLCLICK" else 1
kind,*rest = a[0]
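# Pass the event type, button, and detail to both helpers so that
# coordinate-based DOUBLE_CLICK and RIGHT_CLICK do not fall back to a plain click.
return _js_click_xy(*rest, evt, btn, det) if kind=="coords" else _js_click(rest[0],evt,btn,det)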
if op == "MOVE":
_, x, y = a[0]
return textwrap.dedent(f"""
document.dispatchEvent(new MouseEvent('mousemove',{{clientX:{x},clientY:{y},bubbles:true}}));
""").strip()
if op == "DRAG":
(_, x1, y1), (_, x2, y2) = a
return textwrap.dedent(f"""
(()=>{{
const s=document.elementFromPoint({x1},{y1});
if(!s) return;
s.dispatchEvent(new MouseEvent('mousedown',{{bubbles:true,clientX:{x1},clientY:{y1}}}));
document.dispatchEvent(new MouseEvent('mousemove',{{bubbles:true,clientX:{x2},clientY:{y2}}}));
document.dispatchEvent(new MouseEvent('mouseup', {{bubbles:true,clientX:{x2},clientY:{y2}}}));
}})();
""").strip()
if op == "SCROLL":
dir_, amt = a
dx, dy = {"UP":(0,-amt),"DOWN":(0,amt),"LEFT":(-amt,0),"RIGHT":(amt,0)}[dir_]
return f"window.scrollBy({dx},{dy});"
if op == "TYPE":
txt = a[0].replace("'", "\\'")
return textwrap.dedent(f"""
(()=>{{
const el=document.activeElement;
if(el){{
el.value += '{txt}';
el.dispatchEvent(new Event('input',{{bubbles:true}}));
}}
}})();
""").strip()
if op == "CLEAR":
sel = a[0].replace("'", "\\'")
return textwrap.dedent(f"""
(()=>{{
const el=document.querySelector('{sel}');
if(el && 'value' in el){{
el.value = '';
el.dispatchEvent(new Event('input',{{bubbles:true}}));
el.dispatchEvent(new Event('change',{{bubbles:true}}));
}}
}})();
""").strip()
if op == "SET" and len(a) == 2:
# This is SET for input fields (SET `#field` "value")
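# Escape backslashes first, then single quotes, so both strings survive inside '...' JS literals
sel = a[0].replace("\\", "\\\\").replace("'", "\\'")
val = a[1].replace("\\", "\\\\").replace("'", "\\'")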
return textwrap.dedent(f"""
(()=>{{
const el=document.querySelector('{sel}');
if(el && 'value' in el){{
el.value = '';
el.focus&&el.focus();
el.value = '{val}';
el.dispatchEvent(new Event('input',{{bubbles:true}}));
el.dispatchEvent(new Event('change',{{bubbles:true}}));
}}
}})();
""").strip()
if op in ("PRESS","KEYDOWN","KEYUP"):
key = a[0]
evs = {"PRESS":("keydown","keyup"),"KEYDOWN":("keydown",),"KEYUP":("keyup",)}[op]
return ";".join([f"document.dispatchEvent(new KeyboardEvent('{e}',{{key:'{key}',bubbles:true}}))" for e in evs]) + ";"
if op == "EVAL":
return textwrap.dedent(f"""
(()=>{{
try {{
{a[0]};
}} catch (e) {{
console.error('C4A-Script EVAL error:', e);
}}
}})();
""").strip()
if op == "IF":
condition, then_cmd, else_cmd = a
# Generate condition JavaScript
js_condition = self._emit_condition(condition)
# Generate commands - handle both regular commands and procedure calls
then_js = self._handle_cmd_or_proc(then_cmd)
else_js = self._handle_cmd_or_proc(else_cmd) if else_cmd else ""
if else_cmd:
return textwrap.dedent(f"""
if ({js_condition}) {{
{then_js}
}} else {{
{else_js}
}}
""").strip()
else:
return textwrap.dedent(f"""
if ({js_condition}) {{
{then_js}
}}
""").strip()
if op == "REPEAT":
cmd, count = a
# Handle the count - could be number or JS expression
if count.isdigit():
# Simple number
repeat_js = self._handle_cmd_or_proc(cmd)
return textwrap.dedent(f"""
for (let _i = 0; _i < {count}; _i++) {{
{repeat_js}
}}
""").strip()
else:
# JS expression (from backticks)
count_expr = count[1:-1] if count.startswith('`') and count.endswith('`') else count
repeat_js = self._handle_cmd_or_proc(cmd)
return textwrap.dedent(f"""
(()=>{{
const _count = {count_expr};
if (typeof _count === 'number') {{
for (let _i = 0; _i < _count; _i++) {{
{repeat_js}
}}
}} else if (_count) {{
{repeat_js}
}}
}})();
""").strip()
raise ValueError(f"Unhandled op {op}")
def _emit_condition(self, condition):
"""Convert a condition tuple to JavaScript"""
cond_type = condition[0]
if cond_type == "EXISTS":
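# Escape the selector the same way CLEAR and SET do, so quotes inside it cannot break the JS literal
sel = condition[1].replace("\\", "\\\\").replace("'", "\\'")
return f"!!document.querySelector('{sel}')"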
elif cond_type == "NOT":
# Recursively handle the negated condition
inner_condition = self._emit_condition(condition[1])
return f"!({inner_condition})"
else: # JS condition
return condition[1]
def _handle_cmd_or_proc(self, cmd):
"""Handle a command that might be a regular command or a procedure call"""
if not cmd:
return ""
if isinstance(cmd, Cmd):
if cmd.op == "CALL":
# Inline the procedure
if cmd.args[0] not in self.procs:
raise ValueError(f"Unknown procedure {cmd.args[0]!r}")
proc_body = self.procs[cmd.args[0]].body
return "\n".join([self._emit_js(c) for c in proc_body if c.op != "NOP"])
else:
return self._emit_js(cmd)
return ""
# --------------------------------------------------------------------------- #
# 5. Helpers + demo
# --------------------------------------------------------------------------- #
def compile_string(script: Union[str, List[str]], *, root: Union[pathlib.Path, None] = None) -> List[str]:
"""Compile C4A-Script from string or list of strings to JavaScript.
Args:
script: C4A-Script as a string or list of command strings
root: Root directory for resolving includes (optional)
Returns:
List of JavaScript command strings
Raises:
C4AScriptError: When compilation fails with detailed error information
"""
try:
return Compiler(root).compile(script)
except Exception as e:
# Wrap the error with better formatting
raise C4AScriptError.from_exception(e, script)
def compile_file(path: pathlib.Path) -> List[str]:
"""Compile C4A-Script from file to JavaScript.
Args:
path: Path to C4A-Script file
Returns:
List of JavaScript command strings
"""
return compile_string(path.read_text(), root=path.parent)
def compile_lines(lines: List[str], *, root: Union[pathlib.Path, None] = None) -> List[str]:
"""Compile C4A-Script from list of lines to JavaScript.
Args:
lines: List of C4A-Script command lines
root: Root directory for resolving includes (optional)
Returns:
List of JavaScript command strings
"""
return compile_string(lines, root=root)
DEMO = """
# quick sanity demo
PROC login
SET `input[name="username"]` $user
SET `input[name="password"]` $pass
CLICK `button.submit`
ENDPROC
SETVAR user = "tom@crawl4ai.com"
SETVAR pass = "hunter2"
GO https://example.com/login
WAIT `input[name="username"]` 10
login
WAIT 3
EVAL `console.log('logged in')`
"""
if __name__ == "__main__":
if len(sys.argv) == 2:
for js in compile_file(pathlib.Path(sys.argv[1])):
print(js)
else:
print("=== DEMO ===")
for js in compile_string(DEMO):
print(js)
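
The emitters above splice user-supplied text into JavaScript string literals by escaping single quotes by hand. A minimal sketch of an alternative, assuming the generated code only needs a valid JS string literal: let `json.dumps` produce the quoted literal, since it already escapes quotes, backslashes and newlines. The helper names `js_literal` and `emit_type` are hypothetical and are not part of the compiler shown in this diff.

```python
import json
import textwrap


def js_literal(text: str) -> str:
    """Return `text` as a double-quoted JavaScript string literal.

    json.dumps escapes quotes, backslashes and control characters, and its
    default ensure_ascii=True turns non-ASCII characters into \\uXXXX escapes.
    """
    return json.dumps(text)


def emit_type(text: str) -> str:
    """Hypothetical TYPE emitter: append `text` to the focused element's value."""
    return textwrap.dedent(f"""
        (()=>{{
          const el=document.activeElement;
          if(el){{
            el.value += {js_literal(text)};
            el.dispatchEvent(new Event('input',{{bubbles:true}}));
          }}
        }})();
    """).strip()


if __name__ == "__main__":
    # Quotes and backslashes in the input no longer need manual handling.
    print(emit_type('say "hi" to C:\\temp'))
```

The same helper could serve selectors and key names, which would keep every emitter on a single quoting rule.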


@@ -2939,3 +2939,212 @@ pip install -q nest_asyncio google-colab
echo "✅ Setup complete!"
''')
# Link Quality Scoring Functions
def extract_page_context(page_title: str, headlines_text: str, meta_description: str, base_url: str) -> dict:
"""
Extract page context for link scoring - called ONCE per page for performance.
Parser-agnostic function that takes pre-extracted data.
Args:
page_title: Title of the page
headlines_text: Combined text from h1, h2, h3 elements
meta_description: Meta description content
base_url: Base URL of the page
Returns:
Dictionary containing page context data for fast link scoring
"""
context = {
'terms': set(),
'headlines': headlines_text or '',
'meta_description': meta_description or '',
'domain': '',
'is_docs_site': False
}
try:
from urllib.parse import urlparse
parsed = urlparse(base_url)
context['domain'] = parsed.netloc.lower()
# Check if this is a documentation/reference site
context['is_docs_site'] = any(indicator in context['domain']
for indicator in ['docs.', 'api.', 'developer.', 'reference.'])
# Create term set for fast intersection (performance optimization)
all_text = ((page_title or '') + ' ' + context['headlines'] + ' ' + context['meta_description']).lower()
# Simple tokenization - fast and sufficient for scoring
context['terms'] = set(word.strip('.,!?;:"()[]{}')
for word in all_text.split()
if len(word.strip('.,!?;:"()[]{}')) > 2)
except Exception:
# Fail gracefully - return empty context
pass
return context
def calculate_link_intrinsic_score(
link_text: str,
url: str,
title_attr: str,
class_attr: str,
rel_attr: str,
page_context: dict
) -> float:
"""
Ultra-fast link quality scoring using only provided data (no DOM access needed).
Parser-agnostic function.
Args:
link_text: Text content of the link
url: Link URL
title_attr: Title attribute of the link
class_attr: Class attribute of the link
rel_attr: Rel attribute of the link
page_context: Pre-computed page context from extract_page_context()
Returns:
Quality score (0.0 - 10.0), higher is better
"""
score = 0.0
try:
# 1. ATTRIBUTE QUALITY (string analysis - very fast)
if title_attr and len(title_attr.strip()) > 3:
score += 1.0
class_str = (class_attr or '').lower()
# Navigation/important classes boost score
if any(nav_class in class_str for nav_class in ['nav', 'menu', 'primary', 'main', 'important']):
score += 1.5
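# Marketing/ad classes reduce score; 'ad'/'ads' must match a whole class-name token
# so that classes such as 'header' or 'loading' are not penalized
class_tokens = set(class_str.replace('-', ' ').replace('_', ' ').split())
if ({'ad', 'ads'} & class_tokens) or any(bad_class in class_str for bad_class in ['sponsor', 'track', 'promo', 'banner']):
score -= 1.0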
rel_str = (rel_attr or '').lower()
# Semantic rel values
if any(good_rel in rel_str for good_rel in ['canonical', 'next', 'prev', 'chapter']):
score += 1.0
if any(bad_rel in rel_str for bad_rel in ['nofollow', 'sponsored', 'ugc']):
score -= 0.5
# 2. URL STRUCTURE QUALITY (string operations - very fast)
url_lower = url.lower()
# High-value path patterns
if any(good_path in url_lower for good_path in ['/docs/', '/api/', '/guide/', '/tutorial/', '/reference/', '/manual/']):
score += 2.0
elif any(medium_path in url_lower for medium_path in ['/blog/', '/article/', '/post/', '/news/']):
score += 1.0
# Penalize certain patterns
if any(bad_path in url_lower for bad_path in ['/admin/', '/login/', '/cart/', '/checkout/', '/track/', '/click/']):
score -= 1.5
# URL depth (shallow URLs often more important)
url_depth = url.count('/') - 2 # Subtract protocol and domain
if url_depth <= 2:
score += 1.0
elif url_depth > 5:
score -= 0.5
# HTTPS bonus
if url.startswith('https://'):
score += 0.5
# 3. TEXT QUALITY (string analysis - very fast)
if link_text:
text_clean = link_text.strip()
if len(text_clean) > 3:
score += 1.0
# Multi-word links are usually more descriptive
word_count = len(text_clean.split())
if word_count >= 2:
score += 0.5
if word_count >= 4:
score += 0.5
# Avoid generic link text
generic_texts = ['click here', 'read more', 'more info', 'link', 'here']
if text_clean.lower() in generic_texts:
score -= 1.0
# 4. CONTEXTUAL RELEVANCE (pre-computed page terms - very fast)
if page_context.get('terms') and link_text:
link_words = set(word.strip('.,!?;:"()[]{}').lower()
for word in link_text.split()
if len(word.strip('.,!?;:"()[]{}')) > 2)
if link_words:
# Calculate word overlap ratio
overlap = len(link_words & page_context['terms'])
if overlap > 0:
# Divide by at most 10 words and clamp the ratio to 1.0 so relevance adds at most 2 points
relevance_ratio = min(overlap / min(len(link_words), 10), 1.0)
score += relevance_ratio * 2.0 # Up to 2 points for relevance
# 5. DOMAIN CONTEXT BONUSES (very fast string checks)
if page_context.get('is_docs_site', False):
# Documentation sites: prioritize internal navigation
if link_text and any(doc_keyword in link_text.lower()
for doc_keyword in ['api', 'reference', 'guide', 'tutorial', 'example']):
score += 1.0
except Exception:
# Fail gracefully - return minimal score
score = 0.5
# Ensure score is within reasonable bounds
return max(0.0, min(score, 10.0))
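
The contextual-relevance step (section 4 of the function above) can be read in isolation. The sketch below restates it as a standalone helper under the same tokenization rule: split on whitespace, strip punctuation, keep words longer than two characters. The names `tokenize` and `relevance_bonus` and the sample strings are illustrative assumptions. The ratio is also clamped to 1.0, a choice made in this sketch so that the bonus never exceeds 2 points.

```python
PUNCTUATION = '.,!?;:"()[]{}'


def tokenize(text: str) -> set:
    """Lowercase, split on whitespace, strip punctuation, keep words longer than 2 chars."""
    words = (word.strip(PUNCTUATION) for word in text.lower().split())
    return {word for word in words if len(word) > 2}


def relevance_bonus(link_text: str, page_terms: set) -> float:
    """Return 0.0 to 2.0 points based on word overlap between link text and page terms."""
    link_words = tokenize(link_text)
    if not link_words:
        return 0.0
    overlap = len(link_words & page_terms)
    # Divide by at most 10 words, then clamp so the bonus stays within 2 points.
    ratio = min(overlap / min(len(link_words), 10), 1.0)
    return ratio * 2.0


if __name__ == "__main__":
    # Page terms are computed once per page, then reused for every link.
    page_terms = tokenize("Crawl4AI API Reference: Deep Crawling Guide")
    print(relevance_bonus("API reference", page_terms))      # 2.0 (both words match)
    print(relevance_bonus("Deep dive", page_terms))          # 1.0 (one of two words matches)
    print(relevance_bonus("Pricing and plans", page_terms))  # 0.0 (no overlap)
```

Building the page term set once and intersecting it with each link's words keeps the per-link cost to a small set operation, which matches the "called ONCE per page" note in `extract_page_context`.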
def calculate_total_score(
intrinsic_score: Optional[float] = None,
contextual_score: Optional[float] = None,
score_links_enabled: bool = False,
query_provided: bool = False
) -> float:
"""
Calculate combined total score from intrinsic and contextual scores with smart fallbacks.
Args:
intrinsic_score: Quality score based on URL structure, text, and context (0-10)
contextual_score: BM25 relevance score based on query and head content (0-1 typically)
score_links_enabled: Whether link scoring is enabled
query_provided: Whether a query was provided for contextual scoring
Returns:
Combined total score (0-10 scale)
Scoring Logic:
- Scoring disabled: return 5.0 (neutral score)
- No query provided or no contextual score: return the intrinsic score clamped to 0-10
- Query and contextual score present: weighted combination (70% intrinsic, 30% contextual scaled to 0-10);
a missing intrinsic score counts as 0.0 in this combination
"""
# Case 1: No scoring enabled at all
if not score_links_enabled:
return 5.0 # Neutral score - all links treated equally
# Normalize scores to handle None values
intrinsic = intrinsic_score if intrinsic_score is not None else 0.0
contextual = contextual_score if contextual_score is not None else 0.0
# Case 2: Only intrinsic scoring (no query provided or no head extraction)
if not query_provided or contextual_score is None:
# Use intrinsic score directly (already 0-10 scale)
return max(0.0, min(intrinsic, 10.0))
# Case 3: Both intrinsic and contextual scores available
# Scale contextual score (typically 0-1) to 0-10 range
contextual_scaled = min(contextual * 10.0, 10.0)
# Weighted combination: 70% intrinsic (structure/content quality) + 30% contextual (query relevance)
# This gives more weight to link quality while still considering relevance
total = (intrinsic * 0.7) + (contextual_scaled * 0.3)
return max(0.0, min(total, 10.0))
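
A worked example of the weighting may help. The sketch below is a simplified restatement of the two scoring branches, assuming link scoring is enabled and a query was provided. `combine_scores` is a hypothetical name used only for illustration; the input values are made up.

```python
from typing import Optional


def combine_scores(intrinsic: Optional[float], contextual: Optional[float]) -> float:
    """Simplified restatement of the scoring branches (scoring enabled, query given)."""
    base = intrinsic if intrinsic is not None else 0.0
    if contextual is None:
        # No contextual score available: fall back to the clamped intrinsic score.
        return max(0.0, min(base, 10.0))
    # Bring the contextual score (typically 0-1) onto the 0-10 scale.
    contextual_scaled = min(contextual * 10.0, 10.0)
    total = base * 0.7 + contextual_scaled * 0.3
    return max(0.0, min(total, 10.0))


if __name__ == "__main__":
    # 0.7 * 6.0 + 0.3 * (0.5 * 10) is roughly 4.2 + 1.5
    print(round(combine_scores(6.0, 0.5), 2))
    # With no contextual score the intrinsic score passes through, clamped to 0-10.
    print(combine_scores(8.0, None))
    # With no intrinsic score only the 30% contextual share remains.
    print(round(combine_scores(None, 0.8), 2))
```

Note the asymmetry: a link with a strong contextual score but no intrinsic score can reach at most 3.0, because the intrinsic term defaults to zero rather than being dropped from the weighting.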


@@ -7901,7 +7901,7 @@ from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.extraction_strategy import (
+from crawl4ai import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
)
@@ -8301,7 +8301,7 @@ async def crawl_dynamic_content_pages_method_2():
async def cosine_similarity_extraction():
-from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai import CosineStrategy
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy(


@@ -354,7 +354,7 @@ In a typical scenario, you define **one** `BrowserConfig` for your crawler sessi
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def main():
# 1) Browser config: headless, bigger viewport, no proxy
@@ -1042,7 +1042,7 @@ You can combine content selection with a more advanced extraction strategy. For
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def main():
# Minimal schema for repeated items
@@ -1094,7 +1094,7 @@ import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import LLMExtractionStrategy
class ArticleData(BaseModel):
headline: str
@@ -1139,7 +1139,7 @@ Below is a short function that unifies **CSS selection**, **exclusion** logic, a
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def extract_main_articles(url: str):
schema = {
@@ -1488,7 +1488,7 @@ If you run a JSON-based extraction strategy (CSS, XPath, LLM, etc.), the structu
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def main():
schema = {
@@ -4722,7 +4722,7 @@ if __name__ == "__main__":
Once dynamic content is loaded, you can attach an **`extraction_strategy`** (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:
```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
schema = {
"name": "Commits",
@@ -4902,7 +4902,7 @@ Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. B
> **New!** Crawl4AI now provides a powerful utility to automatically generate extraction schemas using LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:
```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig
# Generate a schema (one-time cost)
@@ -4932,7 +4932,7 @@ Here's a basic extraction example:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def main():
schema = {
@@ -4987,7 +4987,7 @@ import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import LLMExtractionStrategy
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
@@ -5103,7 +5103,7 @@ Some sites require multiple “page clicks” or dynamic JavaScript updates. Bel
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
@@ -7300,7 +7300,7 @@ Here's an example of crawling GitHub commits across multiple pages while preserv
```python
from crawl4ai.async_configs import CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():
@@ -7850,7 +7850,7 @@ The Cosine Strategy:
## Basic Usage
```python
-from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai import CosineStrategy
strategy = CosineStrategy(
semantic_filter="product reviews", # Target content type
@@ -8161,7 +8161,7 @@ import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import LLMExtractionStrategy
class Product(BaseModel):
name: str
@@ -8278,7 +8278,7 @@ import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import LLMExtractionStrategy
class Entity(BaseModel):
name: str
@@ -8423,7 +8423,7 @@ Lets begin with a **simple** schema-based extraction using the `JsonCssExtrac
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def extract_crypto_prices():
# 1. Define a simple extraction schema
@@ -8493,7 +8493,7 @@ Below is a short example demonstrating **XPath** extraction plus the **`raw://`*
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
+from crawl4ai import JsonXPathExtractionStrategy
async def extract_crypto_prices_xpath():
# 1. Minimal dummy HTML with some repeating rows
@@ -8694,7 +8694,7 @@ Key Takeaways:
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
ecommerce_schema = {
# ... the advanced schema from above ...
@@ -8804,7 +8804,7 @@ While manually crafting schemas is powerful and precise, Crawl4AI now offers a c
The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4 or the open-source Ollama for schema generation:
```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai import LLMConfig
# Sample HTML with product information


@@ -14,3 +14,4 @@ anyio==4.9.0
PyJWT==2.10.1
mcp>=1.6.0
websockets>=15.0.1
httpx[http2]>=0.27.2

File diff suppressed because it is too large


@@ -5,7 +5,7 @@ prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json


@@ -5,7 +5,7 @@ prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext


@@ -5,7 +5,7 @@ prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json

Binary file not shown (image, 6.6 MiB).


@@ -0,0 +1,132 @@
<!DOCTYPE html>
<html>
<head>
<title>Append-Only Scroll (Traditional Infinite Scroll)</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 20px;
background-color: #f5f5f5;
}
h1 {
color: #333;
text-align: center;
}
.posts-container {
max-width: 800px;
margin: 0 auto;
background: white;
border: 1px solid #ddd;
border-radius: 5px;
padding: 20px;
}
.post {
background: #f9f9f9;
padding: 15px;
margin-bottom: 15px;
border-radius: 5px;
border: 1px solid #eee;
}
.post-title {
font-size: 18px;
font-weight: bold;
color: #2c3e50;
margin-bottom: 10px;
}
.post-content {
color: #555;
line-height: 1.6;
}
.loading {
text-align: center;
padding: 20px;
color: #888;
}
</style>
</head>
<body>
<h1>Traditional Infinite Scroll Demo</h1>
<p style="text-align: center; color: #666;">This appends new content without removing old content</p>
<div class="posts-container"></div>
<script>
// Traditional infinite scroll - APPENDS content
const container = document.querySelector('.posts-container');
const totalPosts = 200;
const postsPerPage = 20;
let loadedPosts = 0;
let isLoading = false;
// Generate fake post data
function generatePost(index) {
return {
id: index,
title: `Post Title #${index + 1}`,
content: `This is the content of post ${index + 1}. In traditional infinite scroll, new content is appended to existing content. The DOM keeps growing. Post ID: ${index}`
};
}
// Load more posts - APPENDS to existing content
function loadMorePosts() {
if (isLoading || loadedPosts >= totalPosts) return;
isLoading = true;
// Show loading indicator
const loadingDiv = document.createElement('div');
loadingDiv.className = 'loading';
loadingDiv.textContent = 'Loading more posts...';
container.appendChild(loadingDiv);
// Simulate network delay
setTimeout(() => {
// Remove loading indicator
container.removeChild(loadingDiv);
// Add new posts
const fragment = document.createDocumentFragment();
const endIndex = Math.min(loadedPosts + postsPerPage, totalPosts);
for (let i = loadedPosts; i < endIndex; i++) {
const post = generatePost(i);
const postElement = document.createElement('div');
postElement.className = 'post';
postElement.setAttribute('data-post-id', post.id);
postElement.innerHTML = `
<div class="post-title">${post.title}</div>
<div class="post-content">${post.content}</div>
`;
fragment.appendChild(postElement);
}
// APPEND new posts to existing ones
container.appendChild(fragment);
loadedPosts = endIndex;
isLoading = false;
console.log(`Loaded ${loadedPosts} of ${totalPosts} posts`);
}, 300);
}
// Initial load
loadMorePosts();
// Load more on scroll
window.addEventListener('scroll', () => {
const scrollBottom = window.innerHeight + window.scrollY;
const threshold = document.body.offsetHeight - 500;
if (scrollBottom >= threshold) {
loadMorePosts();
}
});
</script>
</body>
</html>


@@ -0,0 +1,158 @@
<!DOCTYPE html>
<html>
<head>
<title>Instagram-like Grid Virtual Scroll</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif;
margin: 0;
padding: 20px;
background-color: #fafafa;
}
h1 {
text-align: center;
color: #262626;
font-weight: 300;
}
.feed-container {
max-width: 935px;
margin: 0 auto;
height: 800px;
overflow-y: auto;
background: white;
border: 1px solid #dbdbdb;
border-radius: 3px;
}
.grid {
display: grid;
grid-template-columns: repeat(3, 1fr);
gap: 28px;
padding: 28px;
}
.post {
aspect-ratio: 1;
background: #f0f0f0;
border-radius: 3px;
position: relative;
overflow: hidden;
cursor: pointer;
}
.post:hover .overlay {
opacity: 1;
}
.post img {
width: 100%;
height: 100%;
object-fit: cover;
}
.overlay {
position: absolute;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: rgba(0, 0, 0, 0.3);
display: flex;
align-items: center;
justify-content: center;
color: white;
font-size: 14px;
opacity: 0;
transition: opacity 0.2s;
}
.stats {
display: flex;
gap: 20px;
}
</style>
</head>
<body>
<h1>Instagram Grid Virtual Scroll</h1>
<p style="text-align: center; color: #8e8e8e;">Grid layout with virtual scrolling - only visible rows are rendered</p>
<div class="feed-container">
<div class="grid" id="grid"></div>
</div>
<script>
// Instagram-like grid virtual scroll
const grid = document.getElementById('grid');
const container = document.querySelector('.feed-container');
const totalPosts = 999; // Instagram style count
const postsPerRow = 3;
const rowsPerPage = 4; // 12 posts per page
const postsPerPage = postsPerRow * rowsPerPage;
let currentStartIndex = 0;
// Generate fake Instagram post data
const allPosts = [];
for (let i = 0; i < totalPosts; i++) {
allPosts.push({
id: i,
likes: Math.floor(Math.random() * 10000),
comments: Math.floor(Math.random() * 500),
imageNumber: (i % 10) + 1 // Cycle through 10 placeholder images
});
}
// Render grid - REPLACES content for performance
function renderGrid(startIndex) {
const posts = [];
const endIndex = Math.min(startIndex + postsPerPage, totalPosts);
for (let i = startIndex; i < endIndex; i++) {
const post = allPosts[i];
posts.push(`
<div class="post" data-post-id="${post.id}">
<img src="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='400' height='400'%3E%3Crect width='400' height='400' fill='%23${Math.floor(Math.random()*16777215).toString(16).padStart(6, '0')}'/%3E%3Ctext x='50%25' y='50%25' text-anchor='middle' dy='.3em' font-family='Arial' font-size='48' fill='white'%3E${post.id + 1}%3C/text%3E%3C/svg%3E" alt="Post ${post.id + 1}">
<div class="overlay">
<div class="stats">
<span>❤️ ${post.likes.toLocaleString()}</span>
<span>💬 ${post.comments}</span>
</div>
</div>
</div>
`);
}
// REPLACE grid content (virtual scroll)
grid.innerHTML = posts.join('');
currentStartIndex = startIndex;
}
// Initial render
renderGrid(0);
// Handle scroll
let scrollTimeout;
container.addEventListener('scroll', () => {
clearTimeout(scrollTimeout);
scrollTimeout = setTimeout(() => {
const scrollTop = container.scrollTop;
const scrollHeight = container.scrollHeight;
const clientHeight = container.clientHeight;
// Calculate which "page" we should show
const scrollPercentage = scrollTop / (scrollHeight - clientHeight);
const targetIndex = Math.floor(scrollPercentage * (totalPosts - postsPerPage) / postsPerPage) * postsPerPage;
// When scrolled to bottom, show next page
if (scrollTop + clientHeight >= scrollHeight - 100) {
const nextIndex = currentStartIndex + postsPerPage;
if (nextIndex < totalPosts) {
renderGrid(nextIndex);
container.scrollTop = 100; // Reset scroll for continuous experience
}
}
}, 50);
});
</script>
</body>
</html>


@@ -0,0 +1,210 @@
<!DOCTYPE html>
<html>
<head>
<title>News Feed with Mixed Scroll Behavior</title>
<style>
body {
font-family: Georgia, serif;
margin: 0;
padding: 20px;
background-color: #f8f8f8;
}
h1 {
text-align: center;
color: #1a1a1a;
font-size: 32px;
margin-bottom: 10px;
}
.description {
text-align: center;
color: #666;
margin-bottom: 20px;
}
#newsContainer {
max-width: 900px;
margin: 0 auto;
height: 700px;
overflow-y: auto;
background: white;
box-shadow: 0 2px 10px rgba(0,0,0,0.1);
padding: 20px;
}
.article {
margin-bottom: 30px;
padding-bottom: 30px;
border-bottom: 1px solid #e0e0e0;
}
.article:last-child {
border-bottom: none;
}
.article-header {
margin-bottom: 15px;
}
.category {
display: inline-block;
background: #ff6b6b;
color: white;
padding: 4px 12px;
font-size: 12px;
text-transform: uppercase;
border-radius: 3px;
margin-bottom: 10px;
}
.headline {
font-size: 24px;
font-weight: bold;
color: #1a1a1a;
margin: 10px 0;
line-height: 1.3;
}
.meta {
color: #888;
font-size: 14px;
margin-bottom: 15px;
}
.content {
font-size: 16px;
line-height: 1.8;
color: #333;
}
.featured {
background: #fff9e6;
padding: 20px;
border-radius: 5px;
margin-bottom: 30px;
}
.featured .category {
background: #ffa500;
}
</style>
</head>
<body>
<h1>📰 Dynamic News Feed</h1>
<p class="description">Mixed behavior: Featured articles stay, regular articles use virtual scroll</p>
<div id="newsContainer"></div>
<script>
const container = document.getElementById('newsContainer');
const totalArticles = 100;
const articlesPerPage = 5;
let currentRegularIndex = 0;
// Categories for variety
const categories = ['Politics', 'Technology', 'Business', 'Science', 'Sports', 'Entertainment'];
// Generate article data
const featuredArticles = [];
const regularArticles = [];
// 3 featured articles that always stay
for (let i = 0; i < 3; i++) {
featuredArticles.push({
id: `featured-${i}`,
category: 'Featured',
headline: `Breaking: Major Story ${i + 1} That Stays Visible`,
date: new Date().toLocaleDateString(),
content: `This is featured article ${i + 1}. Featured articles remain in the DOM and are not replaced during scrolling. They provide important persistent content.`
});
}
// Regular articles that get virtualized
for (let i = 0; i < totalArticles; i++) {
regularArticles.push({
id: `article-${i}`,
category: categories[i % categories.length],
headline: `${categories[i % categories.length]} News: Article ${i + 1} of ${totalArticles}`,
date: new Date(Date.now() - i * 86400000).toLocaleDateString(),
content: `This is regular article ${i + 1}. These articles are replaced as you scroll to maintain performance. Only a subset is shown at any time. Article ID: ${i}`
});
}
// Render articles - Featured stay, regular ones are replaced
function renderArticles(regularStartIndex) {
const html = [];
// Always show featured articles
featuredArticles.forEach(article => {
html.push(`
<div class="article featured" data-article-id="${article.id}">
<div class="article-header">
<span class="category">${article.category}</span>
<h2 class="headline">${article.headline}</h2>
<div class="meta">📅 ${article.date}</div>
</div>
<div class="content">${article.content}</div>
</div>
`);
});
// Add divider
html.push('<div style="text-align: center; color: #999; margin: 20px 0;">— Latest News —</div>');
// Show current page of regular articles (virtual scroll)
const endIndex = Math.min(regularStartIndex + articlesPerPage, totalArticles);
for (let i = regularStartIndex; i < endIndex; i++) {
const article = regularArticles[i];
html.push(`
<div class="article" data-article-id="${article.id}">
<div class="article-header">
<span class="category" style="background: ${getCategoryColor(article.category)}">${article.category}</span>
<h2 class="headline">${article.headline}</h2>
<div class="meta">📅 ${article.date}</div>
</div>
<div class="content">${article.content}</div>
</div>
`);
}
container.innerHTML = html.join('');
currentRegularIndex = regularStartIndex;
}
function getCategoryColor(category) {
const colors = {
'Politics': '#e74c3c',
'Technology': '#3498db',
'Business': '#2ecc71',
'Science': '#9b59b6',
'Sports': '#f39c12',
'Entertainment': '#e91e63'
};
return colors[category] || '#95a5a6';
}
// Initial render
renderArticles(0);
// Handle scroll
container.addEventListener('scroll', () => {
const scrollTop = container.scrollTop;
const scrollHeight = container.scrollHeight;
const clientHeight = container.clientHeight;
// When near bottom, load next page of regular articles
if (scrollTop + clientHeight >= scrollHeight - 200) {
const nextIndex = currentRegularIndex + articlesPerPage;
if (nextIndex < totalArticles) {
renderArticles(nextIndex);
// Scroll to where regular articles start
const regularStart = document.querySelector('.article:not(.featured)');
if (regularStart) {
container.scrollTop = regularStart.offsetTop - 100;
}
}
}
});
</script>
</body>
</html>


@@ -0,0 +1,122 @@
<!DOCTYPE html>
<html>
<head>
<title>Twitter-like Virtual Scroll</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 20px;
background-color: #f0f2f5;
}
h1 {
color: #1da1f2;
text-align: center;
}
#timeline {
max-width: 600px;
margin: 0 auto;
height: 600px;
overflow-y: auto;
background: white;
border: 1px solid #e1e8ed;
border-radius: 10px;
}
.tweet {
padding: 15px;
border-bottom: 1px solid #e1e8ed;
min-height: 80px;
}
.tweet:hover {
background-color: #f7f9fa;
}
.author {
font-weight: bold;
color: #14171a;
margin-bottom: 5px;
}
.content {
color: #14171a;
line-height: 1.5;
}
.stats {
color: #657786;
font-size: 14px;
margin-top: 10px;
}
</style>
</head>
<body>
<h1>Virtual Scroll Demo - Twitter Style</h1>
<p style="text-align: center; color: #666;">This simulates Twitter's timeline where content is replaced as you scroll</p>
<div id="timeline"></div>
<script>
// Simulate Twitter-like virtual scrolling where DOM elements are replaced
const timeline = document.getElementById('timeline');
const totalTweets = 500;
const tweetsPerPage = 10;
let currentIndex = 0;
// Generate fake tweet data
const allTweets = [];
for (let i = 0; i < totalTweets; i++) {
allTweets.push({
id: i,
author: `User_${i + 1}`,
content: `This is tweet #${i + 1} of ${totalTweets}. Virtual scrolling replaces DOM elements to maintain performance. Unique content ID: ${i}`,
likes: Math.floor(Math.random() * 1000),
retweets: Math.floor(Math.random() * 500)
});
}
// Render tweets - REPLACES content
function renderTweets(startIndex) {
const tweets = [];
const endIndex = Math.min(startIndex + tweetsPerPage, totalTweets);
for (let i = startIndex; i < endIndex; i++) {
const tweet = allTweets[i];
tweets.push(`
<div class="tweet" data-tweet-id="${tweet.id}">
<div class="author">@${tweet.author}</div>
<div class="content">${tweet.content}</div>
<div class="stats">❤️ ${tweet.likes} | 🔁 ${tweet.retweets}</div>
</div>
`);
}
// REPLACE entire content (virtual scroll behavior)
timeline.innerHTML = tweets.join('');
currentIndex = startIndex;
}
// Initial render
renderTweets(0);
// Handle scroll
timeline.addEventListener('scroll', () => {
const scrollTop = timeline.scrollTop;
const scrollHeight = timeline.scrollHeight;
const clientHeight = timeline.clientHeight;
// When near bottom, load next page
if (scrollTop + clientHeight >= scrollHeight - 100) {
const nextIndex = currentIndex + tweetsPerPage;
if (nextIndex < totalTweets) {
renderTweets(nextIndex);
// Small scroll adjustment for continuous scrolling
timeline.scrollTop = 50;
}
}
});
</script>
</body>
</html>

View File

@@ -0,0 +1,171 @@
# Amazon R2D2 Product Search Example
A real-world demonstration of Crawl4AI's multi-step crawling with LLM-generated automation scripts.
## 🎯 What This Example Shows
This example demonstrates advanced Crawl4AI features:
- **LLM-Generated Scripts**: Automatically create C4A-Script from HTML snippets
- **Multi-Step Crawling**: Navigate through multiple pages using session persistence
- **Structured Data Extraction**: Extract product data using JSON CSS schemas
- **Visual Automation**: Watch the browser perform the search (headless=False)
## 🚀 How It Works
### 1. **Script Generation Phase**
The example analyzes Amazon's HTML snippets with an LLM and creates:
- **Search Script**: `C4ACompiler.generate_script()` automates filling the search box and clicking search
- **Extraction Schema**: `JsonCssExtractionStrategy.generate_schema()` defines how to extract product information
### 2. **Crawling Workflow**
```
Homepage → Execute Search Script → Extract Products → Save Results
```
All steps use the same `session_id` to maintain browser state.
### 3. **Data Extraction**
Products are extracted with:
- Title, price, rating, reviews
- Delivery information
- Sponsored/Small Business badges
- Direct product URLs
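The generated schema in this example uses `"type": "text"` for these fields, so values arrive as strings such as `"$29.95"` or `"4.7 out of 5 stars"`. Below is a minimal sketch of converting them to numbers, assuming the field names from `generated_product_schema.json`. The helper names (`parse_price`, `parse_rating`, `parse_count`, `normalize_product`) are hypothetical and not part of Crawl4AI.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Turn a price string such as "$29.95" into a float; None if no number is found."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def parse_rating(text: Optional[str]) -> Optional[float]:
    """Take the leading number from text such as "4.7 out of 5 stars"."""
    if not text:
        return None
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_count(text: Optional[str]) -> Optional[int]:
    """Turn a review count such as "3,414" into an int."""
    if not text:
        return None
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


def normalize_product(raw: dict) -> dict:
    """Return a typed copy of one extracted record; missing keys become None or False."""
    return {
        "title": raw.get("title"),
        "price": parse_price(raw.get("price")),
        "rating": parse_rating(raw.get("rating")),
        "number_of_reviews": parse_count(raw.get("number_of_reviews")),
        "sponsored": raw.get("sponsored") == "Sponsored",
    }
```

Using `.get()` throughout matters here because not every product carries every field (for example, some records have no badge or delivery text).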
## 📁 Files
- `amazon_r2d2_search.py` - Main example script
- `header.html` - Amazon search bar HTML (provided)
- `product.html` - Product card HTML (provided)
- **Generated files:**
- `generated_search_script.js` - Auto-generated search automation
- `generated_product_schema.json` - Auto-generated extraction rules
- `extracted_products.json` - Final scraped data
## 🏃 Running the Example
1. **Prerequisites**
```bash
# Ensure Crawl4AI is installed
pip install crawl4ai
# Set up LLM API key (for script generation)
export OPENAI_API_KEY="your-key-here"
```
2. **Run the scraper**
```bash
python amazon_r2d2_search.py
```
3. **Watch the magic!**
- Browser window opens (not headless)
- Navigates to Amazon.com
- Searches for "r2d2"
- Extracts all products
- Saves results to JSON
## 📊 Sample Output
```json
[
  {
    "title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees • Great Gift IDEA from Moms, DADS and Kids -",
    "price": "$29.95",
    "rating": "4.7 out of 5 stars",
    "number_of_reviews": "184",
    "delivery_info": "FREE delivery",
    "product_url": "/sspa/click?ie=UTF8&spc=...",
    "sponsored": "Sponsored",
    "small_business_badge": "Small Business"
  },
  ...
]
```
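Sponsored results in the `extracted_products.json` produced by this example come back as relative redirect links of the form `/sspa/click?...&url=...`, with the real product path percent-encoded in the `url` parameter. Below is a minimal sketch of recovering a direct product URL with the standard library. The function name `resolve_product_url` and the rule of cutting the path at `/ref=` are assumptions made for illustration, not Crawl4AI behavior.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE_URL = "https://www.amazon.com"


def resolve_product_url(href: str, base_url: str = BASE_URL) -> str:
    """Resolve a relative or sponsored-redirect href to an absolute product URL."""
    # Sponsored links wrap the real path in a percent-encoded "url" parameter;
    # parse_qs decodes it for us.
    target = parse_qs(urlsplit(href).query).get("url", [None])[0]
    path = urlsplit(target if target else href).path
    # Assumption: everything from "/ref=" onward is tracking data and can be dropped.
    path = path.split("/ref=")[0]
    return urljoin(base_url, path)
```

For example, a sponsored href whose `url` parameter decodes to `/Death-Star-R2D2-Balls-Printed/dp/B081XSYZMS/ref=sr_1_1_sspa?...` resolves to `https://www.amazon.com/Death-Star-R2D2-Balls-Printed/dp/B081XSYZMS`.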
## 🔍 Key Features Demonstrated
### Session Persistence
```python
# Same session_id across multiple arun() calls
config = CrawlerRunConfig(
    session_id="amazon_r2d2_session",
    # ... other settings
)
```
### LLM Script Generation
```python
# Generate automation from natural language + HTML
script = C4ACompiler.generate_script(
    html=header_html,
    query="Find search box, type 'r2d2', click search",
    mode="js"  # the example script requests JavaScript output
)
```
### JSON CSS Extraction
```python
# Structured data extraction with CSS selectors
schema = {
    "baseSelector": "[data-component-type='s-search-result']",
    "fields": [
        {"name": "title", "selector": "h2 a span", "type": "text"},
        {"name": "price", "selector": ".a-price-whole", "type": "text"}
    ]
}
```
## 🛠️ Customization
### Search Different Products
Change the search term in the script generation:
```python
search_goal = """
...
3. Type "star wars lego" into the search box
...
"""
```
### Extract More Data
Add fields to the extraction schema:
```python
"fields": [
# ... existing fields
{"name": "prime", "selector": ".s-prime", "type": "exists"},
{"name": "image_url", "selector": "img.s-image", "type": "attribute", "attribute": "src"}
]
```
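Because the example script reuses `generated_product_schema.json` whenever it exists, extra fields have to be merged into that cached schema (or the file deleted so it is regenerated). Below is a minimal sketch of such a merge, where a field with an existing name replaces the old definition. The helper name `add_fields` is hypothetical.

```python
import copy


def add_fields(schema: dict, extra_fields: list) -> dict:
    """Return a copy of schema with extra_fields merged in, keyed by field name."""
    merged = copy.deepcopy(schema)
    fields = merged.setdefault("fields", [])
    # Map each existing field name to its position so duplicates replace in place.
    position = {field["name"]: i for i, field in enumerate(fields)}
    for field in extra_fields:
        name = field["name"]
        if name in position:
            fields[position[name]] = dict(field)
        else:
            position[name] = len(fields)
            fields.append(dict(field))
    return merged
```

The merged dictionary can then be written back with `json.dumps(merged, indent=2)` so the next run picks it up.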
### Use Different Sites
Adapt the approach for other e-commerce sites by:
1. Providing their HTML snippets
2. Adjusting the search goals
3. Updating the extraction schema
## 🎓 Learning Points
1. **No Manual Scripting**: LLM generates all automation code
2. **Session Management**: Maintain state across page navigations
3. **Robust Extraction**: Handle dynamic content and multiple products
4. **Error Handling**: Generation failures are caught and reported with a clear message
## 🐛 Troubleshooting
- **"No products found"**: Check if Amazon's HTML structure changed
- **"Script generation failed"**: Ensure LLM API key is configured
- **"Page timeout"**: Increase wait times in the config
- **"Session lost"**: Ensure same session_id is used consistently
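When results look incomplete rather than empty, it can help to see which schema fields stopped matching. Below is a minimal sketch that reports, for each field, the fraction of extracted records in which it is filled. The helper name `field_coverage` is hypothetical, and the input is assumed to be the list loaded from `extracted_products.json`.

```python
from collections import Counter


def field_coverage(products: list, field_names: list) -> dict:
    """Return {field name: fraction of products where the field is non-empty}."""
    total = len(products)
    filled = Counter()
    for product in products:
        for name in field_names:
            if product.get(name) not in (None, "", []):
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in field_names}
```

A field whose coverage drops to `0.0` is a good first candidate for a selector that no longer matches Amazon's markup.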
## 📚 Next Steps
- Try searching for different products
- Add pagination to get more results
- Extract product details pages
- Compare prices across different sellers
- Build a price monitoring system
---
This example shows the power of combining LLM intelligence with web automation. Because the scripts are generated from HTML snippets, they can be regenerated when the page structure changes, and natural-language instructions make automation accessible to everyone!

View File

@@ -0,0 +1,202 @@
#!/usr/bin/env python3
"""
Amazon R2D2 Product Search Example using Crawl4AI

This example demonstrates:
1. Using an LLM to generate a search script (JavaScript) from HTML snippets
2. Multi-step crawling with session persistence
3. JSON CSS extraction for structured product data
4. Complete workflow: homepage → search → extract products

Requirements:
- Crawl4AI with generate_script support
- LLM API key (configured in environment)
"""
import asyncio
import json
from pathlib import Path
from typing import Any, Dict

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.script.c4a_compile import C4ACompiler


class AmazonR2D2Scraper:
    def __init__(self):
        self.base_dir = Path(__file__).parent
        self.search_script_path = self.base_dir / "generated_search_script.js"
        self.schema_path = self.base_dir / "generated_product_schema.json"
        self.results_path = self.base_dir / "extracted_products.json"
        self.session_id = "amazon_r2d2_session"

    async def generate_search_script(self) -> str:
        """Generate JavaScript for Amazon search interaction"""
        print("🔧 Generating search script from header.html...")

        # Check if already generated
        if self.search_script_path.exists():
            print("✅ Using cached search script")
            return self.search_script_path.read_text()

        # Read the header HTML
        header_html = (self.base_dir / "header.html").read_text()

        # Generate script using LLM
        search_goal = """
        Find the search box and search button, then:
        1. Wait for the search box to be visible
        2. Click on the search box to focus it
        3. Clear any existing text
        4. Type "r2d2" into the search box
        5. Click the search submit button
        6. Wait for navigation to complete and search results to appear
        """

        try:
            script = C4ACompiler.generate_script(
                html=header_html,
                query=search_goal,
                mode="js"
            )
            # Save for future use
            self.search_script_path.write_text(script)
            print("✅ Search script generated and saved!")
            print(f"📄 Script:\n{script}")
            return script
        except Exception as e:
            print(f"❌ Error generating search script: {e}")
            raise  # Re-raise so the caller never receives None as the script

    async def generate_product_schema(self) -> Dict[str, Any]:
        """Generate JSON CSS extraction schema from product HTML"""
        print("\n🔧 Generating product extraction schema...")

        # Check if already generated
        if self.schema_path.exists():
            print("✅ Using cached extraction schema")
            return json.loads(self.schema_path.read_text())

        # Read the product HTML
        product_html = (self.base_dir / "product.html").read_text()

        # Generate extraction schema using LLM
        schema_goal = """
        Create a JSON CSS extraction schema to extract:
        - Product title (from the h2 element)
        - Price (the dollar amount)
        - Rating (star rating value)
        - Number of reviews
        - Delivery information
        - Product URL (from the main product link)
        - Whether it's sponsored
        - Small business badge if present
        The schema should handle multiple products on a search results page.
        """

        try:
            # Ask the LLM for a JSON CSS schema that matches the product HTML
            schema = JsonCssExtractionStrategy.generate_schema(
                html=product_html,
                query=schema_goal,
            )
            # Save for future use
            self.schema_path.write_text(json.dumps(schema, indent=2))
            print("✅ Extraction schema generated and saved!")
            print(f"📄 Schema fields: {[f['name'] for f in schema['fields']]}")
            return schema
        except Exception as e:
            print(f"❌ Error generating schema: {e}")
            raise  # Re-raise so the caller never receives None as the schema

    async def crawl_amazon(self):
        """Main crawling logic: one arun() call that searches and extracts"""
        print("\n🚀 Starting Amazon R2D2 product search...")

        # Generate scripts and schemas
        search_script = await self.generate_search_script()
        product_schema = await self.generate_product_schema()

        # Configure browser (headless=False to see the action)
        browser_config = BrowserConfig(
            headless=False,
            verbose=True,
            viewport_width=1920,
            viewport_height=1080
        )

        async with AsyncWebCrawler(config=browser_config) as crawler:
            print("\n📍 Step 1: Navigate to Amazon and search for R2D2")

            # Navigate to Amazon, execute the search, and extract in the same session
            search_config = CrawlerRunConfig(
                session_id=self.session_id,
                js_code=f"(() => {{ {search_script} }})()",  # Execute generated JS
                wait_for=".s-search-results",  # Wait for search results
                extraction_strategy=JsonCssExtractionStrategy(schema=product_schema),
                delay_before_return_html=3.0  # Give time for results to load
            )

            result = await crawler.arun(
                url="https://www.amazon.com",
                config=search_config
            )

            if not result.success:
                print("❌ Failed to search Amazon")
                print(f"Error: {result.error_message}")
                return

            print("✅ Search completed successfully!")

            # Extract and save results
            print("\n📍 Extracting product data")
            if result.extracted_content:
                products = json.loads(result.extracted_content)
                print(f"✅ Extracted {len(products)} R2D2 products")

                # Save results
                self.results_path.write_text(
                    json.dumps(products, indent=2)
                )
                print(f"💾 Results saved to: {self.results_path}")

                # Print sample results; fields may be missing, so use .get()
                print("\n📊 Sample Results:")
                for i, product in enumerate(products[:3], 1):
                    print(f"\n{i}. {product.get('title', '')[:60]}...")
                    # The extracted price already includes the currency symbol
                    print(f"   Price: {product.get('price', 'N/A')}")
                    print(f"   Rating: {product.get('rating', 'N/A')} ({product.get('number_of_reviews', 'N/A')} reviews)")
                    if product.get('small_business_badge'):
                        print("   🏪 Small Business")
                    if product.get('sponsored'):
                        print("   📢 Sponsored")
            else:
                print("❌ No products extracted")


async def main():
    """Run the Amazon scraper"""
    scraper = AmazonR2D2Scraper()
    await scraper.crawl_amazon()
    print("\n🎉 Amazon R2D2 search example completed!")
    print("Check the generated files:")
    print("  - generated_search_script.js")
    print("  - generated_product_schema.json")
    print("  - extracted_products.json")


if __name__ == "__main__":
    asyncio.run(main())

View File

@@ -0,0 +1,114 @@
[
{
"title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees \u2022 Great Gift IDEA from Moms, DADS and Kids -",
"price": "$29.95",
"rating": "4.7 out of 5 stars",
"number_of_reviews": "184",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1",
"sponsored": "Sponsored",
"small_business_badge": "Small Business"
},
{
"title": "TEENKON French Press Insulated 304 Stainless Steel Coffee Maker, 32 Oz Robot R2D2 Hand Home Coffee Presser, with Filter Screen for Brew Coffee and Tea (White)",
"price": "$49.99",
"rating": "4.3 out of 5 stars",
"number_of_reviews": "82",
"delivery_info": "Delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDAzNzc4Njg4MDAwMjo6MDo6&url=%2FTEENKON-French-Insulated-Stainless-Presser%2Fdp%2FB0CD3HH5PN%2Fref%3Dsr_1_17_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-17-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "3D Illusion LED Night Light,7 Colors Gradual Changing Touch Switch USB Table Lamp for Holiday Gifts or Home Decorations (R2-D2)",
"price": "$9.97",
"rating": "4.3 out of 5 stars",
"number_of_reviews": "235",
"delivery_info": "Delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA0NjMwMTQwODA4MTo6MDo6&url=%2FIllusion-Gradual-Changing-Holiday-Decorations%2Fdp%2FB089NMBKF2%2Fref%3Dsr_1_18_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-18-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "Paladone Star Wars R2-D2 Headlamp with Droid Sounds, Officially Licensed Disney Star Wars Head Lamp and Reading Light",
"price": "$21.99",
"rating": "4.1 out of 5 stars",
"number_of_reviews": "66",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDI1NjA0MDQwMTUwMjo6MDo6&url=%2FSounds-Officially-Licensed-Headlamp-Flashlight%2Fdp%2FB09RTDZF8J%2Fref%3Dsr_1_19_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-19-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "4 Pcs Set Star Wars Kylo Ren BB8 Stormtrooper R2D2 Silicone Travel Luggage Baggage Identification Labels ID Tag for Bag Suitcase Plane Cruise Ships with Belt Strap",
"price": "$16.99",
"rating": "4.7 out of 5 stars",
"number_of_reviews": "3,414",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDAyMzk3ODkwMzIxMTo6MDo6&url=%2FFinex-Set-Suitcase-Adjustable-Stormtrooper%2Fdp%2FB01D1CBFJS%2Fref%3Dsr_1_24_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-24-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored",
"small_business_badge": "Small Business"
},
{
"title": "Papyrus Star Wars Birthday Card Assortment, Darth Vader, Storm Trooper, and R2-D2 (3-Count)",
"price": "$23.16",
"rating": "4.8 out of 5 stars",
"number_of_reviews": "328",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDcwNzI4MjA1MzcwMjo6MDo6&url=%2FPapyrus-Birthday-Assortment-Characters-3-Count%2Fdp%2FB07YT2ZPKX%2Fref%3Dsr_1_25_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-25-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "STAR WARS R2-D2 Artoo 3D Top Motion Lamp, Mood Light | 18 Inches",
"price": "$69.99",
"rating": "4.5 out of 5 stars",
"number_of_reviews": "520",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA5NDc3MzczMTQ0MTo6MDo6&url=%2FR2-D2-Artoo-Motion-Light-Inches%2Fdp%2FB08MCWPHQR%2Fref%3Dsr_1_26_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-26-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "Saturday Park Star Wars Droids Full Sheet Set - 4 Piece 100% Organic Cotton Sheets Features R2-D2 & BB-8 - GOTS & Oeko-TEX Certified (Star Wars Official)",
"price": "$70.00",
"rating": "4.5 out of 5 stars",
"number_of_reviews": "388",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDAyMzI0NDI5MDQwMjo6MDo6&url=%2FSaturday-Park-Star-Droids-Sheet%2Fdp%2FB0BBSFX4J2%2Fref%3Dsr_1_27_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-27-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored",
"small_business_badge": "1 sustainability feature"
},
{
"title": "AQUARIUS Star Wars R2D2 Action Figure Funky Chunky Novelty Magnet for Refrigerator, Locker, Whiteboard & Game Room Officially Licensed Merchandise & Collectibles",
"price": "$11.94",
"rating": "4.3 out of 5 stars",
"number_of_reviews": "10",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDA5MDMwMzY5NjEwMjo6MDo6&url=%2FAQUARIUS-Refrigerator-Whiteboard-Merchandise-Collectibles%2Fdp%2FB09W8VKXGC%2Fref%3Dsr_1_32_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-32-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "STAR WARS C-3PO and R2-D2 Men's Crew Socks 2 Pair Pack",
"price": "$11.95",
"rating": "4.7 out of 5 stars",
"number_of_reviews": "1,272",
"delivery_info": "Delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDAxMDk5NDkyMTg2MTo6MDo6&url=%2FStar-Wars-R2-D2-C-3PO-Socks%2Fdp%2FB0178IU1GY%2Fref%3Dsr_1_33_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-33-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored"
},
{
"title": "Buckle-Down Belt Women's Cinch Star Wars R2D2 Bounding Parts3 White Black Blue Gray Available In Adjustable Sizes",
"price": "$24.95",
"rating": "4.3 out of 5 stars",
"number_of_reviews": "32",
"delivery_info": "FREE delivery",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDY1OTQ5NTQ4MzkwMjo6MDo6&url=%2FWomens-Cinch-Bounding-Parts3-Inches%2Fdp%2FB07WK7RG4D%2Fref%3Dsr_1_34_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-34-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored",
"small_business_badge": "Small Business"
},
{
"title": "Star Wars R2D2 Metal Head Vintage Disney+ T-Shirt",
"price": "$22.99",
"rating": "4.8 out of 5 stars",
"number_of_reviews": "869",
"product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA1OTUyMzgzNDMyMTo6MDo6&url=%2FStar-Wars-Vintage-Graphic-T-Shirt%2Fdp%2FB07H9PSNXS%2Fref%3Dsr_1_35_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-35-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1",
"sponsored": "Sponsored",
"small_business_badge": "1 sustainability feature"
}
]

View File

@@ -0,0 +1,47 @@
{
"name": "Amazon Product Search Results",
"baseSelector": "div[data-component-type='s-impression-counter']",
"fields": [
{
"name": "title",
"selector": "h2.a-size-base-plus.a-spacing-none.a-color-base.a-text-normal span",
"type": "text"
},
{
"name": "price",
"selector": "span.a-price > span.a-offscreen",
"type": "text"
},
{
"name": "rating",
"selector": "i.a-icon-star-small span.a-icon-alt",
"type": "text"
},
{
"name": "number_of_reviews",
"selector": "a.a-link-normal.s-underline-text span.a-size-base",
"type": "text"
},
{
"name": "delivery_info",
"selector": "div[data-cy='delivery-recipe'] span.a-color-base",
"type": "text"
},
{
"name": "product_url",
"selector": "a.a-link-normal.s-no-outline",
"type": "attribute",
"attribute": "href"
},
{
"name": "sponsored",
"selector": "span.puis-label-popover-default span.a-color-secondary",
"type": "text"
},
{
"name": "small_business_badge",
"selector": "span.a-size-base.a-color-base",
"type": "text"
}
]
}

View File

@@ -0,0 +1,9 @@
const searchBox = document.querySelector('#twotabsearchtextbox');
const searchButton = document.querySelector('#nav-search-submit-button');
if (searchBox && searchButton) {
searchBox.focus();
searchBox.value = '';
searchBox.value = 'r2d2';
searchButton.click();
}

View File

@@ -0,0 +1,214 @@
<div id="nav-belt" style="width: 100%;">
<div class="nav-left">
<script type="text/javascript">window.navmet.tmp = +new Date();</script>
<div id="nav-logo">
<a href="/ref=nav_logo" id="nav-logo-sprites" class="nav-logo-link nav-progressive-attribute"
aria-label="Amazon" lang="en">
<span class="nav-sprite nav-logo-base"></span>
<span id="logo-ext" class="nav-sprite nav-logo-ext nav-progressive-content"></span>
<span class="nav-logo-locale">.us</span>
</a>
</div>
<script
type="text/javascript">window.navmet.push({ key: 'Logo', end: +new Date(), begin: window.navmet.tmp });</script>
<div id="nav-global-location-slot">
<span id="nav-global-location-data-modal-action" class="a-declarative nav-progressive-attribute"
data-a-modal="{&quot;width&quot;:375, &quot;closeButton&quot;:&quot;true&quot;,&quot;popoverLabel&quot;:&quot;Choose your location&quot;, &quot;ajaxHeaders&quot;:{&quot;anti-csrftoken-a2z&quot;:&quot;hHBwllskaYQrylaW9ifYQIdmqBZOtGdKro0TWb5kDoPKAAAAAGhEMhsAAAAB&quot;}, &quot;name&quot;:&quot;glow-modal&quot;, &quot;url&quot;:&quot;/portal-migration/hz/glow/get-rendered-address-selections?deviceType=desktop&amp;pageType=Gateway&amp;storeContext=NoStoreName&amp;actionSource=desktop-modal&quot;, &quot;footer&quot;:&quot;<span class=\&quot;a-declarative\&quot; data-action=\&quot;a-popover-close\&quot; data-a-popover-close=\&quot;{}\&quot;><span class=\&quot;a-button a-button-primary\&quot;><span class=\&quot;a-button-inner\&quot;><button name=\&quot;glowDoneButton\&quot; class=\&quot;a-button-text\&quot; type=\&quot;button\&quot;>Done</button></span></span></span>&quot;,&quot;header&quot;:&quot;Choose your location&quot;}"
data-action="a-modal">
<a id="nav-global-location-popover-link" role="button" tabindex="0"
class="nav-a nav-a-2 a-popover-trigger a-declarative nav-progressive-attribute" href="">
<div class="nav-sprite nav-progressive-attribute" id="nav-packard-glow-loc-icon"></div>
<div id="glow-ingress-block">
<span class="nav-line-1 nav-progressive-content" id="glow-ingress-line1">
Deliver to
</span>
<span class="nav-line-2 nav-progressive-content" id="glow-ingress-line2">
Malaysia
</span>
</div>
</a>
</span>
<input data-addnewaddress="add-new" id="unifiedLocation1ClickAddress" name="dropdown-selection"
type="hidden" value="add-new" class="nav-progressive-attribute">
<input data-addnewaddress="add-new" id="ubbShipTo" name="dropdown-selection-ubb" type="hidden"
value="add-new" class="nav-progressive-attribute">
<input id="glowValidationToken" name="glow-validation-token" type="hidden"
value="hHBwllskaYQrylaW9ifYQIdmqBZOtGdKro0TWb5kDoPKAAAAAGhEMhsAAAAB" class="nav-progressive-attribute">
<input id="glowDestinationType" name="glow-destination-type" type="hidden" value="COUNTRY"
class="nav-progressive-attribute">
</div>
<div id="nav-global-location-toaster-script-container" class="nav-progressive-content">
<!-- NAVYAAN-GLOW-NAV-TOASTER -->
<script>
P.when('glow-toaster-strings').execute(function (S) {
S.load({ "glow-toaster-address-change-error": "An error has occurred and the address has not been updated. Please try again.", "glow-toaster-unknown-error": "An error has occurred. Please try again." });
});
</script>
<script>
P.when('glow-toaster-manager').execute(function (M) {
M.create({ "pageType": "Gateway", "aisTransitionState": null, "rancorLocationSource": "REALM_DEFAULT" })
});
</script>
</div>
</div>
<div class="nav-fill" id="nav-fill-search">
<script type="text/javascript">window.navmet.tmp = +new Date();</script>
<div id="nav-search">
<div id="nav-bar-left"></div>
<form id="nav-search-bar-form" accept-charset="utf-8" action="/s/ref=nb_sb_noss_1"
class="nav-searchbar nav-progressive-attribute" method="GET" name="site-search" role="search">
<div class="nav-left">
<div id="nav-search-dropdown-card">
<div class="nav-search-scope nav-sprite">
<div class="nav-search-facade" data-value="search-alias=aps">
<span id="nav-search-label-id" class="nav-search-label nav-progressive-content"
style="width: auto;">All</span>
<i class="nav-icon"></i>
</div>
<label id="searchDropdownDescription" for="searchDropdownBox"
class="nav-progressive-attribute" style="display:none">Select the department you want to
search in</label>
<select aria-describedby="searchDropdownDescription"
class="nav-search-dropdown searchSelect nav-progressive-attrubute nav-progressive-search-dropdown"
data-nav-digest="k+fyIAyB82R9jVEmroQ0OWwSW3A=" data-nav-selected="0"
id="searchDropdownBox" name="url" style="display: block; top: 2.5px;" tabindex="0"
title="Search in">
<option selected="selected" value="search-alias=aps">All Departments</option>
<option value="search-alias=arts-crafts-intl-ship">Arts &amp; Crafts</option>
<option value="search-alias=automotive-intl-ship">Automotive</option>
<option value="search-alias=baby-products-intl-ship">Baby</option>
<option value="search-alias=beauty-intl-ship">Beauty &amp; Personal Care</option>
<option value="search-alias=stripbooks-intl-ship">Books</option>
<option value="search-alias=fashion-boys-intl-ship">Boys' Fashion</option>
<option value="search-alias=computers-intl-ship">Computers</option>
<option value="search-alias=deals-intl-ship">Deals</option>
<option value="search-alias=digital-music">Digital Music</option>
<option value="search-alias=electronics-intl-ship">Electronics</option>
<option value="search-alias=fashion-girls-intl-ship">Girls' Fashion</option>
<option value="search-alias=hpc-intl-ship">Health &amp; Household</option>
<option value="search-alias=kitchen-intl-ship">Home &amp; Kitchen</option>
<option value="search-alias=industrial-intl-ship">Industrial &amp; Scientific</option>
<option value="search-alias=digital-text">Kindle Store</option>
<option value="search-alias=luggage-intl-ship">Luggage</option>
<option value="search-alias=fashion-mens-intl-ship">Men's Fashion</option>
<option value="search-alias=movies-tv-intl-ship">Movies &amp; TV</option>
<option value="search-alias=music-intl-ship">Music, CDs &amp; Vinyl</option>
<option value="search-alias=pets-intl-ship">Pet Supplies</option>
<option value="search-alias=instant-video">Prime Video</option>
<option value="search-alias=software-intl-ship">Software</option>
<option value="search-alias=sporting-intl-ship">Sports &amp; Outdoors</option>
<option value="search-alias=tools-intl-ship">Tools &amp; Home Improvement</option>
<option value="search-alias=toys-and-games-intl-ship">Toys &amp; Games</option>
<option value="search-alias=videogames-intl-ship">Video Games</option>
<option value="search-alias=fashion-womens-intl-ship">Women's Fashion</option>
</select>
</div>
</div>
</div>
<div class="nav-fill">
<div class="nav-search-field ">
<label for="twotabsearchtextbox" style="display: none;">Search Amazon</label>
<input type="text" id="twotabsearchtextbox" value="" name="field-keywords" autocomplete="off"
placeholder="Search Amazon" class="nav-input nav-progressive-attribute" dir="auto"
tabindex="0" aria-label="Search Amazon" role="searchbox" aria-autocomplete="list"
aria-controls="sac-autocomplete-results-container" aria-expanded="false"
aria-haspopup="grid" spellcheck="false">
</div>
<div id="nav-iss-attach"></div>
</div>
<div class="nav-right">
<div class="nav-search-submit nav-sprite">
<span id="nav-search-submit-text"
class="nav-search-submit-text nav-sprite nav-progressive-attribute" aria-label="Go">
<input id="nav-search-submit-button" type="submit"
class="nav-input nav-progressive-attribute" value="Go" tabindex="0">
</span>
</div>
</div>
<input type="hidden" id="isscrid" name="crid" value="15O5T5OCG5OZE"><input type="hidden" id="issprefix"
name="sprefix" value="r2d2,aps,588">
</form>
</div>
<script
type="text/javascript">window.navmet.push({ key: 'Search', end: +new Date(), begin: window.navmet.tmp });</script>
</div>
<div class="nav-right">
<script type="text/javascript">window.navmet.tmp = +new Date();</script>
<div id="nav-tools" class="layoutToolbarPadding">
<div class="nav-div" id="icp-nav-flyout">
<a href="/customer-preferences/edit?ie=UTF8&amp;preferencesReturnUrl=%2F&amp;ref_=topnav_lang_ais"
class="nav-a nav-a-2 icp-link-style-2" aria-label="Choose a language for shopping in Amazon United States. The current selection is English (EN).
">
<span class="icp-nav-link-inner">
<span class="nav-line-1">
</span>
<span class="nav-line-2">
<span class="icp-nav-flag icp-nav-flag-us icp-nav-flag-lop" role="img"
aria-label="United States"></span>
<div>EN</div>
</span>
</span>
</a>
<button class="nav-flyout-button nav-icon nav-arrow" aria-label="Expand to Change Language or Country"
tabindex="0" style="visibility: visible;"></button>
</div>
<div class="nav-div" id="nav-link-accountList">
<a href="https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&amp;openid.return_to=https%3A%2F%2Fwww.amazon.com%2F%3Fref_%3Dnav_ya_signin&amp;openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.assoc_handle=usflex&amp;openid.mode=checkid_setup&amp;openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&amp;openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0"
class="nav-a nav-a-2 nav-progressive-attribute" data-nav-ref="nav_ya_signin"
data-nav-role="signin" data-ux-jq-mouseenter="true" tabindex="0" data-csa-c-type="link"
data-csa-c-slot-id="nav-link-accountList" data-csa-c-content-id="nav_ya_signin"
aria-controls="nav-flyout-accountList" data-csa-c-id="37vs0l-z575id-52hnw3-x34ncp">
<div class="nav-line-1-container"><span id="nav-link-accountList-nav-line-1"
class="nav-line-1 nav-progressive-content">Hello, sign in</span></div>
<span class="nav-line-2 ">Account &amp; Lists
</span>
</a>
<button class="nav-flyout-button nav-icon nav-arrow" aria-label="Expand Account and Lists" tabindex="0"
style="visibility: visible;"></button>
</div>
<a href="/gp/css/order-history?ref_=nav_orders_first" class="nav-a nav-a-2 nav-progressive-attribute"
id="nav-orders" tabindex="0">
<span class="nav-line-1">Returns</span>
<span class="nav-line-2">&amp; Orders<span class="nav-icon nav-arrow"></span></span>
</a>
<a href="/gp/cart/view.html?ref_=nav_cart" aria-label="0 items in cart"
class="nav-a nav-a-2 nav-progressive-attribute" id="nav-cart">
<div id="nav-cart-count-container">
<span id="nav-cart-count" aria-hidden="true"
class="nav-cart-count nav-cart-0 nav-progressive-attribute nav-progressive-content">0</span>
<span class="nav-cart-icon nav-sprite"></span>
</div>
<div id="nav-cart-text-container" class=" nav-progressive-attribute">
<span aria-hidden="true" class="nav-line-1">
</span>
<span aria-hidden="true" class="nav-line-2">
Cart
<span class="nav-icon nav-arrow"></span>
</span>
</div>
</a>
</div>
<script
type="text/javascript">window.navmet.push({ key: 'Tools', end: +new Date(), begin: window.navmet.tmp });</script>
</div>
</div>


@@ -0,0 +1,206 @@
<div class="sg-col-inner">
<div cel_widget_id="MAIN-SEARCH_RESULTS-2"
class="s-widget-container s-spacing-small s-widget-container-height-small celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results_1"
data-csa-c-pos="1" data-csa-c-item-id="amzn1.asin.1.B081XSYZMS" data-csa-op-log-render="" data-csa-c-type="item"
data-csa-c-id="dp9zuy-vyww1v-brlmmq-fmgitb" data-cel-widget="MAIN-SEARCH_RESULTS-2">
<div data-component-type="s-impression-logger"
data-component-props="{&quot;percentageShownToFire&quot;:&quot;50&quot;,&quot;batchable&quot;:true,&quot;requiredElementSelector&quot;:&quot;.s-image:visible&quot;,&quot;url&quot;:&quot;https://unagi-na.amazon.com/1/events/com.amazon.eel.SponsoredProductsEventTracking.prod?qualifier=1749299833&amp;id=1740514893473797&amp;widgetName=sp_atf&amp;adId=200067648802798&amp;eventType=1&amp;adIndex=0&quot;}"
class="rush-component s-expand-height" data-component-id="6">
<div data-component-type="s-impression-counter"
data-component-props="{&quot;presenceCounterName&quot;:&quot;sp_delivered&quot;,&quot;testElementSelector&quot;:&quot;.s-image&quot;,&quot;hiddenCounterName&quot;:&quot;sp_hidden&quot;}"
class="rush-component s-featured-result-item s-expand-height" data-component-id="7">
<span class="a-declarative" data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw" data-action="puis-card-container-declarative"
data-csa-c-func-deps="aui-da-puis-card-container-declarative"
data-csa-c-item-id="amzn1.asin.B081XSYZMS" data-csa-c-posx="1" data-csa-c-type="item"
data-csa-c-owner="puis" data-csa-c-id="88w0j1-kcbf5g-80v4i9-96cv88">
<div class="puis-card-container s-card-container s-overflow-hidden aok-relative puis-expand-height puis-include-content-margin puis puis-v2dwi5hq8xzthf26x0gg1mcl2oj s-latency-cf-section puis-card-border"
data-cy="asin-faceout-container">
<div class="a-section a-spacing-base">
<div class="s-product-image-container aok-relative s-text-center s-image-overlay-grey puis-image-overlay-grey s-padding-left-small s-padding-right-small puis-spacing-small s-height-equalized puis puis-v2dwi5hq8xzthf26x0gg1mcl2oj"
data-cy="image-container" style="padding-top: 0px !important;"><span
data-component-type="s-product-image" class="rush-component"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw"><a aria-hidden="true"
class="a-link-normal s-no-outline" tabindex="-1"
href="/sspa/click?ie=UTF8&amp;spc=MToxNzQwNTE0ODkzNDczNzk3OjE3NDkyOTk4MzM6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&amp;url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fcrid%3D3C1EXMXN59Q9G%26dib%3DeyJ2IjoiMSJ9.7tBl5bhZh59L9qIPZUe9SLa2fy_HvzboxuQxvrRcAc0VUXayi9fxQFsMLyFplDE9vMkIJbP76AVpa-5-fxhNza3DqhX4tss4NlB49WPi_dA00Hw6O8qK5pDzdetYlhGgOyXOLBe7mTG9oJ5W0wcvQhEVoX9mpJk_SGeqRLWGA0dBSjYCZtiyrY8_B-DP53S7fbYwiSYtq-g7sQDXKVadRpGvUyKq7yxA0SLsU42uvoqSGb0qcd6udL1wbnTEkKmwNjNSb7xIUb-8PyE7DTPMt1ScJksn70sFQMJNkM2aK5M.x9_jYvKPnSibV1d0umUStZBxlSTSXrzVIFKqFzS8c-U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749299833%26sprefix%3Dr2d2%252Caps%252C548%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1">
<div class="a-section aok-relative s-image-square-aspect"><img class="s-image"
src="https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL320_.jpg"
srcset="https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL320_.jpg 1x, https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL480_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL640_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL800_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61kAC69zQUL._AC_UL960_FMwebp_QL65_.jpg 3x"
alt="Sponsored Ad - Death Star BB8 R2D2 Golf Balls with 20 Printed tees • Great Gift IDEA from Moms, DADS and Kids -"
aria-hidden="true" data-image-index="1" data-image-load=""
data-image-latency="s-product-image" data-image-source-density="1">
</div>
</a></span></div>
<div class="a-section a-spacing-small puis-padding-left-small puis-padding-right-small">
<div data-cy="title-recipe"
class="a-section a-spacing-none a-spacing-top-small s-title-instructions-style">
<div class="a-row a-spacing-micro"><span class="a-declarative"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw" data-action="a-popover"
data-csa-c-func-deps="aui-da-a-popover"
data-a-popover="{&quot;name&quot;:&quot;sp-info-popover-B081XSYZMS&quot;,&quot;position&quot;:&quot;triggerVertical&quot;,&quot;popoverLabel&quot;:&quot;View Sponsored information or leave ad feedback&quot;,&quot;closeButtonLabel&quot;:&quot;Close popup&quot;,&quot;closeButton&quot;:&quot;true&quot;,&quot;dataStrategy&quot;:&quot;preload&quot;}"
data-csa-c-type="widget" data-csa-c-id="wqddan-z1l67e-lissct-rciw65"><a
href="javascript:void(0)" role="button" style="text-decoration: none;"
class="puis-label-popover puis-sponsored-label-text"><span
class="puis-label-popover-default"><span
aria-label="View Sponsored information or leave ad feedback"
class="a-color-secondary">Sponsored</span></span><span
class="puis-label-popover-hover"><span aria-hidden="true"
class="a-color-base">Sponsored</span></span> <span
class="aok-inline-block puis-sponsored-label-info-icon"></span></a></span>
<div class="a-popover-preload" id="a-popover-sp-info-popover-B081XSYZMS">
<div class="puis puis-v2dwi5hq8xzthf26x0gg1mcl2oj"><span>You're seeing this
ad based on the product's relevance to your search query.</span>
<div class="a-row"><span class="a-declarative"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw"
data-action="s-safe-ajax-modal-trigger"
data-csa-c-func-deps="aui-da-s-safe-ajax-modal-trigger"
data-s-safe-ajax-modal-trigger="{&quot;header&quot;:&quot;Leave feedback&quot;,&quot;dataStrategy&quot;:&quot;ajax&quot;,&quot;ajaxUrl&quot;:&quot;/af/sp-loom/feedback-form?pl=%7B%22adPlacementMetaData%22%3A%7B%22searchTerms%22%3A%22cjJkMg%3D%3D%22%2C%22pageType%22%3A%22Search%22%2C%22feedbackType%22%3A%22sponsoredProductsLoom%22%2C%22slotName%22%3A%22TOP%22%7D%2C%22adCreativeMetaData%22%3A%7B%22adProgramId%22%3A1024%2C%22adCreativeDetails%22%3A%5B%7B%22asin%22%3A%22B081XSYZMS%22%2C%22title%22%3A%22Death+Star+BB8+R2D2+Golf+Balls+with+20+Printed+tees+%E2%80%A2+Great+Gift+IDEA+from+Moms%2C+DADS+and+Kids+-%22%2C%22priceInfo%22%3A%7B%22amount%22%3A29.95%2C%22currencyCode%22%3A%22USD%22%7D%2C%22sku%22%3A%22starwars3pk20tees%22%2C%22adId%22%3A%22A03790291PREH7M3Q3SVS%22%2C%22campaignId%22%3A%22A01050612Q0SQZ2PTMGO9%22%2C%22advertiserIdNS%22%3Anull%2C%22selectionSignals%22%3Anull%7D%5D%7D%7D&quot;}"
data-csa-c-type="widget"
data-csa-c-id="ygslsp-ir23ei-7k9x6z-73l1tp"><a
class="a-link-normal s-underline-text s-underline-link-text s-link-style"
href="#"><span>Leave ad feedback</span> </a> </span></div>
</div>
</div>
</div><a class="a-link-normal s-line-clamp-4 s-link-style a-text-normal"
href="/sspa/click?ie=UTF8&amp;spc=MToxNzQwNTE0ODkzNDczNzk3OjE3NDkyOTk4MzM6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&amp;url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fcrid%3D3C1EXMXN59Q9G%26dib%3DeyJ2IjoiMSJ9.7tBl5bhZh59L9qIPZUe9SLa2fy_HvzboxuQxvrRcAc0VUXayi9fxQFsMLyFplDE9vMkIJbP76AVpa-5-fxhNza3DqhX4tss4NlB49WPi_dA00Hw6O8qK5pDzdetYlhGgOyXOLBe7mTG9oJ5W0wcvQhEVoX9mpJk_SGeqRLWGA0dBSjYCZtiyrY8_B-DP53S7fbYwiSYtq-g7sQDXKVadRpGvUyKq7yxA0SLsU42uvoqSGb0qcd6udL1wbnTEkKmwNjNSb7xIUb-8PyE7DTPMt1ScJksn70sFQMJNkM2aK5M.x9_jYvKPnSibV1d0umUStZBxlSTSXrzVIFKqFzS8c-U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749299833%26sprefix%3Dr2d2%252Caps%252C548%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1">
<h2 aria-label="Sponsored Ad - Death Star BB8 R2D2 Golf Balls with 20 Printed tees • Great Gift IDEA from Moms, DADS and Kids -"
class="a-size-base-plus a-spacing-none a-color-base a-text-normal">
<span>Death Star BB8 R2D2 Golf Balls with 20 Printed tees • Great Gift IDEA
from Moms, DADS and Kids -</span></h2>
</a>
</div>
<div data-cy="reviews-block" class="a-section a-spacing-none a-spacing-top-micro">
<div class="a-row a-size-small"><span class="a-declarative"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw" data-action="a-popover"
data-csa-c-func-deps="aui-da-a-popover"
data-a-popover="{&quot;position&quot;:&quot;triggerBottom&quot;,&quot;popoverLabel&quot;:&quot;4.7 out of 5 stars, rating details&quot;,&quot;url&quot;:&quot;/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B081XSYZMS&amp;ref_=acr_search__popover&amp;contextId=search&quot;,&quot;closeButton&quot;:true,&quot;closeButtonLabel&quot;:&quot;&quot;}"
data-csa-c-type="widget" data-csa-c-id="oykdvt-8s1ebj-2kegf2-7ii7tp"><a
aria-label="4.7 out of 5 stars, rating details"
href="javascript:void(0)" role="button"
class="a-popover-trigger a-declarative"><i
data-cy="reviews-ratings-slot" aria-hidden="true"
class="a-icon a-icon-star-small a-star-small-4-5"><span
class="a-icon-alt">4.7 out of 5 stars</span></i><i
class="a-icon a-icon-popover"></i></a></span> <span
data-component-type="s-client-side-analytics" class="rush-component"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw" data-component-id="8">
<div style="display: inline-block"
class="s-csa-instrumentation-wrapper alf-search-csa-instrumentation-wrapper"
data-csa-c-type="alf-af-component"
data-csa-c-content-id="alf-customer-ratings-count-component"
data-csa-c-slot-id="alf-reviews" data-csa-op-log-render=""
data-csa-c-layout="GRID" data-csa-c-asin="B081XSYZMS"
data-csa-c-id="6l5wc4-ngelan-hd9x4t-d4a2k7"><a aria-label="184 ratings"
class="a-link-normal s-underline-text s-underline-link-text s-link-style"
href="/sspa/click?ie=UTF8&amp;spc=MToxNzQwNTE0ODkzNDczNzk3OjE3NDkyOTk4MzM6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&amp;url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fcrid%3D3C1EXMXN59Q9G%26dib%3DeyJ2IjoiMSJ9.7tBl5bhZh59L9qIPZUe9SLa2fy_HvzboxuQxvrRcAc0VUXayi9fxQFsMLyFplDE9vMkIJbP76AVpa-5-fxhNza3DqhX4tss4NlB49WPi_dA00Hw6O8qK5pDzdetYlhGgOyXOLBe7mTG9oJ5W0wcvQhEVoX9mpJk_SGeqRLWGA0dBSjYCZtiyrY8_B-DP53S7fbYwiSYtq-g7sQDXKVadRpGvUyKq7yxA0SLsU42uvoqSGb0qcd6udL1wbnTEkKmwNjNSb7xIUb-8PyE7DTPMt1ScJksn70sFQMJNkM2aK5M.x9_jYvKPnSibV1d0umUStZBxlSTSXrzVIFKqFzS8c-U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749299833%26sprefix%3Dr2d2%252Caps%252C548%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1#customerReviews"><span
aria-hidden="true"
class="a-size-base s-underline-text">184</span> </a> </div>
</span></div>
<div class="a-row a-size-base"><span class="a-size-base a-color-secondary">50+
bought in past month</span></div>
</div>
<div data-cy="price-recipe"
class="a-section a-spacing-none a-spacing-top-small s-price-instructions-style">
<div class="a-row a-size-base a-color-base">
<div class="a-row"><span id="price-link" class="aok-offscreen">Price, product
page</span><a aria-describedby="price-link"
class="a-link-normal s-no-hover s-underline-text s-underline-link-text s-link-style a-text-normal"
href="/sspa/click?ie=UTF8&amp;spc=MToxNzQwNTE0ODkzNDczNzk3OjE3NDkyOTk4MzM6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&amp;url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fcrid%3D3C1EXMXN59Q9G%26dib%3DeyJ2IjoiMSJ9.7tBl5bhZh59L9qIPZUe9SLa2fy_HvzboxuQxvrRcAc0VUXayi9fxQFsMLyFplDE9vMkIJbP76AVpa-5-fxhNza3DqhX4tss4NlB49WPi_dA00Hw6O8qK5pDzdetYlhGgOyXOLBe7mTG9oJ5W0wcvQhEVoX9mpJk_SGeqRLWGA0dBSjYCZtiyrY8_B-DP53S7fbYwiSYtq-g7sQDXKVadRpGvUyKq7yxA0SLsU42uvoqSGb0qcd6udL1wbnTEkKmwNjNSb7xIUb-8PyE7DTPMt1ScJksn70sFQMJNkM2aK5M.x9_jYvKPnSibV1d0umUStZBxlSTSXrzVIFKqFzS8c-U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749299833%26sprefix%3Dr2d2%252Caps%252C548%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1"><span
class="a-price" data-a-size="xl" data-a-color="base"><span
class="a-offscreen">$29.95</span><span aria-hidden="true"><span
class="a-price-symbol">$</span><span
class="a-price-whole">29<span
class="a-price-decimal">.</span></span><span
class="a-price-fraction">95</span></span></span></a></div>
<div class="a-row"></div>
</div>
</div>
<div data-cy="delivery-recipe" class="a-section a-spacing-none a-spacing-top-micro">
<div class="a-row a-size-base a-color-secondary s-align-children-center"><span
aria-label="FREE delivery Thu, Jun 19 to Malaysia on $49 of eligible items"><span
class="a-color-base">FREE delivery </span><span
class="a-color-base a-text-bold">Thu, Jun 19 </span><span
class="a-color-base">to Malaysia on $49 of eligible items</span></span>
</div>
</div>
<div data-cy="certification-recipe"
class="a-section a-spacing-none a-spacing-top-micro">
<div class="a-row">
<div class="a-section a-spacing-none s-align-children-center">
<div class="a-section a-spacing-none s-pc-faceout-container">
<div>
<div class="s-align-children-center"><span class="a-declarative"
data-version-id="v2dwi5hq8xzthf26x0gg1mcl2oj"
data-render-id="r3o8bgr5zt3kmy2jv4su6fn4kyw"
data-action="s-pc-sidesheet-open"
data-csa-c-func-deps="aui-da-s-pc-sidesheet-open"
data-s-pc-sidesheet-open="{&quot;preloadDomId&quot;:&quot;pc-side-sheet-B081XSYZMS&quot;,&quot;popoverLabel&quot;:&quot;Product certifications&quot;,&quot;interactLoggingMetricsList&quot;:[&quot;provenanceCertifications_desktop_sbe_badge&quot;],&quot;closeButtonLabel&quot;:&quot;Close popup&quot;,&quot;dwellMetric&quot;:&quot;provenanceCertifications_desktop_sbe_badge_t&quot;}"
data-csa-c-type="widget"
data-csa-c-id="hdfxi6-bjlgup-5dql15-88t9ao"><a
data-cy="s-pc-faceout-badge"
class="a-link-normal s-no-underline s-pc-badge s-align-children-center aok-block"
href="javascript:void(0)" role="button">
<div
class="a-section s-pc-attribute-pill-text s-margin-bottom-none s-margin-bottom-none aok-block s-pc-certification-faceout">
<span class="faceout-image-view"></span><img alt=""
src="https://m.media-amazon.com/images/I/111mHoVK0kL._SS200_.png"
class="s-image" height="18px" width="18px">
<span class="a-size-base a-color-base">Small
Business</span>
<div
class="s-margin-bottom-none s-pc-sidesheet-chevron aok-nowrap">
<i class="a-icon a-icon-popover aok-align-center"
role="presentation"></i></div>
</div>
</a></span></div>
</div>
</div>
</div>
<div id="pc-side-sheet-B081XSYZMS"
class="a-section puis puis-v2dwi5hq8xzthf26x0gg1mcl2oj aok-hidden">
<div class="a-section s-pc-container-side-sheet">
<div class="s-align-children-center a-spacing-small">
<div class="s-align-children-center s-pc-certification"
role="heading" aria-level="2"><span
class="faceout-image-view"></span>
<div alt="" style="height: 24px; width: 24px;"
class="a-image-wrapper a-lazy-loaded a-manually-loaded s-image"
data-a-image-source="https://m.media-amazon.com/images/I/111mHoVK0kL._SS200_.png">
<noscript><img alt=""
src="https://m.media-amazon.com/images/I/111mHoVK0kL._SS200_.png"
height="24px" width="24px" /></noscript></div> <span
class="a-size-medium-plus a-color-base a-text-bold">Small
Business</span>
</div>
</div>
<div class="a-spacing-medium s-pc-link-container"><span
class="a-size-base a-color-secondary">Shop products from small
business brands sold in Amazon's store. Discover more about the
small businesses partnering with Amazon and Amazon's commitment
to empowering them.</span> <a
class="a-size-base a-link-normal s-link-style"
href="https://www.amazon.com/b/ref=s9_acss_bw_cg_sbp22c_1e1_w/ref=SBE_navbar_5?pf_rd_r=6W5X52VNZRB7GK1E1VX2&amp;pf_rd_p=56621c3d-cff4-45e1-9bf4-79bbeb8006fc&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_s=merchandised-search-top-3&amp;pf_rd_t=30901&amp;pf_rd_i=17879387011&amp;node=18018208011">Learn
more</a> </div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</div>
</div>


@@ -0,0 +1,217 @@
"""
C4A-Script API Usage Examples
Shows how to use the new Result-based API in various scenarios
"""
from c4a_compile import compile, validate, compile_file
from c4a_result import CompilationResult, ValidationResult
import json
print("C4A-Script API Usage Examples")
print("=" * 80)
# Example 1: Basic compilation
print("\n1. Basic Compilation")
print("-" * 40)
script = """
GO https://example.com
WAIT 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
REPEAT (SCROLL DOWN 300, 3)
"""
result = compile(script)
print(f"Success: {result.success}")
print(f"Statements generated: {len(result.js_code) if result.js_code else 0}")
# Example 2: Error handling
print("\n\n2. Error Handling")
print("-" * 40)
error_script = """
GO https://example.com
IF (EXISTS `.modal`) CLICK `.close`
undefined_procedure
"""
result = compile(error_script)
if not result.success:
    # Access error details
    error = result.first_error
    print(f"Error on line {error.line}: {error.message}")
    print(f"Error code: {error.code}")
    # Show suggestions if available
    if error.suggestions:
        print("Suggestions:")
        for suggestion in error.suggestions:
            print(f" - {suggestion.message}")
# Example 3: Validation only
print("\n\n3. Validation (no code generation)")
print("-" * 40)
validation_script = """
PROC validate_form
IF (EXISTS `#email`) THEN TYPE "test@example.com"
PRESS Tab
ENDPROC
validate_form
"""
validation = validate(validation_script)
print(f"Valid: {validation.valid}")
if validation.errors:
    print(f"Errors found: {len(validation.errors)}")
# Example 4: JSON output for UI
print("\n\n4. JSON Output for UI Integration")
print("-" * 40)
ui_script = """
CLICK button.submit
"""
result = compile(ui_script)
if not result.success:
    # Get JSON for UI
    error_json = result.to_dict()
    print("Error data for UI:")
    print(json.dumps(error_json["errors"][0], indent=2))
# Example 5: File compilation
print("\n\n5. File Compilation")
print("-" * 40)
# Create a test file
test_file = "test_script.c4a"
with open(test_file, "w") as f:
    f.write("""
GO https://example.com
WAIT `.content` 5
CLICK `.main-button`
""")
result = compile_file(test_file)
print(f"File compilation: {'Success' if result.success else 'Failed'}")
if result.success:
    print(f"Generated {len(result.js_code)} JavaScript statements")
# Clean up
import os
os.remove(test_file)
# Example 6: Batch processing
print("\n\n6. Batch Processing Multiple Scripts")
print("-" * 40)
scripts = [
"GO https://example1.com\nCLICK `.button`",
"GO https://example2.com\nWAIT 2",
"GO https://example3.com\nINVALID_CMD"
]
results = []
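for i, script in enumerate(scripts, 1):
    result = compile(script)
    results.append(result)
    # Status markers were lost in extraction; restored with the emoji used elsewhere in these examples
    status = "✅" if result.success else "❌"
    print(f"Script {i}: {status}")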
# Summary
successful = sum(1 for r in results if r.success)
print(f"\nBatch result: {successful}/{len(scripts)} successful")
# Example 7: Custom error formatting
print("\n\n7. Custom Error Formatting")
print("-" * 40)
def format_error_for_ide(error):
    """Format error for IDE integration"""
    return f"{error.source_line}:{error.line}:{error.column}: {error.type.value}: {error.message} [{error.code}]"
error_script = "IF EXISTS `.button` THEN CLICK `.button`"
result = compile(error_script)
if not result.success:
    error = result.first_error
    print("IDE format:", format_error_for_ide(error))
    print("Simple format:", error.simple_message)
    print("Full format:", error.formatted_message)
# Example 8: Working with warnings (future feature)
print("\n\n8. Handling Warnings")
print("-" * 40)
# In the future, we might have warnings
result = compile("GO https://example.com\nWAIT 100") # Very long wait
print(f"Success: {result.success}")
print(f"Warnings: {len(result.warnings)}")
# Example 9: Metadata usage
print("\n\n9. Using Metadata")
print("-" * 40)
complex_script = """
PROC helper1
CLICK `.btn1`
ENDPROC
PROC helper2
CLICK `.btn2`
ENDPROC
GO https://example.com
helper1
helper2
"""
result = compile(complex_script)
if result.success:
    print("Script metadata:")
    for key, value in result.metadata.items():
        print(f" {key}: {value}")
# Example 10: Integration patterns
print("\n\n10. Integration Patterns")
print("-" * 40)
# Web API endpoint simulation
def api_compile(request_body):
    """Simulate API endpoint"""
    script = request_body.get("script", "")
    result = compile(script)
    response = {
        "status": "success" if result.success else "error",
        "data": result.to_dict()
    }
    return response
# CLI tool simulation
def cli_compile(script, output_format="text"):
    """Simulate CLI tool"""
    result = compile(script)
    if output_format == "json":
        return result.to_json()
    elif output_format == "simple":
        if result.success:
            return f"OK: {len(result.js_code)} statements"
        else:
            return f"ERROR: {result.first_error.simple_message}"
    else:
        return str(result)
# Test the patterns
api_response = api_compile({"script": "GO https://example.com"})
print(f"API response status: {api_response['status']}")
cli_output = cli_compile("WAIT 2", "simple")
print(f"CLI output: {cli_output}")
print("\n" + "=" * 80)
print("All examples completed successfully!")


@@ -0,0 +1,53 @@
"""
C4A-Script Hello World
A concise example showing how to use the C4A-Script compiler
"""
from c4a_compile import compile
# Define your C4A-Script
script = """
GO https://example.com
WAIT `#content` 5
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
CLICK `button.submit`
"""
# Compile the script
result = compile(script)
# Check if compilation was successful
if result.success:
    # Success! Use the generated JavaScript
    print("✅ Compilation successful!")
    print(f"Generated {len(result.js_code)} JavaScript statements:\n")
    for i, js in enumerate(result.js_code, 1):
        print(f"{i}. {js}\n")
    # In real usage, you'd pass result.js_code to Crawl4AI:
    # config = CrawlerRunConfig(js_code=result.js_code)
else:
    # Error! Handle the compilation error
    print("❌ Compilation failed!")
    # Get the first error (there might be multiple)
    error = result.first_error
    # Show error details
    print(f"Error at line {error.line}, column {error.column}")
    print(f"Message: {error.message}")
    # Show the problematic code
    print(f"\nCode: {error.source_line}")
    print(" " * (6 + error.column) + "^")
    # Show suggestions if available
    if error.suggestions:
        print("\n💡 How to fix:")
        for suggestion in error.suggestions:
            print(f" {suggestion.message}")
    # For debugging or logging, you can also get JSON
    # error_json = result.to_json()


@@ -0,0 +1,53 @@
"""
C4A-Script Hello World - Error Example
Shows how error handling works
"""
from c4a_compile import compile
# Define a script with an error (missing THEN)
script = """
GO https://example.com
WAIT `#content` 5
IF (EXISTS `.cookie-banner`) CLICK `.accept`
CLICK `button.submit`
"""
# Compile the script
result = compile(script)
# Check if compilation was successful
if result.success:
    # Success! Use the generated JavaScript
    print("✅ Compilation successful!")
    print(f"Generated {len(result.js_code)} JavaScript statements:\n")
    for i, js in enumerate(result.js_code, 1):
        print(f"{i}. {js}\n")
    # In real usage, you'd pass result.js_code to Crawl4AI:
    # config = CrawlerRunConfig(js_code=result.js_code)
else:
    # Error! Handle the compilation error
    print("❌ Compilation failed!")
    # Get the first error (there might be multiple)
    error = result.first_error
    # Show error details
    print(f"Error at line {error.line}, column {error.column}")
    print(f"Message: {error.message}")
    # Show the problematic code
    print(f"\nCode: {error.source_line}")
    print(" " * (6 + error.column) + "^")
    # Show suggestions if available
    if error.suggestions:
        print("\n💡 How to fix:")
        for suggestion in error.suggestions:
            print(f" {suggestion.message}")
    # For debugging or logging, you can also get JSON
    # error_json = result.to_json()


@@ -0,0 +1,285 @@
"""
Demonstration of C4A-Script integration with Crawl4AI
Shows various use cases and features
"""
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import c4a_compile, CompilationResult
async def example_basic_usage():
    """Basic C4A-Script usage with Crawl4AI"""
    print("\n" + "="*60)
    print("Example 1: Basic C4A-Script Usage")
    print("="*60)
    # Define your automation script
    c4a_script = """
    # Wait for page to load
    WAIT `body` 2
    # Handle cookie banner if present
    IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-btn`
    # Scroll down to load more content
    SCROLL DOWN 500
    WAIT 1
    # Click load more button if exists
    IF (EXISTS `.load-more`) THEN CLICK `.load-more`
    """
    # Create crawler config with C4A script
    config = CrawlerRunConfig(
        url="https://example.com",
        c4a_script=c4a_script,
        wait_for="css:.content",
        verbose=False
    )
    print("✅ C4A Script compiled successfully!")
    print(f"Generated {len(config.js_code)} JavaScript commands")
    # In production, you would run:
    # async with AsyncWebCrawler() as crawler:
    #     result = await crawler.arun(config=config)
async def example_form_filling():
    """Form filling with C4A-Script"""
    print("\n" + "="*60)
    print("Example 2: Form Filling with C4A-Script")
    print("="*60)
    # Form automation script
    form_script = """
    # Set form values
    SET email = "test@example.com"
    SET message = "This is a test message"
    # Fill the form
    CLICK `#email-input`
    TYPE $email
    CLICK `#message-textarea`
    TYPE $message
    # Submit the form
    CLICK `button[type="submit"]`
    # Wait for success message
    WAIT `.success-message` 10
    """
    config = CrawlerRunConfig(
        url="https://example.com/contact",
        c4a_script=form_script
    )
    print("✅ Form filling script ready")
    print("Script will:")
    print(" - Fill email field")
    print(" - Fill message textarea")
    print(" - Submit form")
    print(" - Wait for confirmation")
async def example_dynamic_loading():
    """Handle dynamic content loading"""
    print("\n" + "="*60)
    print("Example 3: Dynamic Content Loading")
    print("="*60)
    # Script for infinite scroll or pagination
    pagination_script = """
    # Initial wait
    WAIT `.product-list` 5
    # Load all products by clicking "Load More" repeatedly
    REPEAT (CLICK `.load-more`, `document.querySelector('.load-more') !== null`)
    # Alternative: Scroll to load (infinite scroll)
    # REPEAT (SCROLL DOWN 1000, `document.querySelectorAll('.product').length < 100`)
    # Extract count
    EVAL `console.log('Products loaded: ' + document.querySelectorAll('.product').length)`
    """
    config = CrawlerRunConfig(
        url="https://example.com/products",
        c4a_script=pagination_script,
        screenshot=True  # Capture final state
    )
    print("✅ Dynamic loading script ready")
    print("Script will load all products by repeatedly clicking 'Load More'")
async def example_multi_step_workflow():
    """Complex multi-step workflow with procedures"""
    print("\n" + "="*60)
    print("Example 4: Multi-Step Workflow with Procedures")
    print("="*60)
    # Complex workflow with reusable procedures
    workflow_script = """
    # Define login procedure
    PROC login
      CLICK `#username`
      TYPE "demo_user"
      CLICK `#password`
      TYPE "demo_pass"
      CLICK `#login-btn`
      WAIT `.dashboard` 10
    ENDPROC
    # Define search procedure
    PROC search_product
      CLICK `.search-box`
      TYPE "laptop"
      PRESS Enter
      WAIT `.search-results` 5
    ENDPROC
    # Main workflow
    GO https://example.com
    login
    search_product
    # Process results
    IF (EXISTS `.no-results`) THEN EVAL `console.log('No products found')`
    ELSE REPEAT (CLICK `.add-to-cart`, 3)
    """
    # Compile to check for errors
    result = c4a_compile(workflow_script)
    if result.success:
        print("✅ Complex workflow compiled successfully!")
        print("Workflow includes:")
        print(" - Login procedure")
        print(" - Product search")
        print(" - Conditional cart additions")
        config = CrawlerRunConfig(
            url="https://example.com",
            c4a_script=workflow_script
        )
    else:
        print("❌ Compilation error:")
        error = result.first_error
        print(f" Line {error.line}: {error.message}")
async def example_error_handling():
    """Demonstrate error handling"""
    print("\n" + "="*60)
    print("Example 5: Error Handling")
    print("="*60)
    # Script with intentional error
    bad_script = """
    WAIT body 2
    CLICK button
    IF (EXISTS .modal) CLICK .close
    """
    try:
        config = CrawlerRunConfig(
            url="https://example.com",
            c4a_script=bad_script
        )
    except ValueError as e:
        print("✅ Error caught as expected:")
        print(f" {e}")
    # Fixed version
    good_script = """
    WAIT `body` 2
    CLICK `button`
    IF (EXISTS `.modal`) THEN CLICK `.close`
    """
    config = CrawlerRunConfig(
        url="https://example.com",
        c4a_script=good_script
    )
    print("\n✅ Fixed script compiled successfully!")
async def example_combining_with_extraction():
    """Combine C4A-Script with extraction strategies"""
    print("\n" + "="*60)
    print("Example 6: C4A-Script + Extraction Strategies")
    print("="*60)
    from crawl4ai import JsonCssExtractionStrategy
    # Script to prepare page for extraction
    prep_script = """
    # Expand all collapsed sections
    REPEAT (CLICK `.expand-btn`, `document.querySelectorAll('.expand-btn:not(.expanded)').length > 0`)
    # Load all comments
    IF (EXISTS `.load-comments`) THEN CLICK `.load-comments`
    WAIT `.comments-section` 5
    # Close any popups
    IF (EXISTS `.popup-close`) THEN CLICK `.popup-close`
    """
    # Define extraction schema
    schema = {
        "name": "article",
        "selector": "article.main",
        "fields": {
            "title": {"selector": "h1", "type": "text"},
            "content": {"selector": ".content", "type": "text"},
            "comments": {
                "selector": ".comment",
                "type": "list",
                "fields": {
                    "author": {"selector": ".author", "type": "text"},
                    "text": {"selector": ".text", "type": "text"}
                }
            }
        }
    }
    config = CrawlerRunConfig(
        url="https://example.com/article",
        c4a_script=prep_script,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:.comments-section"
    )
    print("✅ Combined C4A + Extraction ready")
    print("Workflow:")
    print(" 1. Expand collapsed sections")
    print(" 2. Load comments")
    print(" 3. Extract structured data")
async def main():
"""Run all examples"""
print("\n🚀 C4A-Script + Crawl4AI Integration Demo\n")
# Run all examples
await example_basic_usage()
await example_form_filling()
await example_dynamic_loading()
await example_multi_step_workflow()
await example_error_handling()
await example_combining_with_extraction()
print("\n" + "="*60)
print("✅ All examples completed successfully!")
print("="*60)
print("\nTo run actual crawls, uncomment the AsyncWebCrawler sections")
print("or create your own scripts using these examples as templates.")
if __name__ == "__main__":
asyncio.run(main())
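
Review note on Example 5: the example depends on the compiler rejecting selectors written without backticks and an IF that has no THEN. As a rough illustration of just those two rules, here is a standalone sketch that flags them with plain string checks. `lint_c4a` is a hypothetical helper and not part of crawl4ai; the assumption that WAIT also accepts a bare number of seconds or a quoted string is mine, and the real compiler's grammar remains authoritative.

```python
import re

# Commands whose first argument is expected to be a backtick-quoted selector
# (assumption for this sketch; the real grammar has more commands).
SELECTOR_COMMANDS = {"CLICK", "WAIT"}


def lint_c4a(script: str) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for two common C4A-Script mistakes.

    Lines are numbered from the first non-empty line of the script.
    """
    problems = []
    for lineno, raw in enumerate(script.strip().splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("IF ") and " THEN " not in line:
            problems.append((lineno, "IF without THEN"))
            continue
        parts = line.split(maxsplit=1)
        if parts[0] in SELECTOR_COMMANDS and len(parts) == 2:
            arg = parts[1]
            is_number = re.fullmatch(r"\d+(\.\d+)?", arg) is not None
            if not (arg.startswith("`") or arg.startswith('"') or is_number):
                problems.append(
                    (lineno, f"{parts[0]} selector is not wrapped in backticks")
                )
    return problems


if __name__ == "__main__":
    bad_script = """
    WAIT body 2
    CLICK button
    IF (EXISTS .modal) CLICK .close
    """
    for lineno, message in lint_c4a(bad_script):
        print(f"Line {lineno}: {message}")
```

A check like this could run before building `CrawlerRunConfig` to give faster feedback, but it does not replace compiling the script.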

@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""
Hello World Example: LLM-Generated C4A-Script

This example shows how to use the new generate_script() function to automatically
create C4A-Script automation from natural language descriptions and HTML.
"""
from crawl4ai.script.c4a_compile import C4ACompiler


def main():
    print("🤖 C4A-Script Generation Hello World")
    print("=" * 50)

    # Example 1: Simple login form
    html = """
    <html>
    <body>
        <form id="login">
            <input id="email" type="email" placeholder="Email">
            <input id="password" type="password" placeholder="Password">
            <button id="submit">Login</button>
        </form>
    </body>
    </html>
    """

    goal = "Fill in email 'user@example.com', password 'secret123', and submit the form"

    print("📝 Goal:", goal)
    print("🌐 HTML: Simple login form")
    print()

    # Generate C4A-Script
    print("🔧 Generated C4A-Script:")
    print("-" * 30)
    c4a_script = C4ACompiler.generate_script(
        html=html,
        query=goal,
        mode="c4a"
    )
    print(c4a_script)
    print()

    # Generate JavaScript
    print("🔧 Generated JavaScript:")
    print("-" * 30)
    js_script = C4ACompiler.generate_script(
        html=html,
        query=goal,
        mode="js"
    )
    print(js_script)
    print()

    # Example 2: Simple button click
    html2 = """
    <html>
    <body>
        <div class="content">
            <h1>Welcome!</h1>
            <button id="start-btn" class="primary">Get Started</button>
        </div>
    </body>
    </html>
    """

    goal2 = "Click the 'Get Started' button"

    print("=" * 50)
    print("📝 Goal:", goal2)
    print("🌐 HTML: Simple button")
    print()

    print("🔧 Generated C4A-Script:")
    print("-" * 30)
    c4a_script2 = C4ACompiler.generate_script(
        html=html2,
        query=goal2,
        mode="c4a"
    )
    print(c4a_script2)
    print()

    print("✅ Done! The LLM automatically converted natural language goals")
    print(" into executable automation scripts.")


if __name__ == "__main__":
    main()

@@ -0,0 +1,111 @@
[
{
"repository_name": "unclecode/crawl4ai",
"repository_owner": "unclecode/crawl4ai",
"repository_url": "/unclecode/crawl4ai",
"description": "\ud83d\ude80\ud83e\udd16Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here:https://discord.gg/jP8KfhDhyN",
"primary_language": "Python",
"star_count": "45.1k",
"topics": [],
"last_updated": "23 hours ago"
},
{
"repository_name": "coleam00/mcp-crawl4ai-rag",
"repository_owner": "coleam00/mcp-crawl4ai-rag",
"repository_url": "/coleam00/mcp-crawl4ai-rag",
"description": "Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants",
"primary_language": "Python",
"star_count": "748",
"topics": [],
"last_updated": "yesterday"
},
{
"repository_name": "pdichone/crawl4ai-rag-system",
"repository_owner": "pdichone/crawl4ai-rag-system",
"repository_url": "/pdichone/crawl4ai-rag-system",
"primary_language": "Python",
"star_count": "44",
"topics": [],
"last_updated": "on 21 Jan"
},
{
"repository_name": "weidwonder/crawl4ai-mcp-server",
"repository_owner": "weidwonder/crawl4ai-mcp-server",
"repository_url": "/weidwonder/crawl4ai-mcp-server",
"description": "\u7528\u4e8e\u63d0\u4f9b\u7ed9\u672c\u5730\u5f00\u53d1\u8005\u7684 LLM\u7684\u9ad8\u6548\u4e92\u8054\u7f51\u641c\u7d22&\u5185\u5bb9\u83b7\u53d6\u7684MCP Server\uff0c \u8282\u7701\u4f60\u7684token",
"primary_language": "Python",
"star_count": "87",
"topics": [],
"last_updated": "24 days ago"
},
{
"repository_name": "leonardogrig/crawl4ai-deepseek-example",
"repository_owner": "leonardogrig/crawl4ai-deepseek-example",
"repository_url": "/leonardogrig/crawl4ai-deepseek-example",
"primary_language": "Python",
"star_count": "29",
"topics": [],
"last_updated": "on 18 Jan"
},
{
"repository_name": "laurentvv/crawl4ai-mcp",
"repository_owner": "laurentvv/crawl4ai-mcp",
"repository_url": "/laurentvv/crawl4ai-mcp",
"description": "Web crawling tool that integrates with AI assistants via the MCP",
"primary_language": "Python",
"star_count": "10",
"topics": [
{},
{},
{},
{},
{}
],
"last_updated": "on 16 Mar"
},
{
"repository_name": "kaymen99/ai-web-scraper",
"repository_owner": "kaymen99/ai-web-scraper",
"repository_url": "/kaymen99/ai-web-scraper",
"description": "AI web scraper built withCrawl4AIfor extracting structured leads data from websites.",
"primary_language": "Python",
"star_count": "30",
"topics": [
{},
{},
{},
{},
{}
],
"last_updated": "on 13 Feb"
},
{
"repository_name": "atakkant/ai_web_crawler",
"repository_owner": "atakkant/ai_web_crawler",
"repository_url": "/atakkant/ai_web_crawler",
"description": "crawl4ai, DeepSeek, Groq",
"primary_language": "Python",
"star_count": "9",
"topics": [],
"last_updated": "on 19 Feb"
},
{
"repository_name": "Croups/auto-scraper-with-llms",
"repository_owner": "Croups/auto-scraper-with-llms",
"repository_url": "/Croups/auto-scraper-with-llms",
"description": "Web scraping AI that leverages thecrawl4ailibrary to extract structured data from web pages using various large language models (LLMs).",
"primary_language": "Python",
"star_count": "49",
"topics": [],
"last_updated": "on 8 Apr"
},
{
"repository_name": "leonardogrig/crawl4ai_llm_examples",
"repository_owner": "leonardogrig/crawl4ai_llm_examples",
"repository_url": "/leonardogrig/crawl4ai_llm_examples",
"primary_language": "Python",
"star_count": "8",
"topics": [],
"last_updated": "on 29 Jan"
}
]
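
Review note on the extracted records: `star_count` is stored as a display string such as "45.1k", `repository_owner` repeats the full `owner/name` slug, and `repository_url` is a relative path. If downstream code needs numeric stars and separate owner and name fields, a post-processing step along the following lines could be applied. This is a minimal sketch under those assumptions; `parse_star_count` and `normalize_repo` are hypothetical helpers, not part of crawl4ai.

```python
def parse_star_count(text: str) -> int:
    """Convert a display string such as '45.1k' or '748' to an integer."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


def normalize_repo(record: dict) -> dict:
    """Split the 'owner/name' slug, absolutize the URL, and parse the stars."""
    owner, _, name = record["repository_name"].partition("/")
    url = record["repository_url"]
    if not url.startswith("http"):
        url = "https://github.com" + url
    return {
        **record,
        "repository_owner": owner,
        "repository_name": name,
        "repository_url": url,
        "star_count": parse_star_count(record["star_count"]),
    }


if __name__ == "__main__":
    sample = {
        "repository_name": "unclecode/crawl4ai",
        "repository_owner": "unclecode/crawl4ai",
        "repository_url": "/unclecode/crawl4ai",
        "star_count": "45.1k",
    }
    print(normalize_repo(sample))
```

Rounded display values lose precision, so the parsed number is approximate for large repositories.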

@@ -0,0 +1,66 @@
{
"name": "GitHub Repository Cards",
"baseSelector": "div.Box-sc-g0xbh4-0.iwUbcA",
"fields": [
{
"name": "repository_name",
"selector": "div.search-title a span",
"type": "text",
"transform": "strip"
},
{
"name": "repository_owner",
"selector": "div.search-title a span",
"type": "text",
"transform": "split",
"pattern": "/"
},
{
"name": "repository_url",
"selector": "div.search-title a",
"type": "attribute",
"attribute": "href",
"transform": "prepend",
"pattern": "https://github.com"
},
{
"name": "description",
"selector": "div.dcdlju span",
"type": "text"
},
{
"name": "primary_language",
"selector": "ul.bZkODq li span[aria-label]",
"type": "text"
},
{
"name": "star_count",
"selector": "ul.bZkODq li a[href*='stargazers'] span",
"type": "text",
"transform": "strip"
},
{
"name": "topics",
"type": "list",
"selector": "div.jgRnBg div a",
"fields": [
{
"name": "topic_name",
"selector": "a",
"type": "text"
}
]
},
{
"name": "last_updated",
"selector": "ul.bZkODq li span[title]",
"type": "text"
},
{
"name": "has_sponsor_button",
"selector": "button[aria-label*='Sponsor']",
"type": "text",
"transform": "exists"
}
]
}

@@ -0,0 +1,39 @@
(async () => {
const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => {
const el = document.querySelector(selector);
if (el) return resolve(el);
const observer = new MutationObserver(() => {
const el = document.querySelector(selector);
if (el) {
observer.disconnect();
resolve(el);
}
});
observer.observe(document.body, { childList: true, subtree: true });
setTimeout(() => {
observer.disconnect();
reject(new Error(`Timeout waiting for ${selector}`));
}, timeout);
});
try {
const searchInput = await waitForElement('#adv_code_search input[type="text"]');
searchInput.value = 'crawl4AI';
searchInput.dispatchEvent(new Event('input', { bubbles: true }));
const languageSelect = await waitForElement('#search_language');
languageSelect.value = 'Python';
languageSelect.dispatchEvent(new Event('change', { bubbles: true }));
const starsInput = await waitForElement('#search_stars');
starsInput.value = '>10000';
starsInput.dispatchEvent(new Event('input', { bubbles: true }));
const searchButton = await waitForElement('#adv_code_search button[type="submit"]');
searchButton.click();
await waitForElement('.codesearch-results, #search-results');
} catch (e) {
console.error('Search script failed:', e.message);
}
})();

@@ -0,0 +1,211 @@
#!/usr/bin/env python3
"""
GitHub Advanced Search Example using Crawl4AI

This example demonstrates:
1. Using an LLM to generate a JavaScript interaction script from HTML snippets
2. Single arun() call with navigation, search form filling, and extraction
3. JSON CSS extraction for structured repository data
4. Complete workflow: navigate → fill form → submit → extract results

Requirements:
- Crawl4AI with generate_script support
- LLM API key (configured in environment)
"""
import asyncio
import base64
import json
from pathlib import Path
from typing import Any, Dict

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.script.c4a_compile import C4ACompiler


class GitHubSearchScraper:
    def __init__(self):
        self.base_dir = Path(__file__).parent
        self.search_script_path = self.base_dir / "generated_search_script.js"
        self.schema_path = self.base_dir / "generated_result_schema.json"
        self.results_path = self.base_dir / "extracted_repositories.json"
        self.session_id = "github_search_session"

    async def generate_search_script(self) -> str:
        """Generate JavaScript for GitHub advanced search interaction"""
        print("🔧 Generating search script from search_form.html...")

        # Check if already generated
        if self.search_script_path.exists():
            print("✅ Using cached search script")
            return self.search_script_path.read_text()

        # Read the search form HTML
        search_form_html = (self.base_dir / "search_form.html").read_text()

        # Generate script using LLM
        search_goal = """
        Search for crawl4AI repositories written in Python with more than 10000 stars:
        1. Wait for the main search input to be visible
        2. Type "crawl4AI" into the main search box
        3. Select "Python" from the language dropdown (#search_language)
        4. Type ">10000" into the stars input field (#search_stars)
        5. Click the search button to submit the form
        6. Wait for the search results to appear
        """

        try:
            script = C4ACompiler.generate_script(
                html=search_form_html,
                query=search_goal,
                mode="js"
            )
            # Save for future use
            self.search_script_path.write_text(script)
            print("✅ Search script generated and saved!")
            print(f"📄 Script preview:\n{script[:500]}...")
            return script
        except Exception as e:
            print(f"❌ Error generating search script: {e}")
            raise

    async def generate_result_schema(self) -> Dict[str, Any]:
        """Generate JSON CSS extraction schema from result HTML"""
        print("\n🔧 Generating result extraction schema...")

        # Check if already generated
        if self.schema_path.exists():
            print("✅ Using cached extraction schema")
            return json.loads(self.schema_path.read_text())

        # Read the result HTML
        result_html = (self.base_dir / "result.html").read_text()

        # Generate extraction schema using LLM
        schema_goal = """
        Create a JSON CSS extraction schema to extract from each repository card:
        - Repository name (the repository name only, not including owner)
        - Repository owner (organization or username)
        - Repository URL (full GitHub URL)
        - Description
        - Primary programming language
        - Star count (numeric value)
        - Topics/tags (array of topic names)
        - Last updated (time ago string)
        - Whether it has a sponsor button
        The schema should handle multiple repository results on the search results page.
        """

        try:
            # Generate schema
            schema = JsonCssExtractionStrategy.generate_schema(
                html=result_html,
                query=schema_goal,
            )
            # Save for future use
            self.schema_path.write_text(json.dumps(schema, indent=2))
            print("✅ Extraction schema generated and saved!")
            print(f"📄 Schema fields: {[f['name'] for f in schema['fields']]}")
            return schema
        except Exception as e:
            print(f"❌ Error generating schema: {e}")
            raise

    async def crawl_github(self):
        """Main crawling logic with single arun() call"""
        print("\n🚀 Starting GitHub repository search...")

        # Generate scripts and schemas
        search_script = await self.generate_search_script()
        result_schema = await self.generate_result_schema()

        # Configure browser (headless=False to see the action)
        browser_config = BrowserConfig(
            headless=False,
            verbose=True,
            viewport_width=1920,
            viewport_height=1080
        )

        async with AsyncWebCrawler(config=browser_config) as crawler:
            print("\n📍 Navigating to GitHub advanced search and executing search...")

            # Single call: Navigate, execute search, and extract results
            search_config = CrawlerRunConfig(
                session_id=self.session_id,
                js_code=search_script,  # Execute generated JS
                # wait_for="[data-testid='results-list']",  # Wait for search results
                wait_for=".Box-sc-g0xbh4-0.iwUbcA",  # Wait for search results
                extraction_strategy=JsonCssExtractionStrategy(schema=result_schema),
                delay_before_return_html=3.0,  # Give time for results to fully load
                screenshot=True,  # Needed for result.screenshot to be populated
                cache_mode=CacheMode.BYPASS  # Don't cache for fresh results
            )

            result = await crawler.arun(
                url="https://github.com/search/advanced",
                config=search_config
            )

            if not result.success:
                print("❌ Failed to search GitHub")
                print(f"Error: {result.error_message}")
                return

            print("✅ Search and extraction completed successfully!")

            # Extract and save results
            if result.extracted_content:
                repositories = json.loads(result.extracted_content)
                print(f"\n🔍 Found {len(repositories)} repositories matching criteria")

                # Save results
                self.results_path.write_text(
                    json.dumps(repositories, indent=2)
                )
                print(f"💾 Results saved to: {self.results_path}")

                # Print sample results. The keys follow the field names in
                # generated_result_schema.json; adjust them if the schema is regenerated.
                print("\n📊 Sample Results:")
                for i, repo in enumerate(repositories[:5], 1):
                    description = repo.get('description') or 'No description'
                    print(f"\n{i}. {repo.get('repository_name', 'Unknown')}")
                    print(f" Owner: {repo.get('repository_owner', 'Unknown')}")
                    print(f" Description: {description[:80]}...")
                    print(f" Language: {repo.get('primary_language', 'Unknown')}")
                    print(f" Stars: {repo.get('star_count', 'Unknown')}")
                    print(f" Updated: {repo.get('last_updated', 'Unknown')}")
                    # Topics are extracted as a list of {"topic_name": ...} objects
                    topic_names = [
                        t['topic_name'] for t in repo.get('topics', [])
                        if t.get('topic_name')
                    ]
                    if topic_names:
                        print(f" Topics: {', '.join(topic_names[:5])}")
                    print(f" URL: {repo.get('repository_url', 'Unknown')}")
            else:
                print("❌ No repositories extracted")

            # Save screenshot for reference (returned as a base64-encoded string)
            if result.screenshot:
                screenshot_path = self.base_dir / "search_results_screenshot.png"
                with open(screenshot_path, "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
                print(f"\n📸 Screenshot saved to: {screenshot_path}")


async def main():
    """Run the GitHub search scraper"""
    scraper = GitHubSearchScraper()
    await scraper.crawl_github()

    print("\n🎉 GitHub search example completed!")
    print("Check the generated files:")
    print(" - generated_search_script.js")
    print(" - generated_result_schema.json")
    print(" - extracted_repositories.json")
    print(" - search_results_screenshot.png")


if __name__ == "__main__":
    asyncio.run(main())
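
Review note on caching: `generate_search_script` and `generate_result_schema` both follow the same pattern of reusing a file on disk if it exists and otherwise generating and saving it. The pattern can be sketched in isolation as follows. `load_or_generate_json` is a hypothetical helper name used for illustration only, and the `generate` callable stands in for the LLM call.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_or_generate_json(path: Path, generate: Callable[[], Any]) -> Any:
    """Return the JSON value cached at `path`, generating and saving it if absent."""
    if path.exists():
        # Cache hit: skip the (potentially slow or costly) generation step
        return json.loads(path.read_text(encoding="utf-8"))
    value = generate()
    path.write_text(json.dumps(value, indent=2), encoding="utf-8")
    return value


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        cache_file = Path(tmp) / "schema.json"
        first = load_or_generate_json(cache_file, lambda: {"name": "demo", "fields": []})
        # The second call reads the file, so this generator is never invoked
        second = load_or_generate_json(cache_file, lambda: {"name": "ignored"})
        print(first == second)
```

One consequence of this design is that a stale cached script or schema keeps being used after the target page changes, so deleting the generated files is the way to force regeneration.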

@@ -0,0 +1,54 @@
<div class="Box-sc-g0xbh4-0 iwUbcA"><div class="Box-sc-g0xbh4-0 cSURfY"><div class="Box-sc-g0xbh4-0 gPrlij"><h3 class="Box-sc-g0xbh4-0 cvnppv"><div class="Box-sc-g0xbh4-0 kYLlPM"><div class="Box-sc-g0xbh4-0 eurdCD"><img data-component="Avatar" class="prc-Avatar-Avatar-ZRS-m" alt="" data-square="" width="20" height="20" src="https://github.com/TheAlgorithms.png?size=40" data-testid="github-avatar" style="--avatarSize-regular: 20px;"></div><div class="Box-sc-g0xbh4-0 MHoGG search-title"><a class="prc-Link-Link-85e08" href="/TheAlgorithms/Python"><span class="Box-sc-g0xbh4-0 kzfhBO search-match prc-Text-Text-0ima0">TheAlgorithms/<em>Python</em></span></a></div></div></h3><div class="Box-sc-g0xbh4-0 dcdlju"><span class="Box-sc-g0xbh4-0 gKFdvh search-match prc-Text-Text-0ima0">All Algorithms implemented in <em>Python</em></span></div><div class="Box-sc-g0xbh4-0 jgRnBg"><div><a class="Box-sc-g0xbh4-0 hIVEGR prc-Link-Link-85e08" href="/topics/python">python</a></div><div><a class="Box-sc-g0xbh4-0 hIVEGR prc-Link-Link-85e08" href="/topics/education">education</a></div><div><a class="Box-sc-g0xbh4-0 hIVEGR prc-Link-Link-85e08" href="/topics/algorithm">algorithm</a></div><div><a class="Box-sc-g0xbh4-0 hIVEGR prc-Link-Link-85e08" href="/topics/practice">practice</a></div><div><a class="Box-sc-g0xbh4-0 hIVEGR prc-Link-Link-85e08" href="/topics/interview">interview</a></div></div><ul class="Box-sc-g0xbh4-0 bZkODq"><li class="Box-sc-g0xbh4-0 eCfCAC"><div class="Box-sc-g0xbh4-0 hjDqIa"><div class="Box-sc-g0xbh4-0 fwSYsx"></div></div><span aria-label="Python language">Python</span></li><span class="Box-sc-g0xbh4-0 eXQoFa prc-Text-Text-0ima0" aria-hidden="true">·</span><li class="Box-sc-g0xbh4-0 eCfCAC"><a class="Box-sc-g0xbh4-0 iPuHRc prc-Link-Link-85e08" href="/TheAlgorithms/Python/stargazers" aria-label="201161 stars"><svg aria-hidden="true" focusable="false" class="octicon octicon-star Octicon-sc-9kayk9-0 kHVtWu" viewBox="0 0 16 16" width="16" height="16" fill="currentColor" 
display="inline-block" overflow="visible" style="vertical-align: text-bottom;"><path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.751.751 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25Zm0 2.445L6.615 5.5a.75.75 0 0 1-.564.41l-3.097.45 2.24 2.184a.75.75 0 0 1 .216.664l-.528 3.084 2.769-1.456a.75.75 0 0 1 .698 0l2.77 1.456-.53-3.084a.75.75 0 0 1 .216-.664l2.24-2.183-3.096-.45a.75.75 0 0 1-.564-.41L8 2.694Z"></path></svg><span class="prc-Text-Text-0ima0">201k</span></a></li><span class="Box-sc-g0xbh4-0 eXQoFa prc-Text-Text-0ima0" aria-hidden="true">·</span><li class="Box-sc-g0xbh4-0 eCfCAC"><span>Updated <div title="3 Jun 2025, 01:57 GMT+8" class="Truncate__StyledTruncate-sc-23o1d2-0 liVpTx"><span class="prc-Text-Text-0ima0" title="3 Jun 2025, 01:57 GMT+8">4 days ago</span></div></span></li></ul></div><div class="Box-sc-g0xbh4-0 gtlRHe"><div class="Box-sc-g0xbh4-0 fvaNTI"><button type="button" class="prc-Button-ButtonBase-c50BI" data-loading="false" data-size="small" data-variant="default" aria-describedby=":r1c:-loading-announcement"><span data-component="buttonContent" data-align="center" class="prc-Button-ButtonContent-HKbr-"><span data-component="leadingVisual" class="prc-Button-Visual-2epfX prc-Button-VisualWrap-Db-eB"><svg aria-hidden="true" focusable="false" class="octicon octicon-star" viewBox="0 0 16 16" width="16" height="16" fill="currentColor" display="inline-block" overflow="visible" style="vertical-align: text-bottom;"><path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.751.751 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79l.72-4.194L.818 6.374a.75.75 0 0 1 .416-1.28l4.21-.611L7.327.668A.75.75 0 0 1 8 .25Zm0 2.445L6.615 5.5a.75.75 0 0 1-.564.41l-3.097.45 2.24 2.184a.75.75 0 0 1 .216.664l-.528 3.084 2.769-1.456a.75.75 0 0 1 .698 0l2.77 
1.456-.53-3.084a.75.75 0 0 1 .216-.664l2.24-2.183-3.096-.45a.75.75 0 0 1-.564-.41L8 2.694Z"></path></svg></span><span data-component="text" class="prc-Button-Label-pTQ3x">Star</span></span></button></div><div class="Box-sc-g0xbh4-0 llZEgI"><div class="Box-sc-g0xbh4-0"> <button id="dialog-show-funding-links-modal-TheAlgorithms-Python" aria-label="Sponsor TheAlgorithms/Python" data-show-dialog-id="funding-links-modal-TheAlgorithms-Python" type="button" data-view-component="true" class="Button--secondary Button--small Button"> <span class="Button-content">
<span class="Button-label"><svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-heart icon-sponsor mr-1 color-fg-sponsors">
<path d="m8 14.25.345.666a.75.75 0 0 1-.69 0l-.008-.004-.018-.01a7.152 7.152 0 0 1-.31-.17 22.055 22.055 0 0 1-3.434-2.414C2.045 10.731 0 8.35 0 5.5 0 2.836 2.086 1 4.25 1 5.797 1 7.153 1.802 8 3.02 8.847 1.802 10.203 1 11.75 1 13.914 1 16 2.836 16 5.5c0 2.85-2.045 5.231-3.885 6.818a22.066 22.066 0 0 1-3.744 2.584l-.018.01-.006.003h-.002ZM4.25 2.5c-1.336 0-2.75 1.164-2.75 3 0 2.15 1.58 4.144 3.365 5.682A20.58 20.58 0 0 0 8 13.393a20.58 20.58 0 0 0 3.135-2.211C12.92 9.644 14.5 7.65 14.5 5.5c0-1.836-1.414-3-2.75-3-1.373 0-2.609.986-3.029 2.456a.749.749 0 0 1-1.442 0C6.859 3.486 5.623 2.5 4.25 2.5Z"></path>
</svg> <span data-view-component="true">Sponsor</span></span>
</span>
</button>
<dialog-helper>
<dialog id="funding-links-modal-TheAlgorithms-Python" aria-modal="true" aria-labelledby="funding-links-modal-TheAlgorithms-Python-title" aria-describedby="funding-links-modal-TheAlgorithms-Python-description" data-view-component="true" class="Overlay Overlay-whenNarrow Overlay--size-medium Overlay--motion-scaleFade Overlay--disableScroll">
<div data-view-component="true" class="Overlay-header">
<div class="Overlay-headerContentWrap">
<div class="Overlay-titleWrap">
<h1 class="Overlay-title " id="funding-links-modal-TheAlgorithms-Python-title">
Sponsor TheAlgorithms/Python
</h1>
</div>
<div class="Overlay-actionWrap">
<button data-close-dialog-id="funding-links-modal-TheAlgorithms-Python" aria-label="Close" type="button" data-view-component="true" class="close-button Overlay-closeButton"><svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-x">
<path d="M3.72 3.72a.75.75 0 0 1 1.06 0L8 6.94l3.22-3.22a.749.749 0 0 1 1.275.326.749.749 0 0 1-.215.734L9.06 8l3.22 3.22a.749.749 0 0 1-.326 1.275.749.749 0 0 1-.734-.215L8 9.06l-3.22 3.22a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042L6.94 8 3.72 4.78a.75.75 0 0 1 0-1.06Z"></path>
</svg></button>
</div>
</div>
</div>
<scrollable-region data-labelled-by="funding-links-modal-TheAlgorithms-Python-title" data-catalyst="" style="overflow: auto;">
<div data-view-component="true" class="Overlay-body"> <div class="text-left f5">
<div class="pt-3 color-bg-overlay">
<h5 class="flex-auto mb-3 mt-0">External links</h5>
<div class="d-flex mb-3">
<div class="circle mr-2 border d-flex flex-justify-center flex-items-center flex-shrink-0" style="width:24px;height:24px;">
<img width="16" height="16" class="octicon rounded-2 d-block" alt="liberapay" src="https://github.githubassets.com/assets/liberapay-48108ded7267.svg">
</div>
<div class="flex-auto min-width-0">
<a target="_blank" data-ga-click="Dashboard, click, Nav menu - item:org-profile context:organization" data-hydro-click="{&quot;event_type&quot;:&quot;sponsors.repo_funding_links_link_click&quot;,&quot;payload&quot;:{&quot;platform&quot;:{&quot;platform_type&quot;:&quot;LIBERAPAY&quot;,&quot;platform_url&quot;:&quot;https://liberapay.com/TheAlgorithms&quot;},&quot;platforms&quot;:[{&quot;platform_type&quot;:&quot;LIBERAPAY&quot;,&quot;platform_url&quot;:&quot;https://liberapay.com/TheAlgorithms&quot;}],&quot;repo_id&quot;:63476337,&quot;owner_id&quot;:20487725,&quot;user_id&quot;:12494079,&quot;originating_url&quot;:&quot;https://github.com/TheAlgorithms/Python/funding_links?fragment=1&quot;}}" data-hydro-click-hmac="123b5aa7d5ffff5ef0530f8e7fbaebcb564e8de1af26f1b858a19b0e1d4f9e5f" href="https://liberapay.com/TheAlgorithms"><span>liberapay.com/<strong>TheAlgorithms</strong></span></a>
</div>
</div>
</div>
<div class="text-small p-3 border-top">
<p class="my-0">
<a class="Link--inTextBlock" href="https://docs.github.com/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/displaying-a-sponsor-button-in-your-repository">Learn more about funding links in repositories</a>.
</p>
<p class="my-0">
<a class="Link--secondary" href="/contact/report-abuse?report=TheAlgorithms%2FPython+%28Repository+Funding+Links%29">Report abuse</a>
</p>
</div>
</div>
</div>
</scrollable-region>
</dialog></dialog-helper>
</div></div></div></div></div>

@@ -0,0 +1,336 @@
<form id="search_form" class="search_repos" data-turbo="false" action="/search" accept-charset="UTF-8" method="get">
<div class="pagehead codesearch-head color-border-muted">
<div class="container-lg p-responsive d-flex flex-column flex-md-row">
<h1 class="flex-shrink-0" id="search-title">Advanced search</h1>
<div class="search-form-fluid flex-auto d-flex flex-column flex-md-row pt-2 pt-md-0" id="adv_code_search">
<div class="flex-auto pr-md-2">
<label class="form-control search-page-label js-advanced-search-label">
<input aria-labelledby="search-title" class="form-control input-block search-page-input js-advanced-search-input js-advanced-search-prefix" data-search-prefix="" type="text" value="">
<p class="completed-query js-advanced-query top-0 right-0 left-0"><span></span> </p>
</label>
<input class="js-search-query" type="hidden" name="q" value="">
<input class="js-type-value" type="hidden" name="type" value="Repositories">
<input type="hidden" name="ref" value="advsearch">
</div>
<div class="d-flex d-md-block flex-shrink-0 pt-2 pt-md-0">
<button type="submit" data-view-component="true" class="btn flex-auto"> Search
</button>
</div>
</div>
</div>
</div>
<div class="container-lg p-responsive advanced-search-form">
<fieldset class="pb-3 mb-4 border-bottom color-border-muted min-width-0">
<h3>Advanced options</h3>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_from">From these owners</label></dt>
<dd><input id="search_from" type="text" class="form-control js-advanced-search-prefix" placeholder="github, atom, electron, octokit" data-search-prefix="user:"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_repos">In these repositories</label></dt>
<dd><input id="search_repos" type="text" class="form-control js-advanced-search-prefix" value="" placeholder="twbs/bootstrap, rails/rails" data-search-prefix="repo:"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_date">Created on the dates</label></dt>
<dd><input id="search_date" type="text" class="form-control js-advanced-search-prefix" value="" placeholder=">YYYY-MM-DD, YYYY-MM-DD" data-search-prefix="created:"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_language">Written in this language</label></dt>
<dd>
<select id="search_language" name="l" class="form-select js-advanced-search-prefix" data-search-prefix="language:">
<option value="">Any language</option>
<optgroup label="Popular">
<option value="C">C</option>
<option value="C#">C#</option>
<option value="C++">C++</option>
<option value="CoffeeScript">CoffeeScript</option>
<option value="CSS">CSS</option>
<option value="Dart">Dart</option>
<option value="DM">DM</option>
<option value="Elixir">Elixir</option>
<option value="Go">Go</option>
<option value="Groovy">Groovy</option>
<option value="HTML">HTML</option>
<option value="Java">Java</option>
<option value="JavaScript">JavaScript</option>
<option value="Kotlin">Kotlin</option>
<option value="Objective-C">Objective-C</option>
<option value="Perl">Perl</option>
<option value="PHP">PHP</option>
<option value="PowerShell">PowerShell</option>
<option value="Python">Python</option>
<option value="Ruby">Ruby</option>
<option value="Rust">Rust</option>
<option value="Scala">Scala</option>
<option value="Shell">Shell</option>
<option value="Swift">Swift</option>
<option value="TypeScript">TypeScript</option>
</optgroup>
<optgroup label="Everything else">
<option value="1C Enterprise">1C Enterprise</option>
<option value="2-Dimensional Array">2-Dimensional Array</option>
<option value="4D">4D</option>
<option value="ABAP">ABAP</option>
<option value="ABAP CDS">ABAP CDS</option>
<option value="ABNF">ABNF</option>
<option value="ActionScript">ActionScript</option>
<option value="Ada">Ada</option>
<option value="Adblock Filter List">Adblock Filter List</option>
<option value="Adobe Font Metrics">Adobe Font Metrics</option>
<option value="Agda">Agda</option>
<option value="AGS Script">AGS Script</option>
<option value="AIDL">AIDL</option>
<option value="Aiken">Aiken</option>
</optgroup>
</select>
</dd>
</dl>
</fieldset>
<fieldset class="pb-3 mb-4 border-bottom color-border-muted min-width-0">
<h3>Repositories options</h3>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_stars">With this many stars</label></dt>
<dd><input id="search_stars" type="text" class="form-control js-advanced-search-prefix" placeholder="0..100, 200, >1000" data-search-prefix="stars:" data-search-type="Repositories"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_forks">With this many forks</label></dt>
<dd><input id="search_forks" type="text" class="form-control js-advanced-search-prefix" placeholder="50..100, 200, <5" data-search-prefix="forks:" data-search-type="Repositories"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_size">Of this size</label></dt>
<dd><input id="search_size" type="text" class="form-control js-advanced-search-prefix" placeholder="Repository size in KB" data-search-prefix="size:" data-search-type="Repositories"></dd>
</dl>
<dl class="form-group flattened d-flex d-md-block flex-column">
<dt><label for="search_push">Pushed to</label></dt>
<dd><input id="search_push" type="text" class="form-control js-advanced-search-prefix" value="" placeholder="<YYYY-MM-DD" data-search-prefix="pushed:" data-search-type="Repositories"></dd>
</dl>
</fieldset>
</div>
</form>

View File

@@ -0,0 +1,7 @@
GO https://store.example.com/product/laptop
WAIT `.product-details` 8
CLICK `button.add-to-cart`
WAIT `.cart-notification` 3
CLICK `.cart-icon`
WAIT `.checkout-btn` 5
CLICK `.checkout-btn`

View File

@@ -0,0 +1,43 @@
# Advanced control flow with IF, EXISTS, and REPEAT
# Define reusable procedures
PROC handle_cookie_banner
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
IF (EXISTS `.privacy-notice`) THEN CLICK `.dismiss-privacy`
ENDPROC
PROC scroll_to_load
SCROLL DOWN 500
WAIT 0.5
ENDPROC
PROC try_login
CLICK `#email`
TYPE "user@example.com"
CLICK `#password`
TYPE "secure123"
CLICK `button[type="submit"]`
WAIT 2
ENDPROC
# Main script
GO https://example.com
WAIT 2
# Handle popups
handle_cookie_banner
# Conditional navigation based on login state
IF (EXISTS `.user-menu`) THEN CLICK `.dashboard-link` ELSE try_login
# Repeat scrolling a fixed number of times
REPEAT (scroll_to_load, 5)
# Load more content while button exists
REPEAT (CLICK `.load-more`, `document.querySelector('.load-more') && !document.querySelector('.no-more-content')`)
# Process items conditionally
IF (`document.querySelectorAll('.item').length > 10`) THEN EVAL `console.log('Found ' + document.querySelectorAll('.item').length + ' items')`
# Complex condition with viewport check
IF (`window.innerWidth < 768 && document.querySelector('.mobile-menu')`) THEN CLICK `.mobile-menu-toggle`

View File

@@ -0,0 +1,8 @@
GO https://myapp.com
WAIT 2
IF (EXISTS `.user-avatar`) THEN CLICK `.logout` ELSE CLICK `.login`
WAIT `#auth-form` 5
IF (EXISTS `#auth-form`) THEN TYPE "user@example.com"
IF (EXISTS `#auth-form`) THEN PRESS Tab
IF (EXISTS `#auth-form`) THEN TYPE "password123"
IF (EXISTS `#auth-form`) THEN CLICK `button[type="submit"]`

View File

@@ -0,0 +1,56 @@
# Data extraction example
# Scrapes product information from an e-commerce site
# Navigate to products page
GO https://shop.example.com/products
WAIT `.product-list` 10
# Scroll to load lazy-loaded content
SCROLL DOWN 500
WAIT 1
SCROLL DOWN 500
WAIT 1
SCROLL DOWN 500
WAIT 2
# Extract product data
EVAL `
// Extract all product information
const products = Array.from(document.querySelectorAll('.product-card')).map((card, index) => {
return {
id: index + 1,
name: card.querySelector('.product-title')?.textContent?.trim() || 'N/A',
price: card.querySelector('.price')?.textContent?.trim() || 'N/A',
rating: card.querySelector('.rating')?.textContent?.trim() || 'N/A',
availability: card.querySelector('.in-stock') ? 'In Stock' : 'Out of Stock',
image: card.querySelector('img')?.src || 'N/A'
};
});
// Log results
console.log('=== Product Extraction Results ===');
console.log('Total products found:', products.length);
console.log(JSON.stringify(products, null, 2));
// Save to localStorage for retrieval
localStorage.setItem('scraped_products', JSON.stringify(products));
`
# Optional: Click on first product for details
CLICK `.product-card:first-child`
WAIT `.product-details` 5
# Extract detailed information
EVAL `
const details = {
description: document.querySelector('.product-description')?.textContent?.trim(),
specifications: Array.from(document.querySelectorAll('.spec-item')).map(spec => ({
label: spec.querySelector('.spec-label')?.textContent,
value: spec.querySelector('.spec-value')?.textContent
})),
reviews: document.querySelector('.review-count')?.textContent
};
console.log('=== Product Details ===');
console.log(JSON.stringify(details, null, 2));
`

View File

@@ -0,0 +1,8 @@
GO https://company.com/contact
WAIT `form#contact` 10
TYPE "John Smith"
PRESS Tab
TYPE "john@email.com"
PRESS Tab
TYPE "Need help with my order"
CLICK `button[type="submit"]`

View File

@@ -0,0 +1,7 @@
GO https://news.example.com
WAIT `.article-list` 5
REPEAT (SCROLL DOWN 500, 3)
WAIT 1
REPEAT (CLICK `.load-more`, `document.querySelector('.load-more') !== null`)
WAIT 2
IF (`document.querySelectorAll('.article').length > 20`) THEN EVAL `console.log('Loaded enough articles')`

View File

@@ -0,0 +1,36 @@
# Login flow with error handling
# Demonstrates procedures, variables, and conditional checks
# Define login procedure
PROC perform_login
CLICK `input#email`
TYPE $email
CLICK `input#password`
TYPE $password
CLICK `button.login-submit`
ENDPROC
# Set credentials
SET email = "user@example.com"
SET password = "securePassword123"
# Navigate to login page
GO https://app.example.com/login
WAIT `.login-container` 15
# Attempt login
perform_login
# Wait for page to load
WAIT 3
# Check if login was successful
EVAL `
if (document.querySelector('.dashboard')) {
console.log('Login successful - on dashboard');
} else if (document.querySelector('.error-message')) {
console.log('Login failed:', document.querySelector('.error-message').textContent);
} else {
console.log('Unknown state after login');
}
`

View File

@@ -0,0 +1,106 @@
# Multi-step e-commerce workflow
# Complete purchase flow with procedures and error handling
# Reusable procedures
PROC search_product
CLICK `input.search-bar`
TYPE $search_term
PRESS Enter
WAIT `.search-results` 10
ENDPROC
PROC add_first_item_to_cart
CLICK `.product-item:first-child .add-to-cart`
WAIT `.added-to-cart-notification` 3
ENDPROC
PROC go_to_checkout
CLICK `.cart-icon`
WAIT `.cart-drawer` 5
CLICK `button.proceed-to-checkout`
WAIT `.checkout-page` 10
ENDPROC
PROC fill_customer_info
# Billing information
CLICK `#billing-firstname`
TYPE $first_name
CLICK `#billing-lastname`
TYPE $last_name
CLICK `#billing-email`
TYPE $email
CLICK `#billing-phone`
TYPE $phone
# Address
CLICK `#billing-address`
TYPE $address
CLICK `#billing-city`
TYPE $city
CLICK `#billing-state`
TYPE $state
CLICK `#billing-zip`
TYPE $zip
ENDPROC
PROC select_shipping
CLICK `input[value="standard"]`
WAIT 1
ENDPROC
# Set all required variables
SET search_term = "wireless headphones"
SET first_name = "John"
SET last_name = "Doe"
SET email = "john.doe@example.com"
SET phone = "555-0123"
SET address = "123 Main Street"
SET city = "San Francisco"
SET state = "CA"
SET zip = "94105"
# Main workflow starts here
GO https://shop.example.com
WAIT `.homepage-loaded` 10
# Step 1: Search and add to cart
search_product
EVAL `console.log('Found', document.querySelectorAll('.product-item').length, 'products')`
add_first_item_to_cart
# Add a second item
CLICK `.product-item:nth-child(2) .add-to-cart`
WAIT 2
# Step 2: Go to checkout
go_to_checkout
# Step 3: Fill customer information
fill_customer_info
# Step 4: Select shipping method
select_shipping
# Step 5: Continue to payment
CLICK `button.continue-to-payment`
WAIT `.payment-section` 10
# Log order summary
EVAL `
const orderTotal = document.querySelector('.order-total')?.textContent;
const itemCount = document.querySelectorAll('.order-item').length;
console.log('=== Order Summary ===');
console.log('Items:', itemCount);
console.log('Total:', orderTotal);
// Get all items
const items = Array.from(document.querySelectorAll('.order-item')).map(item => ({
name: item.querySelector('.item-name')?.textContent,
quantity: item.querySelector('.item-quantity')?.textContent,
price: item.querySelector('.item-price')?.textContent
}));
console.log('Items:', JSON.stringify(items, null, 2));
`
# Note: Stopping here before actual payment submission
EVAL `console.log('Workflow completed - stopped before payment submission')`

View File

@@ -0,0 +1,8 @@
GO https://app.example.com
WAIT `.nav-menu` 8
CLICK `a[href="/products"]`
WAIT 2
CLICK `a[href="/about"]`
WAIT 2
BACK
WAIT 1

View File

@@ -0,0 +1,8 @@
GO https://myapp.com/login
WAIT `input#email` 5
CLICK `input#email`
TYPE "user@example.com"
PRESS Tab
TYPE "password123"
CLICK `button.login-btn`
WAIT `.dashboard` 10

View File

@@ -0,0 +1,7 @@
GO https://responsive.site.com
WAIT 2
IF (`window.innerWidth < 768`) THEN CLICK `.mobile-menu`
IF (`window.innerWidth < 768`) THEN WAIT `.mobile-nav` 3
IF (`window.innerWidth >= 768`) THEN CLICK `.desktop-menu li:nth-child(2)`
REPEAT (CLICK `.next-slide`, 5)
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`

View File

@@ -0,0 +1,8 @@
GO https://news.site.com
WAIT `.article-list` 10
SCROLL DOWN 500
WAIT 1
SCROLL DOWN 500
WAIT 1
CLICK `.article:nth-child(5)`
WAIT `.article-content` 5

View File

@@ -0,0 +1,7 @@
GO https://shop.example.com
WAIT `.search-bar` 10
CLICK `.search-bar`
TYPE "wireless headphones"
PRESS Enter
WAIT `.results` 5
CLICK `.product-card:first-child`

View File

@@ -0,0 +1,19 @@
# Simple form submission example
# This script fills out a contact form and submits it
GO https://example.com/contact
WAIT `form#contact-form` 10
# Fill out the form fields
CLICK `input[name="name"]`
TYPE "Alice Smith"
PRESS Tab
TYPE "alice@example.com"
PRESS Tab
TYPE "I'd like to learn more about your services"
# Submit the form
CLICK `button[type="submit"]`
# Wait for success message
WAIT "Thank you for your message" 5

View File

@@ -0,0 +1,11 @@
PROC fill_field
TYPE "test@example.com"
PRESS Tab
ENDPROC
GO https://forms.example.com
WAIT `form` 5
IF (EXISTS `input[type="email"]`) THEN CLICK `input[type="email"]`
IF (EXISTS `input[type="email"]`) THEN fill_field
REPEAT (PRESS Tab, `document.activeElement.type !== 'submit'`)
CLICK `button[type="submit"]`

View File

@@ -0,0 +1,396 @@
# C4A-Script Interactive Tutorial
A comprehensive web-based tutorial for learning and experimenting with C4A-Script - Crawl4AI's visual web automation language.
## 🚀 Quick Start
### Prerequisites
- Python 3.7+
- Modern web browser (Chrome, Firefox, Safari, Edge)
### Running the Tutorial
1. **Clone and Navigate**
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai/docs/examples/c4a_script/tutorial/
```
2. **Install Dependencies**
```bash
pip install flask
```
3. **Launch the Server**
```bash
python server.py
```
4. **Open in Browser**
```
http://localhost:8080
```
**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)
### Try Your First Script
```c4a
# Basic interaction
GO playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
CLICK `#start-tutorial`
```
## 🎯 What You'll Learn
### Core Features
- **📝 Text Editor**: Write C4A-Script with syntax highlighting
- **🧩 Visual Editor**: Build scripts using drag-and-drop Blockly interface
- **🎬 Recording Mode**: Capture browser actions and auto-generate scripts
- **⚡ Live Execution**: Run scripts in real-time with instant feedback
- **📊 Timeline View**: Visualize and edit automation steps
## 📚 Tutorial Content
### Basic Commands
- **Navigation**: `GO url`
- **Waiting**: `WAIT selector timeout` or `WAIT seconds`
- **Clicking**: `CLICK selector`
- **Typing**: `TYPE "text"`
- **Scrolling**: `SCROLL DOWN/UP amount`
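
The `WAIT` forms used in this tutorial can be told apart mechanically: a backticked selector, a double-quoted text string (as in `WAIT "Thank you for your message" 5`), or a bare number of seconds. The following is a minimal illustrative sketch, not the Crawl4AI compiler; `parse_wait` and its return shape are hypothetical names chosen for this example.

```python
import re

# Hypothetical helper (not part of Crawl4AI): classify a WAIT line.
WAIT_RE = re.compile(
    r'^WAIT\s+(?:`(?P<selector>[^`]+)`|"(?P<text>[^"]+)"|(?P<seconds>\d+(?:\.\d+)?))'
    r'(?:\s+(?P<timeout>\d+(?:\.\d+)?))?\s*$'
)


def parse_wait(line):
    """Return a dict describing which WAIT form the line uses."""
    m = WAIT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a WAIT command: {line!r}")
    timeout = float(m["timeout"]) if m["timeout"] else None
    if m["seconds"] is not None:
        # `WAIT seconds` is a plain delay and takes no extra timeout.
        if timeout is not None:
            raise ValueError("a plain delay takes no timeout")
        return {"kind": "delay", "seconds": float(m["seconds"])}
    if m["selector"] is not None:
        return {"kind": "selector", "target": m["selector"], "timeout": timeout}
    return {"kind": "text", "target": m["text"], "timeout": timeout}


print(parse_wait("WAIT `.content` 5"))
print(parse_wait("WAIT 0.5"))
```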
### Control Flow
- **Conditionals**: `IF (condition) THEN action`
- **Loops**: `REPEAT (action, condition)`
- **Procedures**: Define reusable command sequences
### Advanced Features
- **JavaScript evaluation**: `EVAL code`
- **Variables**: `SET name = "value"`
- **Complex selectors**: CSS selectors in backticks
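
To make the `SET` / `$name` relationship concrete, here is a small sketch of one way variable references could be resolved before execution. It is an assumption for illustration only: how Crawl4AI actually handles variables is not shown here, and `expand_variables` is a hypothetical name. It collects every `SET` first, because the example scripts define procedures that use `$email` before the `SET email = ...` line appears.

```python
import re

SET_RE = re.compile(r'^\s*SET\s+(\w+)\s*=\s*"(.*)"\s*$')
TYPE_VAR_RE = re.compile(r'^\s*TYPE\s+\$(\w+)\s*$')


def expand_variables(script):
    """Replace `TYPE $name` with `TYPE "value"` using the script's SET lines."""
    lines = script.splitlines()
    # First pass: collect all variable definitions.
    variables = {}
    for line in lines:
        m = SET_RE.match(line)
        if m:
            variables[m.group(1)] = m.group(2)
    # Second pass: drop SET lines and substitute TYPE arguments.
    expanded = []
    for line in lines:
        if SET_RE.match(line):
            continue
        m = TYPE_VAR_RE.match(line)
        if m:
            name = m.group(1)
            if name not in variables:
                raise KeyError(f"undefined variable: ${name}")
            line = f'TYPE "{variables[name]}"'
        expanded.append(line)
    return expanded


script = 'CLICK `#username`\nTYPE $username\nSET username = "demo"'
print(expand_variables(script))
```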
## 🎮 Interactive Playground Features
The tutorial includes a fully interactive web app with:
### 1. **Authentication System**
- Login form with validation
- Session management
- Protected content
### 2. **Dynamic Content**
- Infinite scroll products
- Pagination controls
- Load more buttons
### 3. **Complex Forms**
- Multi-step wizards
- Dynamic field visibility
- Form validation
### 4. **Interactive Elements**
- Tabs and accordions
- Modals and popups
- Expandable content
### 5. **Data Tables**
- Sortable columns
- Search functionality
- Export options
## 🛠️ Tutorial Features
### Live Code Editor
- Syntax highlighting
- Real-time compilation
- Error messages with suggestions
### JavaScript Output Viewer
- See generated JavaScript code
- Edit and test JS directly
- Understand the compilation
### Visual Execution
- Step-by-step progress
- Element highlighting
- Console output
### Example Scripts
Load pre-written examples demonstrating:
- Cookie banner handling
- Login workflows
- Infinite scroll automation
- Multi-step form completion
- Complex interaction sequences
## 📖 Tutorial Sections
### 1. Getting Started
Learn basic commands and syntax:
```c4a
GO https://example.com
WAIT `.content` 5
CLICK `.button`
```
### 2. Handling Dynamic Content
Master waiting strategies and conditionals:
```c4a
IF (EXISTS `.popup`) THEN CLICK `.close`
WAIT `.results` 10
```
### 3. Form Automation
Fill and submit forms:
```c4a
CLICK `#email`
TYPE "user@example.com"
CLICK `button[type="submit"]`
```
### 4. Advanced Workflows
Build complex automation flows:
```c4a
PROC login
  CLICK `#username`
  TYPE $username
  CLICK `#password`
  TYPE $password
  CLICK `#login-btn`
ENDPROC
SET username = "demo"
SET password = "pass123"
login
```
## 🎯 Practice Challenges
### Challenge 1: Cookie & Popups
Handle the cookie banner and newsletter popup that appear on page load.
### Challenge 2: Complete Login
Successfully log into the application using the demo credentials.
### Challenge 3: Load All Products
Use infinite scroll to load all 100 products in the catalog.
### Challenge 4: Multi-step Survey
Complete the entire multi-step survey form.
### Challenge 5: Full Workflow
Create a script that logs in, browses products, and exports data.
## 💡 Tips & Tricks
### 1. Use Specific Selectors
```c4a
# Good - specific
CLICK `button.submit-order`
# Bad - too generic
CLICK `button`
```
### 2. Always Handle Popups
```c4a
# Check for common popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-modal`) THEN CLICK `.close`
```
### 3. Add Appropriate Waits
```c4a
# Wait for elements before interacting
WAIT `.form` 5
CLICK `#submit`
```
### 4. Use Procedures for Reusability
```c4a
PROC handle_popups
  IF (EXISTS `.popup`) THEN CLICK `.close`
  IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
ENDPROC
# Use anywhere
handle_popups
```
## 🔧 Troubleshooting
### Common Issues
1. **"Element not found"**
- Add a WAIT before clicking
- Check selector specificity
- Verify element exists with IF
2. **"Timeout waiting for selector"**
- Increase timeout value
- Check if element is dynamically loaded
- Verify selector is correct
3. **"Missing THEN keyword"**
- All IF statements need THEN
- Format: `IF (condition) THEN action`
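
The "Missing THEN keyword" error can be caught before running a script with a simple pre-check. The sketch below is a hypothetical lint helper, not the tutorial's own error reporting; it ignores anything inside backticks so a `THEN` in a JavaScript condition does not count.

```python
import re


def find_missing_then(script):
    """Return 1-based line numbers of IF statements that lack THEN."""
    missing = []
    for number, line in enumerate(script.splitlines(), start=1):
        stripped = line.strip()
        if not re.match(r'IF\b', stripped):
            continue
        # Drop backticked selectors and JS so only keywords remain.
        outside = re.sub(r'`[^`]*`', '', stripped)
        if not re.search(r'\bTHEN\b', outside):
            missing.append(number)
    return missing


script = (
    "IF (EXISTS `.popup`) CLICK `.close`\n"
    "IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`"
)
print(find_missing_then(script))
```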
## 🚀 Using with Crawl4AI
Once you've mastered C4A-Script in the tutorial, use it with Crawl4AI:
```python
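import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    c4a_script="""
WAIT `.content` 5
IF (EXISTS `.load-more`) THEN CLICK `.load-more`
WAIT `.new-content` 3
"""
)


async def main():
    # `async with` and `await` are only valid inside a coroutine
    async with AsyncWebCrawler() as crawler:
        # Pass the target URL to arun() along with the config
        result = await crawler.arun(url="https://example.com", config=config)


asyncio.run(main())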
```
## 📝 Example Scripts
Check the `scripts/` folder for complete examples:
- `01-basic-interaction.c4a` - Getting started
- `02-login-flow.c4a` - Authentication
- `03-infinite-scroll.c4a` - Dynamic content
- `04-multi-step-form.c4a` - Complex forms
- `05-complex-workflow.c4a` - Full automation
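
As a rough sketch of how these files might be picked up from Python, the helpers below list the `.c4a` files in a folder and read one as text. The function names are hypothetical and the folder path is whatever your checkout uses.

```python
from pathlib import Path


def list_example_scripts(folder):
    """Return the names of all .c4a files in `folder`, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.c4a"))


def load_c4a_script(path):
    """Read a C4A-Script file and return its contents as a string."""
    return Path(path).read_text(encoding="utf-8")
```

The returned string can then be supplied as the `c4a_script` value of `CrawlerRunConfig`, as in the "Using with Crawl4AI" section.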
## 🏗️ Developer Guide
### Project Architecture
```
tutorial/
├── server.py # Flask application server
├── assets/ # Tutorial-specific assets
│ ├── app.js # Main application logic
│ ├── c4a-blocks.js # Custom Blockly blocks
│ ├── c4a-generator.js # Code generation
│ ├── blockly-manager.js # Blockly integration
│ └── styles.css # Main styling
├── playground/ # Interactive demo environment
│ ├── index.html # Demo web application
│ ├── app.js # Demo app logic
│ └── styles.css # Demo styling
├── scripts/ # Example C4A scripts
└── index.html # Main tutorial interface
```
### Key Components
#### 1. TutorialApp (`assets/app.js`)
Main application controller managing:
- Code editor integration (CodeMirror)
- Script execution and browser preview
- Tutorial navigation and lessons
- State management and persistence
#### 2. BlocklyManager (`assets/blockly-manager.js`)
Visual programming interface:
- Custom C4A-Script block definitions
- Bidirectional sync between visual blocks and text
- Real-time code generation
- Dark theme integration
#### 3. Recording System
Powers the recording functionality:
- Browser event capture
- Smart event grouping and filtering
- Automatic C4A-Script generation
- Timeline visualization
### Customization
#### Adding New Commands
1. **Define Block** (`assets/c4a-blocks.js`)
2. **Add Generator** (`assets/c4a-generator.js`)
3. **Update Parser** (`assets/blockly-manager.js`)
#### Themes and Styling
- Main styles: `assets/styles.css`
- Theme variables: CSS custom properties
- Dark mode: Auto-applied based on system preference
### Configuration
```python
# server.py configuration
PORT = 8080
DEBUG = True
THREADED = True
```
### API Endpoints
- `GET /` - Main tutorial interface
- `GET /playground/` - Interactive demo environment
- `POST /execute` - Script execution endpoint
- `GET /examples/<script>` - Load example scripts
## 🔧 Developer Troubleshooting
### Common Issues
**Port Already in Use**
```bash
# Kill existing process
lsof -ti:8080 | xargs kill -9
# Or use different port
python server.py --port 8081
```
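
For reference, a `--port` flag with an 8080 default can be handled with the standard library's `argparse`. This is only a sketch of the pattern, assuming the default matches `PORT = 8080` from the configuration above; the actual argument handling in `server.py` is not shown here and may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --port falls back to 8080."""
    parser = argparse.ArgumentParser(description="Tutorial server options (sketch)")
    parser.add_argument("--port", type=int, default=8080,
                        help="port to listen on (default: 8080)")
    return parser.parse_args(argv)


print(parse_args(["--port", "8081"]).port)
```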
**Blockly Not Loading**
- Check browser console for JavaScript errors
- Verify all static files are served correctly
- Ensure proper script loading order
**Recording Issues**
- Verify iframe permissions
- Check cross-origin communication
- Ensure event listeners are attached
### Debug Mode
Enable detailed logging by setting `DEBUG = True` in `assets/app.js`
## 📚 Additional Resources
- **[C4A-Script Documentation](../../md_v2/core/c4a-script.md)** - Complete language guide
- **[API Reference](../../md_v2/api/c4a-script-reference.md)** - Detailed command documentation
- **[Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try without installation
- **[Example Scripts](../)** - More automation examples
## 🤝 Contributing
### Bug Reports
1. Check existing issues on GitHub
2. Provide minimal reproduction steps
3. Include browser and system information
4. Add relevant console logs
### Feature Requests
1. Fork the repository
2. Create feature branch: `git checkout -b feature/my-feature`
3. Test thoroughly with different browsers
4. Update documentation
5. Submit pull request
### Code Style
- Use consistent indentation (2 spaces for JS, 4 for Python)
- Add comments for complex logic
- Follow existing naming conventions
- Test with multiple browsers
---
**Happy Automating!** 🎉
Need help? Check our [documentation](https://docs.crawl4ai.com) or open an issue on [GitHub](https://github.com/unclecode/crawl4ai).

View File

@@ -0,0 +1,906 @@
/* ================================================================
C4A-Script Tutorial - App Layout CSS
Terminal theme with Dank Mono font
================================================================ */
/* CSS Variables */
:root {
--bg-primary: #070708;
--bg-secondary: #0e0e10;
--bg-tertiary: #1a1a1b;
--border-color: #2a2a2c;
--border-hover: #3a3a3c;
--text-primary: #e0e0e0;
--text-secondary: #8b8b8d;
--text-muted: #606065;
--primary-color: #0fbbaa;
--primary-hover: #0da89a;
--primary-dim: #0a8577;
--error-color: #ff5555;
--warning-color: #ffb86c;
--success-color: #50fa7b;
--info-color: #8be9fd;
--code-bg: #1e1e20;
--modal-overlay: rgba(0, 0, 0, 0.8);
}
/* Base Reset */
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
/* Fonts */
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
}
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
}
@font-face {
font-family: 'Dank Mono';
src: url('DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
}
/* Body & App Container */
body {
font-family: 'Dank Mono', 'Monaco', 'Consolas', monospace;
background: var(--bg-primary);
color: var(--text-primary);
font-size: 14px;
line-height: 1.6;
overflow: hidden;
}
.app-container {
display: flex;
height: 100vh;
width: 100vw;
overflow: hidden;
}
/* Panels */
.editor-panel,
.playground-panel {
display: flex;
flex-direction: column;
height: 100%;
overflow: hidden;
}
.editor-panel {
flex: 1;
background: var(--bg-secondary);
border-right: 1px solid var(--border-color);
min-width: 400px;
}
.playground-panel {
flex: 1;
background: var(--bg-primary);
min-width: 400px;
}
/* Panel Headers */
.panel-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 12px 16px;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
flex-shrink: 0;
}
.panel-header h2 {
font-size: 16px;
font-weight: 600;
color: var(--primary-color);
margin: 0;
}
.header-actions {
display: flex;
gap: 8px;
}
/* Action Buttons */
.action-btn {
display: flex;
align-items: center;
gap: 6px;
padding: 6px 12px;
background: var(--bg-secondary);
color: var(--text-secondary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.action-btn:hover {
background: var(--bg-tertiary);
color: var(--text-primary);
border-color: var(--border-hover);
}
.action-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.action-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
.action-btn .icon {
font-size: 16px;
}
/* Editor Wrapper */
.editor-wrapper {
flex: 1;
display: flex;
overflow: hidden;
position: relative;
z-index: 1; /* Ensure it's above any potential overlays */
}
.editor-wrapper .CodeMirror {
flex: 1;
height: 100%;
font-family: 'Dank Mono', monospace;
font-size: 14px;
line-height: 1.5;
}
/* Ensure CodeMirror is interactive */
.CodeMirror {
background: var(--bg-primary) !important;
}
.CodeMirror-scroll {
overflow: auto !important;
}
/* Make cursor more visible */
.CodeMirror-cursor {
border-left: 2px solid var(--primary-color) !important;
border-left-width: 2px !important;
opacity: 1 !important;
visibility: visible !important;
}
/* Ensure cursor is visible when focused */
.CodeMirror-focused .CodeMirror-cursor {
visibility: visible !important;
}
/* Fix for CodeMirror in flex container */
.CodeMirror-sizer {
min-height: auto !important;
}
/* Remove aggressive pointer-events override */
.CodeMirror-code {
cursor: text;
}
.editor-wrapper textarea {
display: none;
}
/* Output Section (Bottom of Editor) */
.output-section {
height: 250px;
border-top: 1px solid var(--border-color);
display: flex;
flex-direction: column;
flex-shrink: 0;
}
/* Tabs */
.tabs {
display: flex;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
flex-shrink: 0;
}
.tab {
padding: 8px 20px;
background: transparent;
color: var(--text-secondary);
border: none;
border-bottom: 2px solid transparent;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.tab:hover {
color: var(--text-primary);
background: var(--bg-secondary);
}
.tab.active {
color: var(--primary-color);
border-bottom-color: var(--primary-color);
}
/* Tab Content */
.tab-content {
flex: 1;
overflow: hidden;
}
.tab-pane {
display: none;
height: 100%;
overflow-y: auto;
}
.tab-pane.active {
display: block;
}
/* Console */
.console {
padding: 12px;
background: var(--bg-primary);
font-size: 13px;
min-height: 100%;
}
.console-line {
margin-bottom: 8px;
display: flex;
align-items: flex-start;
gap: 8px;
}
.console-prompt {
color: var(--primary-color);
flex-shrink: 0;
}
.console-text {
color: var(--text-primary);
}
.console-error {
color: var(--error-color);
}
.console-warning {
color: var(--warning-color);
}
.console-success {
color: var(--success-color);
}
/* JavaScript Output */
.js-output-header {
display: flex;
justify-content: flex-end;
padding: 8px 12px;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--border-color);
}
.js-actions {
display: flex;
gap: 8px;
}
.mini-btn {
padding: 4px 8px;
background: var(--bg-secondary);
color: var(--text-secondary);
border: 1px solid var(--border-color);
border-radius: 3px;
font-size: 12px;
cursor: pointer;
transition: all 0.2s;
}
.mini-btn:hover {
background: var(--bg-primary);
color: var(--text-primary);
}
.js-output {
padding: 12px;
background: var(--code-bg);
color: var(--text-primary);
font-family: 'Dank Mono', monospace;
font-size: 13px;
line-height: 1.5;
white-space: pre-wrap;
margin: 0;
min-height: calc(100% - 44px);
}
/* Execution Progress */
.execution-progress {
padding: 12px;
background: var(--bg-primary);
}
.progress-item {
display: flex;
align-items: center;
gap: 8px;
margin-bottom: 8px;
font-size: 13px;
}
.progress-icon {
color: var(--text-muted);
}
.progress-item.active .progress-icon {
color: var(--info-color);
animation: pulse 1s infinite;
}
.progress-item.completed .progress-icon {
color: var(--success-color);
}
.progress-item.error .progress-icon {
color: var(--error-color);
}
/* Playground */
.playground-wrapper {
flex: 1;
overflow: hidden;
}
#playground-frame {
width: 100%;
height: 100%;
border: none;
background: var(--bg-secondary);
}
/* Tutorial Intro Modal */
.tutorial-intro-modal {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: var(--modal-overlay);
display: flex;
align-items: center;
justify-content: center;
z-index: 2000;
transition: opacity 0.3s;
}
.tutorial-intro-modal.hidden {
display: none;
}
.intro-content {
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 32px;
max-width: 500px;
box-shadow: 0 16px 48px rgba(0, 0, 0, 0.6);
}
.intro-content h2 {
color: var(--primary-color);
margin-bottom: 16px;
font-size: 24px;
}
.intro-content p {
color: var(--text-primary);
margin-bottom: 16px;
line-height: 1.6;
}
.intro-content ul {
list-style: none;
margin-bottom: 24px;
}
.intro-content li {
color: var(--text-secondary);
margin-bottom: 8px;
padding-left: 20px;
position: relative;
}
.intro-content li:before {
content: "▸";
position: absolute;
left: 0;
color: var(--primary-color);
}
.intro-actions {
display: flex;
gap: 12px;
justify-content: flex-end;
}
.intro-btn {
padding: 10px 24px;
background: var(--bg-secondary);
color: var(--text-primary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 14px;
cursor: pointer;
transition: all 0.2s;
}
.intro-btn:hover {
background: var(--bg-primary);
border-color: var(--border-hover);
}
.intro-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.intro-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
/* Tutorial Navigation Bar */
.tutorial-nav {
position: fixed;
top: 0;
left: 0;
right: 0;
background: var(--bg-tertiary);
border-bottom: 1px solid var(--primary-color);
z-index: 1000;
transition: transform 0.3s;
}
.tutorial-nav.hidden {
transform: translateY(-100%);
}
.tutorial-nav-content {
display: flex;
align-items: center;
justify-content: space-between;
padding: 16px 24px;
}
.tutorial-left {
flex: 1;
}
.tutorial-step-title {
display: flex;
align-items: center;
gap: 16px;
margin-bottom: 8px;
}
.tutorial-step-title span:first-child {
color: var(--text-secondary);
font-size: 12px;
text-transform: uppercase;
}
.tutorial-step-title span:last-child {
color: var(--primary-color);
font-weight: 600;
font-size: 16px;
}
.tutorial-description {
color: var(--text-primary);
margin: 0;
font-size: 14px;
max-width: 600px;
}
.tutorial-right {
display: flex;
align-items: center;
}
.tutorial-progress-bar {
height: 3px;
background: var(--bg-secondary);
position: absolute;
bottom: 0;
left: 0;
right: 0;
}
.tutorial-progress-bar .progress-fill {
height: 100%;
background: var(--primary-color);
transition: width 0.3s;
}
/* Adjust app container when tutorial is active */
.app-container.tutorial-active {
padding-top: 80px;
}
.tutorial-controls {
display: flex;
gap: 12px;
}
.nav-btn {
padding: 8px 16px;
background: var(--bg-secondary);
color: var(--text-primary);
border: 1px solid var(--border-color);
border-radius: 4px;
font-family: inherit;
font-size: 13px;
cursor: pointer;
transition: all 0.2s;
}
.nav-btn:hover:not(:disabled) {
background: var(--bg-primary);
border-color: var(--border-hover);
}
.nav-btn:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.nav-btn.primary {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
.nav-btn.primary:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
.exit-btn {
width: 32px;
height: 32px;
background: transparent;
color: var(--text-secondary);
border: none;
font-size: 20px;
cursor: pointer;
border-radius: 4px;
transition: all 0.2s;
margin-left: 16px;
}
.exit-btn:hover {
background: var(--bg-secondary);
color: var(--text-primary);
}
/* Fullscreen Mode */
.playground-panel.fullscreen {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
z-index: 1500;
}
/* Animations */
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
/* Scrollbar Styling */
::-webkit-scrollbar {
width: 10px;
height: 10px;
}
::-webkit-scrollbar-track {
background: var(--bg-secondary);
}
::-webkit-scrollbar-thumb {
background: var(--border-color);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--border-hover);
}
/* Responsive */
@media (max-width: 768px) {
.app-container {
flex-direction: column;
}
.editor-panel,
.playground-panel {
min-width: auto;
width: 100%;
}
.editor-panel {
border-right: none;
border-bottom: 1px solid var(--border-color);
}
.output-section {
height: 200px;
}
}
/* ================================================================
Recording Timeline Styles
================================================================ */
.action-btn.record {
background: var(--bg-tertiary);
border-color: var(--error-color);
}
.action-btn.record:hover {
background: var(--error-color);
border-color: var(--error-color);
}
.action-btn.record.recording {
background: var(--error-color);
animation: record-pulse 1.5s infinite;
}
.action-btn.record.recording .icon {
animation: blink 1s infinite;
}
/* Named separately so it does not override the earlier `pulse` keyframes used by .progress-icon */
@keyframes record-pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.8; }
}
@keyframes blink {
0%, 100% { opacity: 1; }
50% { opacity: 0.3; }
}
.editor-container {
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
#editor-view,
#timeline-view {
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
.recording-timeline {
background: var(--bg-secondary);
display: flex;
flex-direction: column;
height: 100%;
}
.timeline-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 10px 15px;
border-bottom: 1px solid var(--border-color);
background: var(--bg-tertiary);
}
.timeline-header h3 {
font-size: 14px;
font-weight: 600;
color: var(--text-primary);
margin: 0;
}
.timeline-actions {
display: flex;
gap: 8px;
}
.timeline-events {
flex: 1;
overflow-y: auto;
padding: 10px;
}
.timeline-event {
display: flex;
align-items: center;
padding: 8px 10px;
margin-bottom: 6px;
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
border-radius: 4px;
transition: all 0.2s;
cursor: pointer;
}
.timeline-event:hover {
border-color: var(--border-hover);
background: var(--code-bg);
}
.timeline-event.selected {
border-color: var(--primary-color);
background: rgba(15, 187, 170, 0.1);
}
.event-checkbox {
margin-right: 10px;
width: 16px;
height: 16px;
cursor: pointer;
}
.event-time {
font-size: 11px;
color: var(--text-muted);
margin-right: 10px;
font-family: 'Dank Mono', monospace;
min-width: 45px;
}
.event-command {
flex: 1;
font-family: 'Dank Mono', monospace;
font-size: 13px;
color: var(--text-primary);
}
.event-command .cmd-name {
color: var(--primary-color);
font-weight: 600;
}
.event-command .cmd-selector {
color: var(--info-color);
}
.event-command .cmd-value {
color: var(--warning-color);
}
.event-command .cmd-detail {
color: var(--text-secondary);
font-size: 11px;
margin-left: 5px;
}
.event-edit {
margin-left: 10px;
padding: 2px 8px;
font-size: 11px;
background: var(--bg-secondary);
border: 1px solid var(--border-color);
color: var(--text-secondary);
cursor: pointer;
border-radius: 3px;
transition: all 0.2s;
}
.event-edit:hover {
border-color: var(--primary-color);
color: var(--primary-color);
}
/* Event Editor Modal */
.modal-overlay {
position: fixed;
top: 0;
left: 0;
right: 0;
bottom: 0;
background: var(--modal-overlay);
z-index: 999;
}
.event-editor-modal {
position: fixed;
top: 50%;
left: 50%;
transform: translate(-50%, -50%);
background: var(--bg-secondary);
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 20px;
z-index: 1000;
min-width: 400px;
}
.event-editor-modal h4 {
margin: 0 0 15px 0;
color: var(--text-primary);
font-family: 'Dank Mono', monospace;
}
.editor-field {
margin-bottom: 15px;
}
.editor-field label {
display: block;
margin-bottom: 5px;
font-size: 12px;
color: var(--text-secondary);
font-family: 'Dank Mono', monospace;
}
.editor-field input,
.editor-field select {
width: 100%;
padding: 8px;
background: var(--bg-tertiary);
border: 1px solid var(--border-color);
color: var(--text-primary);
border-radius: 4px;
font-family: 'Dank Mono', monospace;
font-size: 13px;
}
.editor-field input:focus,
.editor-field select:focus {
outline: none;
border-color: var(--primary-color);
}
.editor-actions {
display: flex;
justify-content: flex-end;
gap: 10px;
margin-top: 20px;
}
/* Blockly Button */
#blockly-btn .icon {
font-size: 16px;
}
/* Hidden State */
.hidden {
display: none !important;
}

File diff suppressed because it is too large

@@ -0,0 +1,591 @@
// Blockly Manager for C4A-Script
// Handles Blockly workspace, code generation, and synchronization with text editor
class BlocklyManager {
constructor(tutorialApp) {
this.app = tutorialApp;
this.workspace = null;
this.isUpdating = false; // Prevent circular updates
this.blocklyVisible = false;
this.toolboxXml = this.generateToolbox();
this.init();
}
init() {
this.setupBlocklyContainer();
this.initializeWorkspace();
this.setupEventHandlers();
this.setupSynchronization();
}
setupBlocklyContainer() {
// Create blockly container div
const editorContainer = document.querySelector('.editor-container');
const blocklyDiv = document.createElement('div');
blocklyDiv.id = 'blockly-view';
blocklyDiv.className = 'blockly-workspace hidden';
blocklyDiv.style.height = '100%';
blocklyDiv.style.width = '100%';
editorContainer.appendChild(blocklyDiv);
}
generateToolbox() {
return `
<xml id="toolbox" style="display: none">
<category name="Navigation" colour="${BlockColors.NAVIGATION}">
<block type="c4a_go"></block>
<block type="c4a_reload"></block>
<block type="c4a_back"></block>
<block type="c4a_forward"></block>
</category>
<category name="Wait" colour="${BlockColors.WAIT}">
<block type="c4a_wait_time">
<field name="SECONDS">3</field>
</block>
<block type="c4a_wait_selector">
<field name="SELECTOR">#content</field>
<field name="TIMEOUT">10</field>
</block>
<block type="c4a_wait_text">
<field name="TEXT">Loading complete</field>
<field name="TIMEOUT">5</field>
</block>
</category>
<category name="Mouse Actions" colour="${BlockColors.ACTIONS}">
<block type="c4a_click">
<field name="SELECTOR">button.submit</field>
</block>
<block type="c4a_click_xy"></block>
<block type="c4a_double_click"></block>
<block type="c4a_right_click"></block>
<block type="c4a_move"></block>
<block type="c4a_drag"></block>
<block type="c4a_scroll">
<field name="DIRECTION">DOWN</field>
<field name="AMOUNT">500</field>
</block>
</category>
<category name="Keyboard" colour="${BlockColors.KEYBOARD}">
<block type="c4a_type">
<field name="TEXT">hello@example.com</field>
</block>
<block type="c4a_type_var">
<field name="VAR">email</field>
</block>
<block type="c4a_clear"></block>
<block type="c4a_set">
<field name="SELECTOR">#email</field>
<field name="VALUE">user@example.com</field>
</block>
<block type="c4a_press">
<field name="KEY">Tab</field>
</block>
<block type="c4a_key_down">
<field name="KEY">Shift</field>
</block>
<block type="c4a_key_up">
<field name="KEY">Shift</field>
</block>
</category>
<category name="Control Flow" colour="${BlockColors.CONTROL}">
<block type="c4a_if_exists">
<field name="SELECTOR">.cookie-banner</field>
</block>
<block type="c4a_if_exists_else">
<field name="SELECTOR">#user</field>
</block>
<block type="c4a_if_not_exists">
<field name="SELECTOR">.modal</field>
</block>
<block type="c4a_if_js">
<field name="CONDITION">window.innerWidth &lt; 768</field>
</block>
<block type="c4a_repeat_times">
<field name="TIMES">5</field>
</block>
<block type="c4a_repeat_while">
<field name="CONDITION">document.querySelector('.load-more')</field>
</block>
</category>
<category name="Variables" colour="${BlockColors.VARIABLES}">
<block type="c4a_setvar">
<field name="NAME">username</field>
<field name="VALUE">john@example.com</field>
</block>
<block type="c4a_eval">
<field name="CODE">console.log('Hello')</field>
</block>
</category>
<category name="Procedures" colour="${BlockColors.PROCEDURES}">
<block type="c4a_proc_def">
<field name="NAME">login</field>
</block>
<block type="c4a_proc_call">
<field name="NAME">login</field>
</block>
</category>
<category name="Comments" colour="#9E9E9E">
<block type="c4a_comment">
<field name="TEXT">Add comment here</field>
</block>
</category>
</xml>`;
}
initializeWorkspace() {
const blocklyDiv = document.getElementById('blockly-view');
// Dark theme configuration
const theme = Blockly.Theme.defineTheme('c4a-dark', {
'base': Blockly.Themes.Classic,
'componentStyles': {
'workspaceBackgroundColour': '#0e0e10',
'toolboxBackgroundColour': '#1a1a1b',
'toolboxForegroundColour': '#e0e0e0',
'flyoutBackgroundColour': '#1a1a1b',
'flyoutForegroundColour': '#e0e0e0',
'flyoutOpacity': 0.9,
'scrollbarColour': '#2a2a2c',
'scrollbarOpacity': 0.5,
'insertionMarkerColour': '#0fbbaa',
'insertionMarkerOpacity': 0.3,
'markerColour': '#0fbbaa',
'cursorColour': '#0fbbaa',
'selectedGlowColour': '#0fbbaa',
'selectedGlowOpacity': 0.4,
'replacementGlowColour': '#0fbbaa',
'replacementGlowOpacity': 0.5
},
'fontStyle': {
'family': 'Dank Mono, Monaco, Consolas, monospace',
'weight': 'normal',
'size': 13
}
});
this.workspace = Blockly.inject(blocklyDiv, {
toolbox: this.toolboxXml,
theme: theme,
grid: {
spacing: 20,
length: 3,
colour: '#2a2a2c',
snap: true
},
zoom: {
controls: true,
wheel: true,
startScale: 1.0,
maxScale: 3,
minScale: 0.3,
scaleSpeed: 1.2
},
trashcan: true,
sounds: false,
media: 'https://unpkg.com/blockly/media/'
});
// Add workspace change listener
this.workspace.addChangeListener((event) => {
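// Skip UI-only events (clicks, selection, viewport changes); newer Blockly flags these via isUiEvent
if (!this.isUpdating && event.type !== Blockly.Events.UI && !event.isUiEvent) {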
this.syncBlocksToCode();
}
});
}
setupEventHandlers() {
// Add blockly toggle button
const headerActions = document.querySelector('.editor-panel .header-actions');
const blocklyBtn = document.createElement('button');
blocklyBtn.id = 'blockly-btn';
blocklyBtn.className = 'action-btn';
blocklyBtn.title = 'Toggle Blockly Mode';
blocklyBtn.innerHTML = '<span class="icon">🧩</span>';
// Insert before the Run button
const runBtn = document.getElementById('run-btn');
headerActions.insertBefore(blocklyBtn, runBtn);
blocklyBtn.addEventListener('click', () => this.toggleBlocklyView());
}
setupSynchronization() {
// Listen to CodeMirror changes
this.app.editor.on('change', (instance, changeObj) => {
if (!this.isUpdating && this.blocklyVisible && changeObj.origin !== 'setValue') {
this.syncCodeToBlocks();
}
});
}
toggleBlocklyView() {
const editorView = document.getElementById('editor-view');
const blocklyView = document.getElementById('blockly-view');
const timelineView = document.getElementById('timeline-view');
const blocklyBtn = document.getElementById('blockly-btn');
this.blocklyVisible = !this.blocklyVisible;
if (this.blocklyVisible) {
// Show Blockly
editorView.classList.add('hidden');
timelineView.classList.add('hidden');
blocklyView.classList.remove('hidden');
blocklyBtn.classList.add('active');
// Resize workspace
Blockly.svgResize(this.workspace);
// Sync current code to blocks
this.syncCodeToBlocks();
} else {
// Show editor
blocklyView.classList.add('hidden');
editorView.classList.remove('hidden');
blocklyBtn.classList.remove('active');
// Refresh CodeMirror
setTimeout(() => this.app.editor.refresh(), 100);
}
}
syncBlocksToCode() {
if (this.isUpdating) return;
try {
this.isUpdating = true;
// Generate C4A-Script from blocks using our custom generator
if (typeof c4aGenerator !== 'undefined') {
const code = c4aGenerator.workspaceToCode(this.workspace);
// Process the code to maintain proper formatting
const lines = code.split('\n');
const formattedLines = [];
let lastWasComment = false;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
if (!line) continue;
const isComment = line.startsWith('#');
// Add blank line when transitioning between comments and commands
if (formattedLines.length > 0 && lastWasComment !== isComment) {
formattedLines.push('');
}
formattedLines.push(line);
lastWasComment = isComment;
}
const cleanCode = formattedLines.join('\n');
// Update CodeMirror
this.app.editor.setValue(cleanCode);
}
} catch (error) {
console.error('Error syncing blocks to code:', error);
} finally {
this.isUpdating = false;
}
}
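syncCodeToBlocks() {
if (this.isUpdating) return;
try {
this.isUpdating = true;
// Blockly queues change events and fires them asynchronously, so the
// isUpdating flag is already reset by the time they reach the listener.
// Suppress events while rebuilding so the rebuild is not echoed back
// into the text editor.
Blockly.Events.disable();
// Clear workspace
this.workspace.clear();
// Parse C4A-Script and generate blocks
const code = this.app.editor.getValue();
const blocks = this.parseC4AToBlocks(code);
if (blocks) {
Blockly.Xml.domToWorkspace(blocks, this.workspace);
}
} catch (error) {
console.error('Error syncing code to blocks:', error);
// Show error in console
this.app.addConsoleMessage(`Blockly sync error: ${error.message}`, 'warning');
} finally {
Blockly.Events.enable();
this.isUpdating = false;
}
}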
parseC4AToBlocks(code) {
const lines = code.split('\n');
const xml = document.createElement('xml');
let previousBlock = null;
for (let i = 0; i < lines.length; i++) {
const line = lines[i].trim();
// Skip empty lines
if (!line) continue;
// parseLineToBlock handles comments as well as commands and
// returns null for lines it cannot parse
const block = this.parseLineToBlock(line, i, lines);
if (!block) continue;
if (previousBlock) {
// Connect to previous block using <next>
const next = document.createElement('next');
next.appendChild(block);
previousBlock.appendChild(next);
} else {
// First block - set position
block.setAttribute('x', 20);
block.setAttribute('y', 20);
xml.appendChild(block);
}
previousBlock = block;
}
return xml;
}
parseLineToBlock(line, index, allLines) {
// Navigation commands
if (line.startsWith('GO ')) {
const url = line.substring(3).trim();
return this.createBlock('c4a_go', { 'URL': url });
}
if (line === 'RELOAD') {
return this.createBlock('c4a_reload');
}
if (line === 'BACK') {
return this.createBlock('c4a_back');
}
if (line === 'FORWARD') {
return this.createBlock('c4a_forward');
}
// Wait commands
if (line.startsWith('WAIT ')) {
const parts = line.substring(5).trim();
// Check if it's just a number (wait time)
if (/^\d+(\.\d+)?$/.test(parts)) {
return this.createBlock('c4a_wait_time', { 'SECONDS': parts });
}
// Check for selector wait
const selectorMatch = parts.match(/^`([^`]+)`\s+(\d+)$/);
if (selectorMatch) {
return this.createBlock('c4a_wait_selector', {
'SELECTOR': selectorMatch[1],
'TIMEOUT': selectorMatch[2]
});
}
// Check for text wait
const textMatch = parts.match(/^"([^"]+)"\s+(\d+)$/);
if (textMatch) {
return this.createBlock('c4a_wait_text', {
'TEXT': textMatch[1],
'TIMEOUT': textMatch[2]
});
}
}
// Click commands
if (line.startsWith('CLICK ')) {
const target = line.substring(6).trim();
// Check for coordinates
const coordMatch = target.match(/^(\d+)\s+(\d+)$/);
if (coordMatch) {
return this.createBlock('c4a_click_xy', {
'X': coordMatch[1],
'Y': coordMatch[2]
});
}
// Selector click
const selectorMatch = target.match(/^`([^`]+)`$/);
if (selectorMatch) {
return this.createBlock('c4a_click', {
'SELECTOR': selectorMatch[1]
});
}
}
// Other mouse actions
if (line.startsWith('DOUBLE_CLICK ')) {
const selector = line.substring(13).trim().match(/^`([^`]+)`$/);
if (selector) {
return this.createBlock('c4a_double_click', {
'SELECTOR': selector[1]
});
}
}
if (line.startsWith('RIGHT_CLICK ')) {
const selector = line.substring(12).trim().match(/^`([^`]+)`$/);
if (selector) {
return this.createBlock('c4a_right_click', {
'SELECTOR': selector[1]
});
}
}
// Scroll
if (line.startsWith('SCROLL ')) {
const match = line.match(/^SCROLL\s+(UP|DOWN|LEFT|RIGHT)(?:\s+(\d+))?$/);
if (match) {
return this.createBlock('c4a_scroll', {
'DIRECTION': match[1],
'AMOUNT': match[2] || '500'
});
}
}
// Type commands
if (line.startsWith('TYPE ')) {
const content = line.substring(5).trim();
// Variable type
if (content.startsWith('$')) {
return this.createBlock('c4a_type_var', {
'VAR': content.substring(1)
});
}
// Text type
const textMatch = content.match(/^"([^"]*)"$/);
if (textMatch) {
return this.createBlock('c4a_type', {
'TEXT': textMatch[1]
});
}
}
// SET command
if (line.startsWith('SET ')) {
const match = line.match(/^SET\s+`([^`]+)`\s+"([^"]*)"$/);
if (match) {
return this.createBlock('c4a_set', {
'SELECTOR': match[1],
'VALUE': match[2]
});
}
}
// CLEAR command
if (line.startsWith('CLEAR ')) {
const match = line.match(/^CLEAR\s+`([^`]+)`$/);
if (match) {
return this.createBlock('c4a_clear', {
'SELECTOR': match[1]
});
}
}
// SETVAR command
if (line.startsWith('SETVAR ')) {
const match = line.match(/^SETVAR\s+(\w+)\s*=\s*"([^"]*)"$/);
if (match) {
return this.createBlock('c4a_setvar', {
'NAME': match[1],
'VALUE': match[2]
});
}
}
// IF commands (simplified - only single line)
if (line.startsWith('IF ')) {
// IF EXISTS
const existsMatch = line.match(/^IF\s+\(EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+?)(?:\s+ELSE\s+(.+))?$/);
if (existsMatch) {
if (existsMatch[3]) {
// Has ELSE
const block = this.createBlock('c4a_if_exists_else', {
'SELECTOR': existsMatch[1]
});
// Parse then and else commands - simplified for now
return block;
} else {
// No ELSE
const block = this.createBlock('c4a_if_exists', {
'SELECTOR': existsMatch[1]
});
return block;
}
}
// IF NOT EXISTS
const notExistsMatch = line.match(/^IF\s+\(NOT\s+EXISTS\s+`([^`]+)`\)\s+THEN\s+(.+)$/);
if (notExistsMatch) {
const block = this.createBlock('c4a_if_not_exists', {
'SELECTOR': notExistsMatch[1]
});
return block;
}
}
// Comments
if (line.startsWith('#')) {
return this.createBlock('c4a_comment', {
'TEXT': line.substring(1).trim()
});
}
// If we can't parse it, return null
return null;
}
createBlock(type, fields = {}) {
const block = document.createElement('block');
block.setAttribute('type', type);
// Add fields
for (const [name, value] of Object.entries(fields)) {
const field = document.createElement('field');
field.setAttribute('name', name);
field.textContent = value;
block.appendChild(field);
}
return block;
}
}
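The line-to-block mapping in `parseLineToBlock` can be tried outside the browser. The following is a minimal sketch, not part of this change: `RULES` and `parseLine` are hypothetical names, only a subset of commands is covered, and it returns plain `{ type, fields }` objects instead of the DOM nodes that `createBlock` builds.

```javascript
// Hypothetical, DOM-free sketch of the mapping done by parseLineToBlock.
// Each rule pairs a pattern with a builder for a plain { type, fields } object.
const RULES = [
  [/^GO\s+(.+)$/, (m) => ({ type: 'c4a_go', fields: { URL: m[1].trim() } })],
  [/^(RELOAD|BACK|FORWARD)$/, (m) => ({ type: `c4a_${m[1].toLowerCase()}`, fields: {} })],
  [/^WAIT\s+(\d+(?:\.\d+)?)$/, (m) => ({ type: 'c4a_wait_time', fields: { SECONDS: m[1] } })],
  [/^WAIT\s+`([^`]+)`\s+(\d+)$/, (m) => ({
    type: 'c4a_wait_selector',
    fields: { SELECTOR: m[1], TIMEOUT: m[2] },
  })],
  [/^WAIT\s+"([^"]+)"\s+(\d+)$/, (m) => ({
    type: 'c4a_wait_text',
    fields: { TEXT: m[1], TIMEOUT: m[2] },
  })],
  [/^CLICK\s+(\d+)\s+(\d+)$/, (m) => ({ type: 'c4a_click_xy', fields: { X: m[1], Y: m[2] } })],
  [/^CLICK\s+`([^`]+)`$/, (m) => ({ type: 'c4a_click', fields: { SELECTOR: m[1] } })],
  [/^#(.*)$/, (m) => ({ type: 'c4a_comment', fields: { TEXT: m[1].trim() } })],
];

function parseLine(rawLine) {
  const line = rawLine.trim();
  for (const [pattern, build] of RULES) {
    const match = line.match(pattern);
    if (match) return build(match);
  }
  // Unrecognised lines yield null, mirroring parseLineToBlock.
  return null;
}

// Usage: inspect what a selector wait would become.
console.log(parseLine('WAIT `#content` 10'));
```

Keeping the rules in a table makes it easier to see which commands the text-to-blocks direction understands. Lines that match no rule return `null` and are dropped when the workspace is rebuilt, which is also how the manager above behaves.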

@@ -0,0 +1,238 @@
/* Blockly Theme CSS for C4A-Script */
/* Blockly workspace container */
.blockly-workspace {
position: relative;
width: 100%;
height: 100%;
background: var(--bg-primary);
}
/* Blockly button active state */
#blockly-btn.active {
background: var(--primary-color);
color: var(--bg-primary);
border-color: var(--primary-color);
}
#blockly-btn.active:hover {
background: var(--primary-hover);
border-color: var(--primary-hover);
}
/* Override Blockly's default styles for dark theme */
.blocklyToolboxDiv {
background-color: var(--bg-tertiary) !important;
border-right: 1px solid var(--border-color) !important;
}
.blocklyFlyout {
background-color: var(--bg-secondary) !important;
}
.blocklyFlyoutBackground {
fill: var(--bg-secondary) !important;
}
.blocklyMainBackground {
stroke: none !important;
}
.blocklyTreeRow {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
padding: 4px 16px !important;
margin: 2px 0 !important;
}
.blocklyTreeRow:hover {
background-color: var(--bg-secondary) !important;
}
.blocklyTreeSelected {
background-color: var(--primary-dim) !important;
}
.blocklyTreeLabel {
cursor: pointer;
}
/* Blockly scrollbars */
.blocklyScrollbarHorizontal,
.blocklyScrollbarVertical {
background-color: transparent !important;
}
.blocklyScrollbarHandle {
fill: var(--border-color) !important;
opacity: 0.5 !important;
}
.blocklyScrollbarHandle:hover {
fill: var(--border-hover) !important;
opacity: 0.8 !important;
}
/* Blockly zoom controls */
.blocklyZoom > image {
opacity: 0.6;
}
.blocklyZoom > image:hover {
opacity: 1;
}
/* Blockly trash can */
.blocklyTrash {
opacity: 0.6;
}
.blocklyTrash:hover {
opacity: 1;
}
/* Blockly context menus */
.blocklyContextMenu {
background-color: var(--bg-tertiary) !important;
border: 1px solid var(--border-color) !important;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}
.blocklyMenuItem {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
}
.blocklyMenuItemDisabled {
color: var(--text-muted) !important;
}
.blocklyMenuItem:hover {
background-color: var(--bg-secondary) !important;
}
/* Blockly text inputs */
.blocklyHtmlInput {
background-color: var(--bg-tertiary) !important;
color: var(--text-primary) !important;
border: 1px solid var(--border-color) !important;
font-family: 'Dank Mono', monospace !important;
font-size: 13px !important;
padding: 4px 8px !important;
}
.blocklyHtmlInput:focus {
border-color: var(--primary-color) !important;
outline: none !important;
}
/* Blockly dropdowns */
.blocklyDropDownDiv {
background-color: var(--bg-tertiary) !important;
border: 1px solid var(--border-color) !important;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3) !important;
}
.blocklyDropDownContent {
color: var(--text-primary) !important;
}
.blocklyDropDownDiv .goog-menuitem {
color: var(--text-primary) !important;
font-family: 'Dank Mono', monospace !important;
padding: 4px 16px !important;
}
.blocklyDropDownDiv .goog-menuitem-highlight,
.blocklyDropDownDiv .goog-menuitem-hover {
background-color: var(--bg-secondary) !important;
}
/* Custom block colors are defined in the block definitions */
/* Block text styling */
.blocklyText {
fill: #ffffff !important;
font-family: 'Dank Mono', monospace !important;
font-size: 13px !important;
}
.blocklyEditableText > .blocklyText {
fill: #ffffff !important;
}
.blocklyEditableText:hover > rect {
stroke: var(--primary-color) !important;
stroke-width: 2px !important;
}
/* Improve visibility of connection highlights */
.blocklyHighlightedConnectionPath {
stroke: var(--primary-color) !important;
stroke-width: 4px !important;
}
.blocklyInsertionMarker > .blocklyPath {
fill-opacity: 0.3 !important;
stroke-opacity: 0.6 !important;
}
/* Workspace grid pattern */
.blocklyWorkspace > .blocklyBlockCanvas > .blocklyGridCanvas {
opacity: 0.1;
}
/* Smooth transitions */
.blocklyDraggable {
transition: transform 0.1s ease;
}
/* Field labels */
.blocklyFieldLabel {
font-weight: normal !important;
}
/* Comment blocks styling */
.blocklyCommentText {
font-style: italic !important;
}
/* Make comment blocks slightly transparent */
g[data-category="Comments"] .blocklyPath {
fill-opacity: 0.8 !important;
}
/* Better visibility for disabled blocks */
.blocklyDisabled > .blocklyPath {
fill-opacity: 0.3 !important;
}
.blocklyDisabled > .blocklyText {
fill-opacity: 0.5 !important;
}
/* Warning and error text */
.blocklyWarningText,
.blocklyErrorText {
font-family: 'Dank Mono', monospace !important;
font-size: 12px !important;
}
/* Workspace scrollbar improvement for dark theme */
::-webkit-scrollbar {
width: 10px;
height: 10px;
}
::-webkit-scrollbar-track {
background: var(--bg-secondary);
}
::-webkit-scrollbar-thumb {
background: var(--border-color);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--border-hover);
}

@@ -0,0 +1,549 @@
// C4A-Script Blockly Block Definitions
// This file defines all custom blocks for C4A-Script commands
// Color scheme for different block categories
const BlockColors = {
NAVIGATION: '#1E88E5', // Blue
ACTIONS: '#43A047', // Green
CONTROL: '#FB8C00', // Orange
VARIABLES: '#8E24AA', // Purple
WAIT: '#E53935', // Red
KEYBOARD: '#00ACC1', // Cyan
PROCEDURES: '#6A1B9A' // Deep Purple
};
// Helper to create selector input with backticks
Blockly.Blocks['c4a_selector_input'] = {
init: function() {
this.appendDummyInput()
.appendField("`")
.appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
.appendField("`");
this.setOutput(true, "Selector");
this.setColour(BlockColors.ACTIONS);
this.setTooltip("CSS selector for element");
}
};
// ============================================
// NAVIGATION BLOCKS
// ============================================
Blockly.Blocks['c4a_go'] = {
init: function() {
this.appendDummyInput()
.appendField("GO")
.appendField(new Blockly.FieldTextInput("https://example.com"), "URL");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Navigate to URL");
}
};
Blockly.Blocks['c4a_reload'] = {
init: function() {
this.appendDummyInput()
.appendField("RELOAD");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Reload current page");
}
};
Blockly.Blocks['c4a_back'] = {
init: function() {
this.appendDummyInput()
.appendField("BACK");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Go back in browser history");
}
};
Blockly.Blocks['c4a_forward'] = {
init: function() {
this.appendDummyInput()
.appendField("FORWARD");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.NAVIGATION);
this.setTooltip("Go forward in browser history");
}
};
// ============================================
// WAIT BLOCKS
// ============================================
Blockly.Blocks['c4a_wait_time'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT")
.appendField(new Blockly.FieldNumber(1, 0), "SECONDS")
.appendField("seconds");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for specified seconds");
}
};
Blockly.Blocks['c4a_wait_selector'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT for")
.appendField("`")
.appendField(new Blockly.FieldTextInput("selector"), "SELECTOR")
.appendField("`")
.appendField("max")
.appendField(new Blockly.FieldNumber(10, 1), "TIMEOUT")
.appendField("sec");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for element to appear");
}
};
Blockly.Blocks['c4a_wait_text'] = {
init: function() {
this.appendDummyInput()
.appendField("WAIT for text")
.appendField(new Blockly.FieldTextInput("Loading complete"), "TEXT")
.appendField("max")
.appendField(new Blockly.FieldNumber(5, 1), "TIMEOUT")
.appendField("sec");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.WAIT);
this.setTooltip("Wait for text to appear on page");
}
};
// ============================================
// MOUSE ACTION BLOCKS
// ============================================
Blockly.Blocks['c4a_click'] = {
init: function() {
this.appendDummyInput()
.appendField("CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput("button"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Click on element");
}
};
Blockly.Blocks['c4a_click_xy'] = {
init: function() {
this.appendDummyInput()
.appendField("CLICK at")
.appendField("X:")
.appendField(new Blockly.FieldNumber(100, 0), "X")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(100, 0), "Y");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Click at coordinates");
}
};
Blockly.Blocks['c4a_double_click'] = {
init: function() {
this.appendDummyInput()
.appendField("DOUBLE_CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".item"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Double click on element");
}
};
Blockly.Blocks['c4a_right_click'] = {
init: function() {
this.appendDummyInput()
.appendField("RIGHT_CLICK")
.appendField("`")
.appendField(new Blockly.FieldTextInput("#menu"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Right click on element");
}
};
Blockly.Blocks['c4a_move'] = {
init: function() {
this.appendDummyInput()
.appendField("MOVE to")
.appendField("X:")
.appendField(new Blockly.FieldNumber(500, 0), "X")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(300, 0), "Y");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Move mouse to position");
}
};
Blockly.Blocks['c4a_drag'] = {
init: function() {
this.appendDummyInput()
.appendField("DRAG from")
.appendField("X:")
.appendField(new Blockly.FieldNumber(100, 0), "X1")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(100, 0), "Y1");
this.appendDummyInput()
.appendField("to")
.appendField("X:")
.appendField(new Blockly.FieldNumber(500, 0), "X2")
.appendField("Y:")
.appendField(new Blockly.FieldNumber(300, 0), "Y2");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Drag from one point to another");
}
};
Blockly.Blocks['c4a_scroll'] = {
init: function() {
this.appendDummyInput()
.appendField("SCROLL")
.appendField(new Blockly.FieldDropdown([
["DOWN", "DOWN"],
["UP", "UP"],
["LEFT", "LEFT"],
["RIGHT", "RIGHT"]
]), "DIRECTION")
.appendField(new Blockly.FieldNumber(500, 0), "AMOUNT")
.appendField("pixels");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.ACTIONS);
this.setTooltip("Scroll in direction");
}
};
// ============================================
// KEYBOARD BLOCKS
// ============================================
Blockly.Blocks['c4a_type'] = {
init: function() {
this.appendDummyInput()
.appendField("TYPE")
.appendField(new Blockly.FieldTextInput("text to type"), "TEXT");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Type text");
}
};
Blockly.Blocks['c4a_type_var'] = {
init: function() {
this.appendDummyInput()
.appendField("TYPE")
.appendField("$")
.appendField(new Blockly.FieldTextInput("variable"), "VAR");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Type variable value");
}
};
Blockly.Blocks['c4a_clear'] = {
init: function() {
this.appendDummyInput()
.appendField("CLEAR")
.appendField("`")
.appendField(new Blockly.FieldTextInput("input"), "SELECTOR")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Clear input field");
}
};
Blockly.Blocks['c4a_set'] = {
init: function() {
this.appendDummyInput()
.appendField("SET")
.appendField("`")
.appendField(new Blockly.FieldTextInput("#input"), "SELECTOR")
.appendField("`")
.appendField("to")
.appendField(new Blockly.FieldTextInput("value"), "VALUE");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Set input field value");
}
};
Blockly.Blocks['c4a_press'] = {
init: function() {
this.appendDummyInput()
.appendField("PRESS")
.appendField(new Blockly.FieldDropdown([
["Tab", "Tab"],
["Enter", "Enter"],
["Escape", "Escape"],
["Space", "Space"],
["ArrowUp", "ArrowUp"],
["ArrowDown", "ArrowDown"],
["ArrowLeft", "ArrowLeft"],
["ArrowRight", "ArrowRight"],
["Delete", "Delete"],
["Backspace", "Backspace"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Press and release key");
}
};
Blockly.Blocks['c4a_key_down'] = {
init: function() {
this.appendDummyInput()
.appendField("KEY_DOWN")
.appendField(new Blockly.FieldDropdown([
["Shift", "Shift"],
["Control", "Control"],
["Alt", "Alt"],
["Meta", "Meta"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Hold key down");
}
};
Blockly.Blocks['c4a_key_up'] = {
init: function() {
this.appendDummyInput()
.appendField("KEY_UP")
.appendField(new Blockly.FieldDropdown([
["Shift", "Shift"],
["Control", "Control"],
["Alt", "Alt"],
["Meta", "Meta"]
]), "KEY");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.KEYBOARD);
this.setTooltip("Release key");
}
};
// ============================================
// CONTROL FLOW BLOCKS
// ============================================
Blockly.Blocks['c4a_if_exists'] = {
init: function() {
this.appendDummyInput()
.appendField("IF EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element exists, then do something");
}
};
Blockly.Blocks['c4a_if_exists_else'] = {
init: function() {
this.appendDummyInput()
.appendField("IF EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.appendDummyInput()
.appendField("ELSE");
this.appendStatementInput("ELSE")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element exists, then do something, else do something else");
}
};
Blockly.Blocks['c4a_if_not_exists'] = {
init: function() {
this.appendDummyInput()
.appendField("IF NOT EXISTS")
.appendField("`")
.appendField(new Blockly.FieldTextInput(".element"), "SELECTOR")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If element does not exist, then do something");
}
};
Blockly.Blocks['c4a_if_js'] = {
init: function() {
this.appendDummyInput()
.appendField("IF")
.appendField("`")
.appendField(new Blockly.FieldTextInput("window.innerWidth < 768"), "CONDITION")
.appendField("`")
.appendField("THEN");
this.appendStatementInput("THEN")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("If JavaScript condition is true");
}
};
Blockly.Blocks['c4a_repeat_times'] = {
init: function() {
this.appendDummyInput()
.appendField("REPEAT")
.appendField(new Blockly.FieldNumber(5, 1), "TIMES")
.appendField("times");
this.appendStatementInput("DO")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("Repeat commands N times");
}
};
Blockly.Blocks['c4a_repeat_while'] = {
init: function() {
this.appendDummyInput()
.appendField("REPEAT WHILE")
.appendField("`")
.appendField(new Blockly.FieldTextInput("document.querySelector('.load-more')"), "CONDITION")
.appendField("`");
this.appendStatementInput("DO")
.setCheck(null);
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.CONTROL);
this.setTooltip("Repeat while condition is true");
}
};
// ============================================
// VARIABLE BLOCKS
// ============================================
Blockly.Blocks['c4a_setvar'] = {
init: function() {
this.appendDummyInput()
.appendField("SETVAR")
.appendField(new Blockly.FieldTextInput("username"), "NAME")
.appendField("=")
.appendField(new Blockly.FieldTextInput("value"), "VALUE");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.VARIABLES);
this.setTooltip("Set variable value");
}
};
// ============================================
// ADVANCED BLOCKS
// ============================================
Blockly.Blocks['c4a_eval'] = {
init: function() {
this.appendDummyInput()
.appendField("EVAL")
.appendField("`")
.appendField(new Blockly.FieldTextInput("console.log('Hello')"), "CODE")
.appendField("`");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.VARIABLES);
this.setTooltip("Execute JavaScript code");
}
};
Blockly.Blocks['c4a_comment'] = {
init: function() {
this.appendDummyInput()
.appendField("#")
.appendField(new Blockly.FieldTextInput("Comment", null, {
spellcheck: false,
class: 'blocklyCommentText'
}), "TEXT");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour("#616161");
this.setTooltip("Add a comment");
this.setStyle('comment_blocks');
}
};
// ============================================
// PROCEDURE BLOCKS
// ============================================
Blockly.Blocks['c4a_proc_def'] = {
init: function() {
this.appendDummyInput()
.appendField("PROC")
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
this.appendStatementInput("BODY")
.setCheck(null);
this.appendDummyInput()
.appendField("ENDPROC");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.PROCEDURES);
this.setTooltip("Define a procedure");
}
};
Blockly.Blocks['c4a_proc_call'] = {
init: function() {
this.appendDummyInput()
.appendField("Call")
.appendField(new Blockly.FieldTextInput("procedure_name"), "NAME");
this.setPreviousStatement(true, null);
this.setNextStatement(true, null);
this.setColour(BlockColors.PROCEDURES);
this.setTooltip("Call a procedure");
}
};
// Code generators have been moved to c4a-generator.js
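The generators mentioned above turn each block into C4A-Script text. As a rough, Blockly-free sketch of how a conditional block with several child statements could flatten into one `IF ... THEN` line per statement, the snippet below uses a hypothetical helper name (`prefixIfExists`) and a made-up selector; neither appears in the project files.

```javascript
// Hypothetical helper (not part of the project): mirrors the line-prefixing
// idea used for IF EXISTS blocks, without depending on Blockly.
function prefixIfExists(selector, thenCode) {
  // Nested statement code may arrive indented, so trim every line
  // and drop empty ones before adding the prefix.
  const lines = thenCode
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean);
  return lines
    .map(line => `IF (EXISTS \`${selector}\`) THEN ${line}\n`)
    .join('');
}

// Two child statements become two independent IF lines.
const out = prefixIfExists('.cookie-banner', 'CLICK `.accept`\n  WAIT 1');
console.log(out);
```

Each child statement is guarded separately, so the condition is re-evaluated for every emitted line rather than once for the whole body.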


@@ -0,0 +1,261 @@
// C4A-Script Code Generator for Blockly
// Uses the generator.forBlock registry (available in Blockly v10 and later)
// Create a custom code generator for C4A-Script
const c4aGenerator = new Blockly.Generator('C4A');
// Helper to read a field's raw value (no escaping is applied)
c4aGenerator.getFieldValue = function(block, fieldName) {
return block.getFieldValue(fieldName);
};
// Navigation generators
c4aGenerator.forBlock['c4a_go'] = function(block, generator) {
const url = generator.getFieldValue(block, 'URL');
return `GO ${url}\n`;
};
c4aGenerator.forBlock['c4a_reload'] = function(block, generator) {
return 'RELOAD\n';
};
c4aGenerator.forBlock['c4a_back'] = function(block, generator) {
return 'BACK\n';
};
c4aGenerator.forBlock['c4a_forward'] = function(block, generator) {
return 'FORWARD\n';
};
// Wait generators
c4aGenerator.forBlock['c4a_wait_time'] = function(block, generator) {
const seconds = generator.getFieldValue(block, 'SECONDS');
return `WAIT ${seconds}\n`;
};
c4aGenerator.forBlock['c4a_wait_selector'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const timeout = generator.getFieldValue(block, 'TIMEOUT');
return `WAIT \`${selector}\` ${timeout}\n`;
};
c4aGenerator.forBlock['c4a_wait_text'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
const timeout = generator.getFieldValue(block, 'TIMEOUT');
return `WAIT "${text}" ${timeout}\n`;
};
// Mouse action generators
c4aGenerator.forBlock['c4a_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_click_xy'] = function(block, generator) {
const x = generator.getFieldValue(block, 'X');
const y = generator.getFieldValue(block, 'Y');
return `CLICK ${x} ${y}\n`;
};
c4aGenerator.forBlock['c4a_double_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `DOUBLE_CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_right_click'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `RIGHT_CLICK \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_move'] = function(block, generator) {
const x = generator.getFieldValue(block, 'X');
const y = generator.getFieldValue(block, 'Y');
return `MOVE ${x} ${y}\n`;
};
c4aGenerator.forBlock['c4a_drag'] = function(block, generator) {
const x1 = generator.getFieldValue(block, 'X1');
const y1 = generator.getFieldValue(block, 'Y1');
const x2 = generator.getFieldValue(block, 'X2');
const y2 = generator.getFieldValue(block, 'Y2');
return `DRAG ${x1} ${y1} ${x2} ${y2}\n`;
};
c4aGenerator.forBlock['c4a_scroll'] = function(block, generator) {
const direction = generator.getFieldValue(block, 'DIRECTION');
const amount = generator.getFieldValue(block, 'AMOUNT');
return `SCROLL ${direction} ${amount}\n`;
};
// Keyboard generators
c4aGenerator.forBlock['c4a_type'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
return `TYPE "${text}"\n`;
};
c4aGenerator.forBlock['c4a_type_var'] = function(block, generator) {
const varName = generator.getFieldValue(block, 'VAR');
return `TYPE $${varName}\n`;
};
c4aGenerator.forBlock['c4a_clear'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
return `CLEAR \`${selector}\`\n`;
};
c4aGenerator.forBlock['c4a_set'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const value = generator.getFieldValue(block, 'VALUE');
return `SET \`${selector}\` "${value}"\n`;
};
c4aGenerator.forBlock['c4a_press'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `PRESS ${key}\n`;
};
c4aGenerator.forBlock['c4a_key_down'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `KEY_DOWN ${key}\n`;
};
c4aGenerator.forBlock['c4a_key_up'] = function(block, generator) {
const key = generator.getFieldValue(block, 'KEY');
return `KEY_UP ${key}\n`;
};
// Control flow generators
c4aGenerator.forBlock['c4a_if_exists'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
// Multi-line then block
// statementToCode indents nested lines, so trim each one before prefixing
const lines = thenCode.split('\n').map(line => line.trim()).filter(Boolean);
return lines.map(line => `IF (EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
// Single line
return `IF (EXISTS \`${selector}\`) THEN ${thenCode}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_if_exists_else'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
const elseCode = generator.statementToCode(block, 'ELSE').trim();
// Only the first statement of each branch is emitted; any further statements are dropped
const thenLine = thenCode.split('\n')[0];
const elseLine = elseCode.split('\n')[0];
if (thenLine && elseLine) {
return `IF (EXISTS \`${selector}\`) THEN ${thenLine} ELSE ${elseLine}\n`;
} else if (thenLine) {
return `IF (EXISTS \`${selector}\`) THEN ${thenLine}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_if_not_exists'] = function(block, generator) {
const selector = generator.getFieldValue(block, 'SELECTOR');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
const lines = thenCode.split('\n').map(line => line.trim()).filter(Boolean);
return lines.map(line => `IF (NOT EXISTS \`${selector}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
return `IF (NOT EXISTS \`${selector}\`) THEN ${thenCode}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_if_js'] = function(block, generator) {
const condition = generator.getFieldValue(block, 'CONDITION');
const thenCode = generator.statementToCode(block, 'THEN').trim();
if (thenCode.includes('\n')) {
const lines = thenCode.split('\n').map(line => line.trim()).filter(Boolean);
return lines.map(line => `IF (\`${condition}\`) THEN ${line}`).join('\n') + '\n';
} else if (thenCode) {
return `IF (\`${condition}\`) THEN ${thenCode}\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_repeat_times'] = function(block, generator) {
const times = generator.getFieldValue(block, 'TIMES');
const doCode = generator.statementToCode(block, 'DO').trim();
if (doCode) {
// Only the first command in the body is emitted; any others are dropped
const firstLine = doCode.split('\n')[0];
return `REPEAT (${firstLine}, ${times})\n`;
}
return '';
};
c4aGenerator.forBlock['c4a_repeat_while'] = function(block, generator) {
const condition = generator.getFieldValue(block, 'CONDITION');
const doCode = generator.statementToCode(block, 'DO').trim();
if (doCode) {
// Only the first command in the body is emitted; any others are dropped
const firstLine = doCode.split('\n')[0];
return `REPEAT (${firstLine}, \`${condition}\`)\n`;
}
return '';
};
// Variable generators
c4aGenerator.forBlock['c4a_setvar'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
const value = generator.getFieldValue(block, 'VALUE');
return `SETVAR ${name} = "${value}"\n`;
};
// Advanced generators
c4aGenerator.forBlock['c4a_eval'] = function(block, generator) {
const code = generator.getFieldValue(block, 'CODE');
return `EVAL \`${code}\`\n`;
};
c4aGenerator.forBlock['c4a_comment'] = function(block, generator) {
const text = generator.getFieldValue(block, 'TEXT');
return `# ${text}\n`;
};
// Procedure generators
c4aGenerator.forBlock['c4a_proc_def'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
const body = generator.statementToCode(block, 'BODY');
return `PROC ${name}\n${body}ENDPROC\n`;
};
c4aGenerator.forBlock['c4a_proc_call'] = function(block, generator) {
const name = generator.getFieldValue(block, 'NAME');
return `${name}\n`;
};
// Override scrub_ to handle our custom format
c4aGenerator.scrub_ = function(block, code, opt_thisOnly) {
const nextBlock = block.nextConnection && block.nextConnection.targetBlock();
let nextCode = '';
if (nextBlock) {
if (!opt_thisOnly) {
nextCode = c4aGenerator.blockToCode(nextBlock);
// Add blank line between comment and non-comment blocks
const currentIsComment = block.type === 'c4a_comment';
const nextIsComment = nextBlock.type === 'c4a_comment';
// Add blank line when transitioning from command to comment or vice versa
if (currentIsComment !== nextIsComment && code.trim() && nextCode.trim()) {
nextCode = '\n' + nextCode;
}
}
}
return code + nextCode;
};
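As an illustration of the blank-line rule in the `scrub_` override, here is a minimal sketch that applies the same comparison to plain objects instead of Blockly blocks. The `joinChain` function and the `{ type, code }` objects are assumptions standing in for a chain of connected blocks and their generated code.

```javascript
// Hypothetical stand-in for a chain of connected blocks: each entry
// carries the block type and the code already generated for it.
function joinChain(chain) {
  if (chain.length === 0) return '';
  const [current, ...rest] = chain;
  let nextCode = joinChain(rest);
  if (rest.length > 0) {
    const currentIsComment = current.type === 'c4a_comment';
    const nextIsComment = rest[0].type === 'c4a_comment';
    // Insert a blank line whenever the chain switches between
    // comment and command, in either direction.
    if (currentIsComment !== nextIsComment && current.code.trim() && nextCode.trim()) {
      nextCode = '\n' + nextCode;
    }
  }
  return current.code + nextCode;
}

const script = joinChain([
  { type: 'c4a_go', code: 'GO https://example.com\n' },
  { type: 'c4a_comment', code: '# Fill form\n' },
  { type: 'c4a_click', code: 'CLICK `#email`\n' }
]);
console.log(script);
```

Note that the rule is symmetric: a comment is separated from the command that follows it as well as from the command that precedes it.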


@@ -0,0 +1,531 @@
/* DankMono Font Faces */
@font-face {
font-family: 'DankMono';
src: url('DankMono-Regular.woff2') format('woff2');
font-weight: 400;
font-style: normal;
}
@font-face {
font-family: 'DankMono';
src: url('DankMono-Bold.woff2') format('woff2');
font-weight: 700;
font-style: normal;
}
@font-face {
font-family: 'DankMono';
src: url('DankMono-Italic.woff2') format('woff2');
font-weight: 400;
font-style: italic;
}
/* Root Variables - Matching docs theme */
:root {
--global-font-size: 14px;
--global-code-font-size: 13px;
--global-line-height: 1.5em;
--global-space: 10px;
--font-stack: DankMono, Monaco, 'Courier New', monospace;
--mono-font-stack: DankMono, Monaco, 'Courier New', monospace;
--background-color: #070708;
--font-color: #e8e9ed;
--invert-font-color: #222225;
--secondary-color: #d5cec0;
--tertiary-color: #a3abba;
--primary-color: #0fbbaa;
--error-color: #ff3c74;
--progress-bar-background: #3f3f44;
--progress-bar-fill: #09b5a5;
--code-bg-color: #3f3f44;
--block-background-color: #202020;
--header-height: 55px;
}
/* Base Styles */
* {
box-sizing: border-box;
}
body {
margin: 0;
padding: 0;
font-family: var(--font-stack);
font-size: var(--global-font-size);
line-height: var(--global-line-height);
color: var(--font-color);
background-color: var(--background-color);
}
/* Terminal Framework */
.terminal {
min-height: 100vh;
}
.container {
width: 100%;
margin: 0 auto;
}
/* Header */
.header-container {
position: fixed;
top: 0;
left: 0;
right: 0;
height: var(--header-height);
background-color: var(--background-color);
border-bottom: 1px solid var(--progress-bar-background);
z-index: 1000;
padding: 0 calc(var(--global-space) * 2);
}
.terminal-nav {
display: flex;
align-items: center;
justify-content: space-between;
height: 100%;
}
.terminal-logo h1 {
margin: 0;
font-size: 1.2em;
color: var(--primary-color);
font-weight: 400;
}
.terminal-menu ul {
list-style: none;
margin: 0;
padding: 0;
display: flex;
gap: 2em;
}
.terminal-menu a {
color: var(--secondary-color);
text-decoration: none;
transition: color 0.2s;
}
.terminal-menu a:hover,
.terminal-menu a.active {
color: var(--primary-color);
}
/* Main Container */
.main-container {
padding-top: calc(var(--header-height) + 2em);
padding-left: 2em;
padding-right: 2em;
max-width: 1400px;
margin: 0 auto;
}
/* Tutorial Grid */
.tutorial-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 2em;
align-items: start;
}
/* Terminal Cards */
.terminal-card {
background-color: var(--block-background-color);
border: 1px solid var(--progress-bar-background);
margin-bottom: 1.5em;
}
.terminal-card header {
background-color: var(--progress-bar-background);
padding: 0.8em 1em;
font-weight: 700;
color: var(--font-color);
display: flex;
justify-content: space-between;
align-items: center;
}
.terminal-card > div {
padding: 1.5em;
}
/* Editor Section */
.editor-controls {
display: flex;
gap: 0.5em;
}
.editor-container {
height: 300px;
overflow: hidden;
}
#c4a-editor {
width: 100%;
height: 100%;
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
background-color: var(--code-bg-color);
color: var(--font-color);
border: none;
padding: 1em;
resize: none;
}
/* JS Output */
.js-output-container {
max-height: 300px;
overflow-y: auto;
}
.js-output-container pre {
margin: 0;
padding: 1em;
background-color: var(--code-bg-color);
}
.js-output-container code {
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
color: var(--font-color);
white-space: pre-wrap;
}
/* Console Output */
.console-output {
font-family: var(--mono-font-stack);
font-size: var(--global-code-font-size);
max-height: 200px;
overflow-y: auto;
padding: 1em;
}
.console-line {
margin-bottom: 0.5em;
}
.console-prompt {
color: var(--primary-color);
margin-right: 0.5em;
}
.console-text {
color: var(--font-color);
}
.console-error {
color: var(--error-color);
}
.console-success {
color: var(--primary-color);
}
/* Playground */
.playground-container {
height: 600px;
background-color: #fff;
border: 1px solid var(--progress-bar-background);
}
#playground-frame {
width: 100%;
height: 100%;
border: none;
}
/* Execution Progress */
.execution-progress {
padding: 1em;
}
.progress-item {
display: flex;
align-items: center;
gap: 0.8em;
margin-bottom: 0.8em;
color: var(--secondary-color);
}
.progress-item.active {
color: var(--primary-color);
}
.progress-item.completed {
color: var(--tertiary-color);
}
.progress-item.error {
color: var(--error-color);
}
.progress-icon {
font-size: 1.2em;
}
/* Buttons */
.btn {
background-color: var(--primary-color);
color: var(--background-color);
border: none;
padding: 0.5em 1em;
font-family: var(--font-stack);
font-size: 0.9em;
cursor: pointer;
transition: all 0.2s;
}
.btn:hover {
background-color: var(--progress-bar-fill);
}
.btn-sm {
padding: 0.3em 0.8em;
font-size: 0.85em;
}
.btn-ghost {
background-color: transparent;
color: var(--secondary-color);
border: 1px solid var(--progress-bar-background);
}
.btn-ghost:hover {
background-color: var(--progress-bar-background);
color: var(--font-color);
}
/* Scrollbars */
::-webkit-scrollbar {
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track {
background: var(--block-background-color);
}
::-webkit-scrollbar-thumb {
background: var(--progress-bar-background);
}
::-webkit-scrollbar-thumb:hover {
background: var(--secondary-color);
}
/* CodeMirror Theme Override */
.CodeMirror {
font-family: var(--mono-font-stack) !important;
font-size: var(--global-code-font-size) !important;
background-color: var(--code-bg-color) !important;
color: var(--font-color) !important;
height: 100% !important;
}
.CodeMirror-gutters {
background-color: var(--progress-bar-background) !important;
border-right: 1px solid var(--progress-bar-background) !important;
}
/* Responsive */
@media (max-width: 1200px) {
.tutorial-grid {
grid-template-columns: 1fr;
}
.playground-section {
order: -1;
}
}
/* Links */
a {
color: var(--primary-color);
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
/* Lists */
ul, ol {
padding-left: 2em;
}
li {
margin-bottom: 0.5em;
}
/* Code */
code {
background-color: var(--code-bg-color);
padding: 0.2em 0.4em;
font-family: var(--mono-font-stack);
font-size: 0.9em;
}
/* Headings */
h1, h2, h3, h4, h5, h6 {
font-weight: 700;
margin-top: 1.5em;
margin-bottom: 0.8em;
}
h3 {
color: var(--primary-color);
font-size: 1.1em;
}
/* Tutorial Panel */
.tutorial-panel {
position: absolute;
top: 60px;
right: 20px;
width: 380px;
background: #1a1a1b;
border: 1px solid #2a2a2c;
border-radius: 8px;
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.5);
z-index: 1000;
transition: all 0.3s ease;
}
.tutorial-panel.hidden {
display: none;
}
.tutorial-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 16px 20px;
border-bottom: 1px solid #2a2a2c;
}
.tutorial-header h3 {
margin: 0;
color: #0fbbaa;
font-size: 18px;
}
.close-btn {
background: none;
border: none;
color: #8b8b8d;
font-size: 24px;
cursor: pointer;
padding: 0;
width: 30px;
height: 30px;
display: flex;
align-items: center;
justify-content: center;
border-radius: 4px;
transition: all 0.2s;
}
.close-btn:hover {
background: #2a2a2c;
color: #e0e0e0;
}
.tutorial-content {
padding: 20px;
}
.tutorial-content p {
margin: 0 0 16px 0;
color: #e0e0e0;
line-height: 1.6;
}
.tutorial-progress {
margin-top: 16px;
}
.tutorial-progress span {
display: block;
margin-bottom: 8px;
color: #8b8b8d;
font-size: 12px;
text-transform: uppercase;
}
.progress-bar {
height: 4px;
background: #2a2a2c;
border-radius: 2px;
overflow: hidden;
}
.progress-fill {
height: 100%;
background: #0fbbaa;
transition: width 0.3s ease;
}
.tutorial-actions {
display: flex;
gap: 12px;
padding: 0 20px 20px;
}
.tutorial-btn {
flex: 1;
padding: 10px 16px;
background: #2a2a2c;
color: #e0e0e0;
border: 1px solid #3a3a3c;
border-radius: 6px;
font-size: 14px;
cursor: pointer;
transition: all 0.2s;
}
.tutorial-btn:hover:not(:disabled) {
background: #3a3a3c;
transform: translateY(-1px);
}
.tutorial-btn:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.tutorial-btn.primary {
background: #0fbbaa;
color: #070708;
border-color: #0fbbaa;
}
.tutorial-btn.primary:hover {
background: #0da89a;
border-color: #0da89a;
}
/* Tutorial Highlights */
.tutorial-highlight {
position: relative;
animation: pulse 2s infinite;
}
@keyframes pulse {
0% {
box-shadow: 0 0 0 0 rgba(15, 187, 170, 0.4);
}
50% {
box-shadow: 0 0 0 10px rgba(15, 187, 170, 0);
}
100% {
box-shadow: 0 0 0 0 rgba(15, 187, 170, 0);
}
}
.editor-card {
position: relative;
}


@@ -0,0 +1,21 @@
# Demo: Login Flow with Blockly
# This script can be created visually using Blockly blocks
GO https://example.com/login
WAIT `#login-form` 5
# Check if already logged in
IF (EXISTS `.user-avatar`) THEN GO https://example.com/dashboard
# Fill login form
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "password123"
# Submit form
CLICK `button[type="submit"]`
WAIT `.dashboard` 10
# Success message
EVAL `console.log('Login successful!')`


@@ -0,0 +1,205 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>C4A-Script Interactive Tutorial | Crawl4AI</title>
<link rel="stylesheet" href="assets/app.css">
<link rel="stylesheet" href="assets/blockly-theme.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/codemirror.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/theme/material-darker.min.css">
</head>
<body>
<!-- Tutorial Intro Modal -->
<div id="tutorial-intro" class="tutorial-intro-modal">
<div class="intro-content">
<h2>Welcome to the C4A-Script Tutorial!</h2>
<p>C4A-Script is a simple language for web automation. This interactive tutorial will teach you how to:</p>
<ul>
<li>Handle popups and banners</li>
<li>Fill in forms and navigate between pages</li>
<li>Apply advanced automation techniques</li>
</ul>
<div class="intro-actions">
<button id="start-tutorial-btn" class="intro-btn primary">Start Tutorial</button>
<button id="skip-tutorial-btn" class="intro-btn">Skip</button>
</div>
</div>
</div>
<!-- Event Editor Modal -->
<div id="event-editor-overlay" class="modal-overlay hidden"></div>
<div id="event-editor-modal" class="event-editor-modal hidden">
<h4>Edit Event</h4>
<div class="editor-field">
<label for="edit-command-type">Command Type</label>
<select id="edit-command-type" disabled>
<option value="CLICK">CLICK</option>
<option value="DOUBLE_CLICK">DOUBLE_CLICK</option>
<option value="RIGHT_CLICK">RIGHT_CLICK</option>
<option value="TYPE">TYPE</option>
<option value="SET">SET</option>
<option value="SCROLL">SCROLL</option>
<option value="WAIT">WAIT</option>
</select>
</div>
<div id="edit-selector-field" class="editor-field">
<label>Selector</label>
<input type="text" id="edit-selector" placeholder=".class or #id">
</div>
<div id="edit-value-field" class="editor-field">
<label>Value</label>
<input type="text" id="edit-value" placeholder="Text or number">
</div>
<div id="edit-direction-field" class="editor-field hidden">
<label>Direction</label>
<select id="edit-direction">
<option value="UP">UP</option>
<option value="DOWN">DOWN</option>
<option value="LEFT">LEFT</option>
<option value="RIGHT">RIGHT</option>
</select>
</div>
<div class="editor-actions">
<button id="edit-cancel" class="mini-btn">Cancel</button>
<button id="edit-save" class="mini-btn primary">Save</button>
</div>
</div>
<!-- Main App Layout -->
<div class="app-container">
<!-- Left Panel: Editor -->
<div class="editor-panel">
<div class="panel-header">
<h2>C4A-Script Editor</h2>
<div class="header-actions">
<button id="tutorial-btn" class="action-btn" title="Tutorial">
<span class="icon">📚</span>
</button>
<button id="examples-btn" class="action-btn" title="Examples">
<span class="icon">📋</span>
</button>
<button id="clear-btn" class="action-btn" title="Clear">
<span class="icon">🗑</span>
</button>
<button id="run-btn" class="action-btn primary">
<span class="icon"></span>Run
</button>
<button id="record-btn" class="action-btn record">
<span class="icon"></span>Record
</button>
<button id="timeline-btn" class="action-btn timeline hidden" title="View Timeline">
<span class="icon">📊</span>
</button>
</div>
</div>
<div class="editor-container">
<div id="editor-view" class="editor-wrapper">
<textarea id="c4a-editor" placeholder="# Write your C4A script here..."></textarea>
</div>
<!-- Recording Timeline -->
<div id="timeline-view" class="recording-timeline hidden">
<div class="timeline-header">
<h3>Recording Timeline</h3>
<div class="timeline-actions">
<button id="back-to-editor" class="mini-btn">← Back</button>
<button id="select-all-events" class="mini-btn">Select All</button>
<button id="clear-events" class="mini-btn">Clear</button>
<button id="generate-script" class="mini-btn primary">Generate Script</button>
</div>
</div>
<div id="timeline-events" class="timeline-events">
<!-- Events will be added here dynamically -->
</div>
</div>
</div>
<!-- Bottom: Output Tabs -->
<div class="output-section">
<div class="tabs">
<button class="tab active" data-tab="console">Console</button>
<button class="tab" data-tab="javascript">Generated JS</button>
</div>
<div class="tab-content">
<div id="console-tab" class="tab-pane active">
<div id="console-output" class="console">
<div class="console-line">
<span class="console-prompt">$</span>
<span class="console-text">Ready to run C4A scripts...</span>
</div>
</div>
</div>
<div id="javascript-tab" class="tab-pane">
<div class="js-output-header">
<div class="js-actions">
<button id="copy-js-btn" class="mini-btn" title="Copy">
<span>📋</span>
</button>
<button id="edit-js-btn" class="mini-btn" title="Edit">
<span>✏️</span>
</button>
</div>
</div>
<pre id="js-output" class="js-output">// JavaScript will appear here...</pre>
</div>
</div>
</div>
</div>
<!-- Right Panel: Playground -->
<div class="playground-panel">
<div class="panel-header">
<h2>Playground</h2>
<div class="header-actions">
<button id="reset-playground" class="action-btn" title="Reset">
<span class="icon">🔄</span>
</button>
<button id="fullscreen-btn" class="action-btn" title="Fullscreen">
<span class="icon"></span>
</button>
</div>
</div>
<div class="playground-wrapper">
<iframe id="playground-frame" src="playground/" title="Playground"></iframe>
</div>
</div>
</div>
<!-- Tutorial Navigation Bar -->
<div id="tutorial-nav" class="tutorial-nav hidden">
<div class="tutorial-nav-content">
<div class="tutorial-left">
<div class="tutorial-step-title">
<span id="tutorial-step-info">Step 1 of 9</span>
<span id="tutorial-title">Welcome</span>
</div>
<p id="tutorial-description" class="tutorial-description">Let's start by waiting for the page to load.</p>
</div>
<div class="tutorial-right">
<div class="tutorial-controls">
<button id="tutorial-prev" class="nav-btn" disabled>← Previous</button>
<button id="tutorial-next" class="nav-btn primary">Next →</button>
</div>
<button id="tutorial-exit" class="exit-btn" title="Exit Tutorial">×</button>
</div>
</div>
<div class="tutorial-progress-bar">
<div id="tutorial-progress-fill" class="progress-fill"></div>
</div>
</div>
<!-- Scripts -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/codemirror.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/6.65.7/mode/javascript/javascript.min.js"></script>
<!-- Blockly -->
<script src="https://unpkg.com/blockly/blockly.min.js"></script>
<script src="assets/c4a-blocks.js"></script>
<script src="assets/c4a-generator.js"></script>
<script src="assets/blockly-manager.js"></script>
<script src="assets/app.js"></script>
</body>
</html>

@@ -0,0 +1,604 @@
// Playground App JavaScript
class PlaygroundApp {
constructor() {
this.isLoggedIn = false;
this.currentSection = 'home';
this.productsLoaded = 0;
this.maxProducts = 100;
this.tableRowsLoaded = 0; // the table body starts empty; loadMoreTableRows() adds the rows
this.inspectorMode = false;
this.tooltip = null;
this.init();
}
init() {
this.setupCookieBanner();
this.setupNewsletterPopup();
this.setupNavigation();
this.setupAuth();
this.setupProductCatalog();
this.setupForms();
this.setupTabs();
this.setupDataTable();
this.setupInspector();
this.loadInitialData();
}
// Cookie Banner
setupCookieBanner() {
const banner = document.getElementById('cookie-banner');
const acceptBtn = banner.querySelector('.accept');
const declineBtn = banner.querySelector('.decline');
acceptBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('✅ Cookies accepted');
});
declineBtn.addEventListener('click', () => {
banner.style.display = 'none';
console.log('❌ Cookies declined');
});
}
// Newsletter Popup
setupNewsletterPopup() {
const popup = document.getElementById('newsletter-popup');
const closeBtn = popup.querySelector('.close');
const subscribeBtn = popup.querySelector('.subscribe');
// Show popup after 3 seconds
setTimeout(() => {
popup.style.display = 'flex';
}, 3000);
closeBtn.addEventListener('click', () => {
popup.style.display = 'none';
});
subscribeBtn.addEventListener('click', () => {
const email = popup.querySelector('input').value;
if (email) {
console.log(`📧 Subscribed: ${email}`);
popup.style.display = 'none';
}
});
// Close on outside click
popup.addEventListener('click', (e) => {
if (e.target === popup) {
popup.style.display = 'none';
}
});
}
// Navigation
setupNavigation() {
// Dropdown entries lack the .nav-link class, so select them explicitly
const navLinks = document.querySelectorAll('.nav-link, .dropdown-content a');
const sections = document.querySelectorAll('.section');
navLinks.forEach(link => {
link.addEventListener('click', (e) => {
e.preventDefault();
const targetId = link.getAttribute('href').substring(1);
// Ignore links without a matching section (e.g. the "More" toggle with href="#")
// so that the current section is not hidden
const targetSection = targetId ? document.getElementById(targetId) : null;
if (!targetSection) return;
// Update active states
navLinks.forEach(l => l.classList.remove('active'));
link.classList.add('active');
// Show target section
sections.forEach(s => s.classList.remove('active'));
targetSection.classList.add('active');
this.currentSection = targetId;
// Load content for specific sections
this.loadSectionContent(targetId);
});
});
// Start tutorial button
const startBtn = document.getElementById('start-tutorial');
if (startBtn) {
startBtn.addEventListener('click', () => {
console.log('🚀 Tutorial started!');
alert('Tutorial started! Check the console for progress.');
});
}
}
// Authentication
setupAuth() {
const loginBtn = document.getElementById('login-btn');
const logoutBtn = document.getElementById('logout-btn');
const loginModal = document.getElementById('login-modal');
const loginForm = document.getElementById('login-form');
const closeBtn = loginModal.querySelector('.close');
loginBtn.addEventListener('click', () => {
loginModal.style.display = 'flex';
});
closeBtn.addEventListener('click', () => {
loginModal.style.display = 'none';
});
loginForm.addEventListener('submit', (e) => {
e.preventDefault();
const email = document.getElementById('email').value;
const password = document.getElementById('password').value;
const rememberMe = document.getElementById('remember-me').checked;
const messageEl = document.getElementById('login-message');
// Simple validation
if (email === 'demo@example.com' && password === 'demo123') {
this.isLoggedIn = true;
messageEl.textContent = '✅ Login successful!';
messageEl.className = 'form-message success';
setTimeout(() => {
loginModal.style.display = 'none';
document.getElementById('login-btn').style.display = 'none';
document.getElementById('user-info').style.display = 'flex';
document.getElementById('username-display').textContent = 'Demo User';
console.log(`✅ Logged in${rememberMe ? ' (remembered)' : ''}`);
}, 1000);
} else {
messageEl.textContent = '❌ Invalid credentials. Try demo@example.com / demo123';
messageEl.className = 'form-message error';
}
});
logoutBtn.addEventListener('click', () => {
this.isLoggedIn = false;
document.getElementById('login-btn').style.display = 'block';
document.getElementById('user-info').style.display = 'none';
console.log('👋 Logged out');
});
// Close modal on outside click
loginModal.addEventListener('click', (e) => {
if (e.target === loginModal) {
loginModal.style.display = 'none';
}
});
}
// Product Catalog
setupProductCatalog() {
// View toggle
const infiniteBtn = document.getElementById('infinite-scroll-btn');
const paginationBtn = document.getElementById('pagination-btn');
const infiniteView = document.getElementById('infinite-scroll-view');
const paginationView = document.getElementById('pagination-view');
infiniteBtn.addEventListener('click', () => {
infiniteBtn.classList.add('active');
paginationBtn.classList.remove('active');
infiniteView.style.display = 'block';
paginationView.style.display = 'none';
this.setupInfiniteScroll();
});
paginationBtn.addEventListener('click', () => {
paginationBtn.classList.add('active');
infiniteBtn.classList.remove('active');
paginationView.style.display = 'block';
infiniteView.style.display = 'none';
});
// Load more button
const loadMoreBtn = paginationView.querySelector('.load-more');
loadMoreBtn.addEventListener('click', () => {
this.loadMoreProducts();
});
// Collapsible filters
const collapsibles = document.querySelectorAll('.collapsible');
collapsibles.forEach(header => {
header.addEventListener('click', () => {
const content = header.nextElementSibling;
const toggle = header.querySelector('.toggle');
content.style.display = content.style.display === 'none' ? 'block' : 'none';
toggle.textContent = content.style.display === 'none' ? '▶' : '▼';
});
});
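// Infinite scroll is the default view, so attach its scroll handler up front
this.setupInfiniteScroll();
}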
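setupInfiniteScroll() {
// Bind only once: this method runs on every click of the Infinite Scroll button
if (this.infiniteScrollBound) return;
this.infiniteScrollBound = true;
const container = document.querySelector('.products-container');
const loadingIndicator = document.getElementById('loading-indicator');
container.addEventListener('scroll', () => {
// Skip while a batch is pending so one scroll burst loads a single batch
if (this.isLoadingProducts) return;
if (container.scrollTop + container.clientHeight >= container.scrollHeight - 100) {
if (this.productsLoaded < this.maxProducts) {
this.isLoadingProducts = true;
loadingIndicator.style.display = 'block';
setTimeout(() => {
this.loadMoreProducts();
loadingIndicator.style.display = 'none';
this.isLoadingProducts = false;
}, 1000);
}
}
});
}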
loadMoreProducts() {
const grid = document.getElementById('product-grid');
const batch = 10;
// Count what was actually added; the loop stops early once maxProducts is reached
let added = 0;
for (let i = 0; i < batch && this.productsLoaded < this.maxProducts; i++) {
const product = this.createProductCard(this.productsLoaded + 1);
grid.appendChild(product);
this.productsLoaded++;
added++;
}
console.log(`📦 Loaded ${added} more products. Total: ${this.productsLoaded}`);
}
createProductCard(id) {
const card = document.createElement('div');
card.className = 'product-card';
card.innerHTML = `
<div class="product-image">📦</div>
<div class="product-name">Product ${id}</div>
<div class="product-price">$${(Math.random() * 100 + 10).toFixed(2)}</div>
<button class="btn btn-sm">Quick View</button>
`;
// Quick view functionality
const quickViewBtn = card.querySelector('button');
quickViewBtn.addEventListener('click', () => {
alert(`Quick view for Product ${id}`);
});
return card;
}
// Forms
setupForms() {
// Contact Form
const contactForm = document.getElementById('contact-form');
const subjectSelect = document.getElementById('contact-subject');
const departmentGroup = document.getElementById('department-group');
const departmentSelect = document.getElementById('department');
subjectSelect.addEventListener('change', () => {
if (subjectSelect.value === 'support') {
departmentGroup.style.display = 'block';
departmentSelect.innerHTML = `
<option value="">Select department</option>
<option value="technical">Technical Support</option>
<option value="billing">Billing Support</option>
<option value="general">General Support</option>
`;
} else {
departmentGroup.style.display = 'none';
}
});
contactForm.addEventListener('submit', (e) => {
e.preventDefault();
const messageDisplay = document.getElementById('contact-message-display');
messageDisplay.textContent = '✅ Message sent successfully!';
messageDisplay.className = 'form-message success';
console.log('📧 Contact form submitted');
});
// Multi-step Form
const surveyForm = document.getElementById('survey-form');
const steps = surveyForm.querySelectorAll('.form-step');
const progressFill = document.getElementById('progress-fill');
let currentStep = 1;
surveyForm.addEventListener('click', (e) => {
if (e.target.classList.contains('next-step')) {
if (currentStep < 3) {
steps[currentStep - 1].style.display = 'none';
currentStep++;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
} else if (e.target.classList.contains('prev-step')) {
if (currentStep > 1) {
steps[currentStep - 1].style.display = 'none';
currentStep--;
steps[currentStep - 1].style.display = 'block';
progressFill.style.width = `${(currentStep / 3) * 100}%`;
}
}
});
surveyForm.addEventListener('submit', (e) => {
e.preventDefault();
document.getElementById('survey-success').style.display = 'block';
console.log('📋 Survey submitted successfully!');
});
}
// Tabs
setupTabs() {
const tabBtns = document.querySelectorAll('.tab-btn');
const tabPanes = document.querySelectorAll('.tab-pane');
tabBtns.forEach(btn => {
btn.addEventListener('click', () => {
const targetTab = btn.getAttribute('data-tab');
// Update active states
tabBtns.forEach(b => b.classList.remove('active'));
btn.classList.add('active');
// Show target pane
tabPanes.forEach(pane => {
pane.style.display = pane.id === targetTab ? 'block' : 'none';
});
});
});
// Show more functionality
const showMoreBtn = document.querySelector('.show-more');
const hiddenText = document.querySelector('.hidden-text');
if (showMoreBtn) {
showMoreBtn.addEventListener('click', () => {
if (hiddenText.style.display === 'none') {
hiddenText.style.display = 'block';
showMoreBtn.textContent = 'Show Less';
} else {
hiddenText.style.display = 'none';
showMoreBtn.textContent = 'Show More';
}
});
}
// Load comments
const loadCommentsBtn = document.querySelector('.load-comments');
const commentsSection = document.querySelector('.comments-section');
if (loadCommentsBtn) {
loadCommentsBtn.addEventListener('click', () => {
commentsSection.style.display = 'block';
commentsSection.innerHTML = `
<div class="comment">
<div class="comment-author">John Doe</div>
<div class="comment-text">Great product! Highly recommended.</div>
</div>
<div class="comment">
<div class="comment-author">Jane Smith</div>
<div class="comment-text">Excellent quality and fast shipping.</div>
</div>
`;
loadCommentsBtn.style.display = 'none';
console.log('💬 Comments loaded');
});
}
}
// Data Table
setupDataTable() {
const loadMoreBtn = document.querySelector('.load-more-rows');
const searchInput = document.querySelector('.search-input');
const exportBtn = document.getElementById('export-btn');
const sortableHeaders = document.querySelectorAll('.sortable');
// Load more rows
loadMoreBtn.addEventListener('click', () => {
this.loadMoreTableRows();
});
// Search functionality
searchInput.addEventListener('input', (e) => {
const searchTerm = e.target.value.toLowerCase();
const rows = document.querySelectorAll('#table-body tr');
rows.forEach(row => {
const text = row.textContent.toLowerCase();
row.style.display = text.includes(searchTerm) ? '' : 'none';
});
});
// Export functionality
exportBtn.addEventListener('click', () => {
console.log('📊 Exporting table data...');
alert('Table data exported! (Check console)');
});
// Sorting
sortableHeaders.forEach(header => {
header.addEventListener('click', () => {
console.log(`🔄 Sorting by ${header.getAttribute('data-sort')}`);
});
});
}
loadMoreTableRows() {
const tbody = document.getElementById('table-body');
const batch = 10;
for (let i = 0; i < batch; i++) {
const row = document.createElement('tr');
const id = this.tableRowsLoaded + i + 1;
row.innerHTML = `
<td>User ${id}</td>
<td>user${id}@example.com</td>
<td>${new Date().toLocaleDateString()}</td>
<td><button class="btn btn-sm">Edit</button></td>
`;
tbody.appendChild(row);
}
this.tableRowsLoaded += batch;
console.log(`📄 Loaded ${batch} more rows. Total: ${this.tableRowsLoaded}`);
}
// Load initial data
loadInitialData() {
// Load initial products
this.loadMoreProducts();
// Load initial table rows
this.loadMoreTableRows();
}
// Load content when navigating to sections
loadSectionContent(sectionId) {
switch(sectionId) {
case 'catalog':
// Ensure products are loaded in catalog
if (this.productsLoaded === 0) {
this.loadMoreProducts();
}
break;
case 'data-tables':
// Ensure table rows are loaded
if (this.tableRowsLoaded === 0) {
this.loadMoreTableRows();
}
break;
case 'forms':
// Forms are already set up
break;
case 'tabs':
// Tabs content is static
break;
}
}
// Inspector Mode
setupInspector() {
const inspectorBtn = document.getElementById('inspector-btn');
// Create tooltip element
this.tooltip = document.createElement('div');
this.tooltip.className = 'inspector-tooltip';
this.tooltip.style.cssText = `
position: fixed;
background: rgba(0, 0, 0, 0.9);
color: white;
padding: 8px 12px;
border-radius: 4px;
font-size: 12px;
font-family: monospace;
pointer-events: none;
z-index: 10000;
display: none;
max-width: 300px;
`;
document.body.appendChild(this.tooltip);
inspectorBtn.addEventListener('click', () => {
this.toggleInspector();
});
// Add mouse event listeners
document.addEventListener('mousemove', this.handleMouseMove.bind(this));
document.addEventListener('mouseout', this.handleMouseOut.bind(this));
}
toggleInspector() {
this.inspectorMode = !this.inspectorMode;
const inspectorBtn = document.getElementById('inspector-btn');
if (this.inspectorMode) {
inspectorBtn.classList.add('active');
inspectorBtn.style.background = '#0fbbaa';
document.body.style.cursor = 'crosshair';
} else {
inspectorBtn.classList.remove('active');
inspectorBtn.style.background = '';
document.body.style.cursor = '';
this.tooltip.style.display = 'none';
this.removeHighlight();
}
}
handleMouseMove(e) {
if (!this.inspectorMode) return;
const element = e.target;
if (element === this.tooltip) return;
// Highlight element
this.highlightElement(element);
// Show tooltip with element info
const info = this.getElementInfo(element);
this.tooltip.innerHTML = info;
this.tooltip.style.display = 'block';
// Position tooltip
const x = e.clientX + 15;
const y = e.clientY + 15;
// Adjust position if tooltip would go off screen
const rect = this.tooltip.getBoundingClientRect();
const adjustedX = x + rect.width > window.innerWidth ? x - rect.width - 30 : x;
const adjustedY = y + rect.height > window.innerHeight ? y - rect.height - 30 : y;
this.tooltip.style.left = adjustedX + 'px';
this.tooltip.style.top = adjustedY + 'px';
}
handleMouseOut(e) {
if (!this.inspectorMode) return;
// relatedTarget is null when the pointer leaves the document entirely
if (!e.relatedTarget) {
this.removeHighlight();
this.tooltip.style.display = 'none';
}
}
highlightElement(element) {
this.removeHighlight();
element.style.outline = '2px solid #0fbbaa';
element.style.outlineOffset = '1px';
element.setAttribute('data-inspector-highlighted', 'true');
}
removeHighlight() {
const highlighted = document.querySelector('[data-inspector-highlighted]');
if (highlighted) {
highlighted.style.outline = '';
highlighted.style.outlineOffset = '';
highlighted.removeAttribute('data-inspector-highlighted');
}
}
getElementInfo(element) {
const tagName = element.tagName.toLowerCase();
const id = element.id ? `#${element.id}` : '';
// classList also works for SVG elements, where className is not a plain string
const classNames = Array.from(element.classList);
const classes = classNames.length ? `.${classNames.join('.')}` : '';
let selector = tagName;
if (id) {
selector = id;
} else if (classes) {
selector = `${tagName}${classes}`;
}
// Build info HTML
let info = `<strong>${selector}</strong>`;
// Add additional attributes
const attrs = [];
if (element.name) attrs.push(`name="${element.name}"`);
if (element.type) attrs.push(`type="${element.type}"`);
if (element.href) attrs.push(`href="${element.href}"`);
if (element.value && element.tagName === 'INPUT') attrs.push(`value="${element.value}"`);
if (attrs.length > 0) {
info += `<br><span style="color: #888;">${attrs.join(' ')}</span>`;
}
return info;
}
}
// Initialize app when DOM is ready
document.addEventListener('DOMContentLoaded', () => {
window.playgroundApp = new PlaygroundApp();
console.log('🎮 Playground app initialized!');
});
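
Review note on `getElementInfo` in the playground script above: attribute values such as `element.value` are interpolated into a string that is then assigned to `tooltip.innerHTML`, so markup typed into an input would be parsed when that input is hovered in inspector mode. A minimal sketch of one way to guard against that, assuming a small helper; the name `escapeHtml` is hypothetical and not part of the commit.

```javascript
// Hypothetical helper, not part of the commit: escapes the five characters
// that are significant in HTML text and attribute values.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Example: build one attribute fragment the way getElementInfo does
const fragment = `value="${escapeHtml('<img src=x onerror="alert(1)">')}"`;
console.log(fragment);
```

With such a helper, each `attrs.push(...)` call could wrap its interpolated value in `escapeHtml(...)` before the result reaches `innerHTML`.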

@@ -0,0 +1,328 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>C4A-Script Playground</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<!-- Cookie Banner -->
<div class="cookie-banner" id="cookie-banner">
<div class="cookie-content">
<p>🍪 We use cookies to enhance your experience. By continuing, you agree to our cookie policy.</p>
<div class="cookie-actions">
<button class="btn accept">Accept All</button>
<button class="btn btn-secondary decline">Decline</button>
</div>
</div>
</div>
<!-- Newsletter Popup (appears after 3 seconds) -->
<div class="modal" id="newsletter-popup" style="display: none;">
<div class="modal-content">
<span class="close">&times;</span>
<h2>📬 Subscribe to Our Newsletter</h2>
<p>Get the latest updates on web automation!</p>
<input type="email" placeholder="Enter your email" class="input">
<button class="btn subscribe">Subscribe</button>
</div>
</div>
<!-- Header -->
<header class="site-header">
<nav class="nav-menu">
<a href="#home" class="nav-link active">Home</a>
<a href="#catalog" class="nav-link" id="catalog-link">Products</a>
<a href="#forms" class="nav-link">Forms</a>
<a href="#data-tables" class="nav-link">Data Tables</a>
<div class="dropdown">
<a href="#" class="nav-link dropdown-toggle">More ▼</a>
<div class="dropdown-content">
<a href="#tabs">Tabs Demo</a>
<a href="#accordion">FAQ</a>
<a href="#gallery">Gallery</a>
</div>
</div>
</nav>
<div class="auth-section">
<button class="btn btn-sm" id="inspector-btn" title="Toggle Inspector">🔍</button>
<button class="btn btn-sm" id="login-btn">Login</button>
<div class="user-info" id="user-info" style="display: none;">
<span class="user-avatar">👤</span>
<span class="welcome-message">Welcome, <span id="username-display">User</span>!</span>
<button class="btn btn-sm btn-secondary" id="logout-btn">Logout</button>
</div>
</div>
</header>
<!-- Main Content -->
<main class="main-content">
<!-- Home Section -->
<section id="home" class="section active">
<h1>Welcome to C4A-Script Playground</h1>
<p>This is an interactive demo for testing C4A-Script commands. Each section contains different challenges for web automation.</p>
<button class="btn btn-primary" id="start-tutorial">Start Tutorial</button>
<div class="feature-grid">
<div class="feature-card">
<h3>🔐 Authentication</h3>
<p>Test login forms and user sessions</p>
</div>
<div class="feature-card">
<h3>📜 Dynamic Content</h3>
<p>Infinite scroll and pagination</p>
</div>
<div class="feature-card">
<h3>📝 Forms</h3>
<p>Complex form interactions</p>
</div>
<div class="feature-card">
<h3>📊 Data Tables</h3>
<p>Sortable and filterable data</p>
</div>
</div>
</section>
<!-- Login Modal -->
<div class="modal" id="login-modal" style="display: none;">
<div class="modal-content login-form">
<span class="close">&times;</span>
<h2>Login</h2>
<form id="login-form">
<div class="form-group">
<label>Email</label>
<input type="email" id="email" class="input" placeholder="demo@example.com">
</div>
<div class="form-group">
<label>Password</label>
<input type="password" id="password" class="input" placeholder="demo123">
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="remember-me">
Remember me
</label>
</div>
<button type="submit" class="btn btn-primary">Login</button>
<div class="form-message" id="login-message"></div>
</form>
</div>
</div>
<!-- Product Catalog Section -->
<section id="catalog" class="section">
<h1>Product Catalog</h1>
<div class="view-toggle">
<button class="btn btn-sm active" id="infinite-scroll-btn">Infinite Scroll</button>
<button class="btn btn-sm" id="pagination-btn">Pagination</button>
</div>
<!-- Filters Sidebar -->
<div class="catalog-layout">
<aside class="filters-sidebar">
<h3>Filters</h3>
<div class="filter-group">
<h4 class="collapsible">Category <span class="toggle"></span></h4>
<div class="filter-content">
<label><input type="checkbox"> Electronics</label>
<label><input type="checkbox"> Clothing</label>
<label><input type="checkbox"> Books</label>
</div>
</div>
<div class="filter-group">
<h4 class="collapsible">Price Range <span class="toggle"></span></h4>
<div class="filter-content">
<input type="range" min="0" max="1000" value="500">
<span>$0 - $500</span>
</div>
</div>
</aside>
<!-- Products Grid -->
<div class="products-container">
<div class="product-grid" id="product-grid">
<!-- Products will be loaded here -->
</div>
<!-- Infinite Scroll View -->
<div id="infinite-scroll-view" class="view-mode">
<div class="loading-indicator" id="loading-indicator" style="display: none;">
<div class="spinner"></div>
<p>Loading more products...</p>
</div>
</div>
<!-- Pagination View -->
<div id="pagination-view" class="view-mode" style="display: none;">
<button class="btn load-more">Load More</button>
<div class="pagination">
<button class="page-btn">1</button>
<button class="page-btn">2</button>
<button class="page-btn">3</button>
</div>
</div>
</div>
</div>
</section>
<!-- Forms Section -->
<section id="forms" class="section">
<h1>Form Examples</h1>
<!-- Contact Form -->
<div class="form-card">
<h2>Contact Form</h2>
<form id="contact-form">
<div class="form-group">
<label>Name</label>
<input type="text" class="input" id="contact-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="contact-email">
</div>
<div class="form-group">
<label>Subject</label>
<select class="input" id="contact-subject">
<option value="">Select a subject</option>
<option value="support">Support</option>
<option value="sales">Sales</option>
<option value="feedback">Feedback</option>
</select>
</div>
<div class="form-group" id="department-group" style="display: none;">
<label>Department</label>
<select class="input" id="department">
<option value="">Select department</option>
</select>
</div>
<div class="form-group">
<label>Message</label>
<textarea class="input" id="contact-message" rows="4"></textarea>
</div>
<button type="submit" class="btn btn-primary">Send Message</button>
<div class="form-message" id="contact-message-display"></div>
</form>
</div>
<!-- Multi-step Form -->
<div class="form-card">
<h2>Multi-step Survey</h2>
<div class="progress-bar">
<div class="progress-fill" id="progress-fill" style="width: 33%"></div>
</div>
<form id="survey-form">
<!-- Step 1 -->
<div class="form-step active" data-step="1">
<h3>Step 1: Basic Information</h3>
<div class="form-group">
<label>Full Name</label>
<input type="text" class="input" id="full-name">
</div>
<div class="form-group">
<label>Email</label>
<input type="email" class="input" id="survey-email">
</div>
<button type="button" class="btn next-step">Next</button>
</div>
<!-- Step 2 -->
<div class="form-step" data-step="2" style="display: none;">
<h3>Step 2: Preferences</h3>
<div class="form-group">
<label>Interests (select multiple)</label>
<select multiple class="input" id="interests">
<option value="tech">Technology</option>
<option value="sports">Sports</option>
<option value="music">Music</option>
<option value="travel">Travel</option>
</select>
</div>
<button type="button" class="btn prev-step">Previous</button>
<button type="button" class="btn next-step">Next</button>
</div>
<!-- Step 3 -->
<div class="form-step" data-step="3" style="display: none;">
<h3>Step 3: Confirmation</h3>
<p>Please review your information and submit.</p>
<button type="button" class="btn prev-step">Previous</button>
<button type="submit" class="btn btn-primary" id="submit-survey">Submit Survey</button>
</div>
</form>
<div class="form-message success-message" id="survey-success" style="display: none;">
✅ Survey submitted successfully!
</div>
</div>
</section>
<!-- Tabs Section -->
<section id="tabs" class="section">
<h1>Tabs Demo</h1>
<div class="tabs-container">
<div class="tabs-header">
<button class="tab-btn active" data-tab="description">Description</button>
<button class="tab-btn" data-tab="reviews">Reviews</button>
<button class="tab-btn" data-tab="specs">Specifications</button>
</div>
<div class="tabs-content">
<div class="tab-pane active" id="description">
<h3>Product Description</h3>
<p>This is a detailed description of the product...</p>
<div class="expandable-text">
<p class="text-preview">Lorem ipsum dolor sit amet, consectetur adipiscing elit...</p>
<button class="btn btn-sm show-more">Show More</button>
<div class="hidden-text" style="display: none;">
<p>This is the hidden text that appears when you click "Show More". It contains additional details about the product that weren't visible initially.</p>
</div>
</div>
</div>
<div class="tab-pane" id="reviews" style="display: none;">
<h3>Customer Reviews</h3>
<button class="btn btn-sm load-comments">Load Comments</button>
<div class="comments-section" style="display: none;">
<!-- Comments will be loaded here -->
</div>
</div>
<div class="tab-pane" id="specs" style="display: none;">
<h3>Technical Specifications</h3>
<table class="specs-table">
<tr><td>Model</td><td>XYZ-2000</td></tr>
<tr><td>Weight</td><td>2.5 kg</td></tr>
<tr><td>Dimensions</td><td>30 x 20 x 10 cm</td></tr>
</table>
</div>
</div>
</div>
</section>
<!-- Data Tables Section -->
<section id="data-tables" class="section">
<h1>Data Tables</h1>
<div class="table-controls">
<input type="text" class="input search-input" placeholder="Search...">
<button class="btn btn-sm" id="export-btn">Export</button>
</div>
<table class="data-table" id="data-table">
<thead>
<tr>
<th class="sortable" data-sort="name">Name ↕</th>
<th class="sortable" data-sort="email">Email ↕</th>
<th class="sortable" data-sort="date">Date ↕</th>
<th>Actions</th>
</tr>
</thead>
<tbody id="table-body">
<!-- Table rows will be loaded here -->
</tbody>
</table>
<button class="btn load-more-rows">Load More Rows</button>
</section>
</main>
<script src="app.js"></script>
</body>
</html>

@@ -0,0 +1,627 @@
/* Playground Styles - Modern Web App Theme */
:root {
--primary-color: #0fbbaa;
--secondary-color: #3f3f44;
--background-color: #ffffff;
--text-color: #333333;
--border-color: #e0e0e0;
--error-color: #ff3c74;
--success-color: #0fbbaa;
--warning-color: #ffa500;
}
* {
box-sizing: border-box;
}
body {
margin: 0;
padding: 0;
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
font-size: 16px;
line-height: 1.6;
color: var(--text-color);
background-color: var(--background-color);
}
/* Cookie Banner */
.cookie-banner {
position: fixed;
bottom: 0;
left: 0;
right: 0;
background-color: #2c3e50;
color: white;
padding: 1rem;
z-index: 1000;
box-shadow: 0 -2px 10px rgba(0,0,0,0.1);
}
.cookie-content {
max-width: 1200px;
margin: 0 auto;
display: flex;
align-items: center;
justify-content: space-between;
flex-wrap: wrap;
gap: 1rem;
}
.cookie-actions {
display: flex;
gap: 0.5rem;
}
/* Header */
.site-header {
background-color: #fff;
border-bottom: 1px solid var(--border-color);
padding: 1rem 2rem;
position: sticky;
top: 0;
z-index: 100;
display: flex;
justify-content: space-between;
align-items: center;
}
.nav-menu {
display: flex;
gap: 2rem;
align-items: center;
}
.nav-link {
text-decoration: none;
color: var(--text-color);
font-weight: 500;
transition: color 0.2s;
}
.nav-link:hover,
.nav-link.active {
color: var(--primary-color);
}
/* Dropdown */
.dropdown {
position: relative;
}
.dropdown-content {
display: none;
position: absolute;
background-color: white;
min-width: 160px;
box-shadow: 0 8px 16px rgba(0,0,0,0.1);
z-index: 1;
border-radius: 4px;
top: 100%;
margin-top: 0.5rem;
}
.dropdown:hover .dropdown-content {
display: block;
}
.dropdown-content a {
color: var(--text-color);
padding: 0.75rem 1rem;
text-decoration: none;
display: block;
}
.dropdown-content a:hover {
background-color: #f5f5f5;
}
/* Auth Section */
.auth-section {
display: flex;
align-items: center;
gap: 1rem;
}
.user-info {
display: flex;
align-items: center;
gap: 0.5rem;
}
.user-avatar {
font-size: 1.5rem;
}
/* Main Content */
.main-content {
padding: 2rem;
max-width: 1200px;
margin: 0 auto;
}
.section {
display: none;
}
.section.active {
display: block;
}
/* Buttons */
.btn {
background-color: var(--primary-color);
color: white;
border: none;
padding: 0.5rem 1rem;
border-radius: 4px;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
transition: all 0.2s;
}
.btn:hover {
background-color: #0aa599;
transform: translateY(-1px);
}
.btn-sm {
padding: 0.25rem 0.75rem;
font-size: 0.875rem;
}
.btn-secondary {
background-color: var(--secondary-color);
}
.btn-secondary:hover {
background-color: #333;
}
.btn-primary {
background-color: var(--primary-color);
}
/* Feature Grid */
.feature-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 1.5rem;
margin-top: 2rem;
}
.feature-card {
background-color: #f8f9fa;
padding: 1.5rem;
border-radius: 8px;
text-align: center;
transition: transform 0.2s;
}
.feature-card:hover {
transform: translateY(-4px);
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}
.feature-card h3 {
margin-top: 0;
}
/* Modal */
.modal {
position: fixed;
z-index: 1000;
left: 0;
top: 0;
width: 100%;
height: 100%;
background-color: rgba(0,0,0,0.5);
display: flex;
align-items: center;
justify-content: center;
}
.modal-content {
background-color: white;
padding: 2rem;
border-radius: 8px;
max-width: 500px;
width: 90%;
position: relative;
animation: modalFadeIn 0.3s;
}
@keyframes modalFadeIn {
from { opacity: 0; transform: translateY(-20px); }
to { opacity: 1; transform: translateY(0); }
}
.close {
position: absolute;
right: 1rem;
top: 1rem;
font-size: 1.5rem;
cursor: pointer;
color: #999;
}
.close:hover {
color: #333;
}
/* Forms */
.form-group {
margin-bottom: 1rem;
}
.form-group label {
display: block;
margin-bottom: 0.5rem;
font-weight: 500;
}
.input {
width: 100%;
padding: 0.5rem;
border: 1px solid var(--border-color);
border-radius: 4px;
font-size: 1rem;
}
.input:focus {
outline: none;
border-color: var(--primary-color);
}
.checkbox-label {
display: flex;
align-items: center;
gap: 0.5rem;
}
.form-message {
margin-top: 1rem;
padding: 0.75rem;
border-radius: 4px;
display: none;
}
.form-message.error {
background-color: #ffe6e6;
color: var(--error-color);
display: block;
}
.form-message.success {
background-color: #e6fff6;
color: var(--success-color);
display: block;
}
/* Product Catalog */
.view-toggle {
margin-bottom: 1rem;
}
.catalog-layout {
display: grid;
grid-template-columns: 250px 1fr;
gap: 2rem;
}
.filters-sidebar {
background-color: #f8f9fa;
padding: 1rem;
border-radius: 8px;
}
.filter-group {
margin-bottom: 1.5rem;
}
.collapsible {
cursor: pointer;
display: flex;
justify-content: space-between;
align-items: center;
}
.filter-content {
margin-top: 0.5rem;
}
.filter-content label {
display: block;
margin-bottom: 0.5rem;
}
/* Product Grid */
.product-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
gap: 1.5rem;
}
.product-card {
background-color: white;
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 1rem;
text-align: center;
transition: transform 0.2s;
}
.product-card:hover {
transform: translateY(-4px);
box-shadow: 0 4px 12px rgba(0,0,0,0.1);
}
.product-image {
width: 100%;
height: 150px;
background-color: #f0f0f0;
margin-bottom: 1rem;
display: flex;
align-items: center;
justify-content: center;
font-size: 3rem;
}
.product-name {
font-weight: 600;
margin-bottom: 0.5rem;
}
.product-price {
color: var(--primary-color);
font-size: 1.2rem;
font-weight: 700;
}
/* Loading Indicator */
.loading-indicator {
text-align: center;
padding: 2rem;
}
.spinner {
border: 3px solid #f3f3f3;
border-top: 3px solid var(--primary-color);
border-radius: 50%;
width: 40px;
height: 40px;
animation: spin 1s linear infinite;
margin: 0 auto;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
/* Pagination */
.pagination {
display: flex;
gap: 0.5rem;
justify-content: center;
margin-top: 2rem;
}
.page-btn {
padding: 0.5rem 1rem;
border: 1px solid var(--border-color);
background-color: white;
cursor: pointer;
border-radius: 4px;
}
.page-btn:hover,
.page-btn.active {
background-color: var(--primary-color);
color: white;
}
/* Multi-step Form */
.progress-bar {
width: 100%;
height: 8px;
background-color: #e0e0e0;
border-radius: 4px;
margin-bottom: 2rem;
}
.progress-fill {
height: 100%;
background-color: var(--primary-color);
border-radius: 4px;
transition: width 0.3s;
}
.form-step {
display: none;
}
.form-step.active {
display: block;
}
/* Tabs */
.tabs-container {
margin-top: 2rem;
}
.tabs-header {
display: flex;
border-bottom: 2px solid var(--border-color);
}
.tab-btn {
background: none;
border: none;
padding: 1rem 2rem;
cursor: pointer;
font-size: 1rem;
font-weight: 500;
color: var(--text-color);
position: relative;
}
.tab-btn:hover {
color: var(--primary-color);
}
.tab-btn.active {
color: var(--primary-color);
}
.tab-btn.active::after {
content: '';
position: absolute;
bottom: -2px;
left: 0;
right: 0;
height: 2px;
background-color: var(--primary-color);
}
.tabs-content {
padding: 2rem 0;
}
.tab-pane {
display: none;
}
.tab-pane.active {
display: block;
}
/* Expandable Text */
.expandable-text {
margin-top: 1rem;
}
.text-preview {
margin-bottom: 0.5rem;
}
.show-more {
margin-top: 0.5rem;
}
/* Comments Section */
.comments-section {
margin-top: 1rem;
}
.comment {
background-color: #f8f9fa;
padding: 1rem;
border-radius: 4px;
margin-bottom: 1rem;
}
.comment-author {
font-weight: 600;
margin-bottom: 0.5rem;
}
/* Data Table */
.table-controls {
display: flex;
gap: 1rem;
margin-bottom: 1rem;
}
.search-input {
flex: 1;
max-width: 300px;
}
.data-table {
width: 100%;
border-collapse: collapse;
background-color: white;
}
.data-table th,
.data-table td {
padding: 0.75rem;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
.data-table th {
background-color: #f8f9fa;
font-weight: 600;
}
.sortable {
cursor: pointer;
}
.sortable:hover {
color: var(--primary-color);
}
/* Form Cards */
.form-card {
background-color: white;
border: 1px solid var(--border-color);
border-radius: 8px;
padding: 2rem;
margin-bottom: 2rem;
}
.form-card h2 {
margin-top: 0;
}
/* Success Message */
.success-message {
background-color: #e6fff6;
color: var(--success-color);
padding: 1rem;
border-radius: 4px;
text-align: center;
font-weight: 500;
}
/* Load More Button */
.load-more,
.load-more-rows {
display: block;
margin: 2rem auto;
}
/* Responsive */
@media (max-width: 768px) {
.catalog-layout {
grid-template-columns: 1fr;
}
.feature-grid {
grid-template-columns: 1fr;
}
.nav-menu {
flex-wrap: wrap;
gap: 1rem;
}
.cookie-content {
flex-direction: column;
text-align: center;
}
}
/* Inspector Mode */
#inspector-btn.active {
background: var(--primary-color) !important;
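color: var(--bg-primary, var(--background-color)) !important; /* --bg-primary is not defined in this stylesheet, so fall back to --background-color */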
}
.inspector-tooltip {
box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3);
border: 1px solid rgba(255, 255, 255, 0.1);
}


@@ -0,0 +1,2 @@
flask>=2.3.0
flask-cors>=4.0.0


@@ -0,0 +1,18 @@
# Basic Page Interaction
# This script demonstrates basic C4A commands
# Navigate to the playground
GO http://127.0.0.1:8080/playground/
# Wait for page to load
WAIT `body` 2
# Handle cookie banner if present
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
# Close newsletter popup if it appears
WAIT 3
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`
# Click the start tutorial button
CLICK `#start-tutorial`


@@ -0,0 +1,27 @@
# Complete Login Flow
# Demonstrates form interaction and authentication
# Click login button
CLICK `#login-btn`
# Wait for login modal
WAIT `.login-form` 3
# Fill in credentials
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
# Check remember me
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`
# Submit form
CLICK `button[type="submit"]`
# Wait for success
WAIT `.welcome-message` 5
# Verify login succeeded
IF (EXISTS `.user-info`) THEN EVAL `console.log('✅ Login successful!')`


@@ -0,0 +1,32 @@
# Infinite Scroll Product Loading
# Load all products using scroll automation
# Navigate to catalog
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Switch to infinite scroll mode
CLICK `#infinite-scroll-btn`
# Define scroll procedure
PROC load_more_products
# Get current product count
EVAL `window.initialCount = document.querySelectorAll('.product-card').length`
# Scroll down
SCROLL DOWN 1000
WAIT 2
# Check if more products loaded
EVAL `
const newCount = document.querySelectorAll('.product-card').length;
console.log('Products loaded: ' + newCount);
window.moreLoaded = newCount > window.initialCount;
`
ENDPROC
# Keep loading until a scroll adds no new products
REPEAT (load_more_products, `window.moreLoaded !== false`)
# Final count
EVAL `console.log('✅ Total products: ' + document.querySelectorAll('.product-card').length)`
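The stop condition used by the script above (scroll, recount, repeat while the count still grows) can be sketched in Python. This is only an illustration of the loop logic, not how C4A-Script executes `REPEAT`; `scroll_and_count` is a hypothetical callable standing in for "scroll once and return the number of `.product-card` elements", and `max_rounds` is an assumed safety cap that the script itself does not have.

```python
from typing import Callable


def load_until_stable(scroll_and_count: Callable[[], int], max_rounds: int = 50) -> int:
    """Call scroll_and_count until the item count stops growing.

    scroll_and_count is a hypothetical hook: it should scroll the page once
    and return the current number of loaded items.
    """
    previous = -1
    current = scroll_and_count()
    rounds = 1
    # Continue only while the last scroll produced new items,
    # and never exceed max_rounds calls in total.
    while current > previous and rounds < max_rounds:
        previous = current
        current = scroll_and_count()
        rounds += 1
    return current


if __name__ == "__main__":
    # Simulated page: three scrolls add items, the fourth adds nothing.
    simulated = iter([10, 20, 30, 30])
    print(load_until_stable(lambda: next(simulated)))
```

The cap matters on pages that never stop loading; without it the loop would run for as long as new items keep appearing.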


@@ -0,0 +1,41 @@
# Multi-step Form Wizard
# Complete a complex form with multiple steps
# Navigate to forms section
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2
# Step 1: Basic Information
CLICK `#full-name`
TYPE "John Doe"
CLICK `#survey-email`
TYPE "john.doe@example.com"
# Go to next step
CLICK `.next-step`
WAIT 1
# Step 2: Select Interests
# Select multiple options
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `option[value="travel"]`
# Continue to final step
CLICK `.next-step`
WAIT 1
# Step 3: Review and Submit
# Verify we're on the last step
IF (EXISTS `#submit-survey`) THEN EVAL `console.log('📋 On final step')`
# Submit the form
CLICK `#submit-survey`
# Wait for success message
WAIT `.success-message` 5
# Verify submission
IF (EXISTS `.success-message`) THEN EVAL `console.log('✅ Survey submitted successfully!')`


@@ -0,0 +1,82 @@
# Complete E-commerce Workflow
# Login, browse products, and interact with various elements
# Define reusable procedures
PROC handle_popups
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`
ENDPROC
PROC login_user
CLICK `#login-btn`
WAIT `.login-form` 2
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
CLICK `button[type="submit"]`
WAIT `.welcome-message` 5
ENDPROC
PROC browse_products
# Go to catalog
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Apply filters
CLICK `.collapsible`
WAIT 0.5
CLICK `input[type="checkbox"]`
# Load some products
SCROLL DOWN 500
WAIT 1
SCROLL DOWN 500
WAIT 1
ENDPROC
# Main workflow
GO http://127.0.0.1:8080/playground/
WAIT `body` 2
# Handle initial popups
handle_popups
# Log in if not already logged in
IF (NOT EXISTS `.user-info`) THEN login_user
# Browse products
browse_products
# Navigate to tabs demo
CLICK `a[href="#tabs"]`
WAIT `.tabs-container` 2
# Interact with tabs
CLICK `button[data-tab="reviews"]`
WAIT 1
# Load comments
IF (EXISTS `.load-comments`) THEN CLICK `.load-comments`
WAIT `.comments-section` 2
# Check specifications
CLICK `button[data-tab="specs"]`
WAIT 1
# Final navigation to data tables
CLICK `a[href="#data"]`
WAIT `.data-table` 2
# Search in table
CLICK `.search-input`
TYPE "User"
# Load more rows
CLICK `.load-more-rows`
WAIT 1
# Export data
CLICK `#export-btn`
EVAL `console.log('✅ Workflow completed successfully!')`
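The workflow above uses two shapes of one-line conditional: `IF (EXISTS ...) THEN ...` and `IF (NOT EXISTS ...) THEN ...`. As a rough sketch of how such a line could be taken apart, the following uses a regular expression. It is an assumption-laden illustration, not the real C4A grammar or parser: the `Conditional` type, the pattern, and `parse_conditional` are all hypothetical names introduced here.

```python
import re
from typing import NamedTuple, Optional


class Conditional(NamedTuple):
    negated: bool   # True for "NOT EXISTS"
    selector: str   # CSS selector between the backticks
    action: str     # everything after THEN, left unparsed


# Hypothetical pattern covering only the two forms seen in the scripts above.
_IF_PATTERN = re.compile(
    r"^IF\s+\(\s*(NOT\s+)?EXISTS\s+`([^`]+)`\s*\)\s+THEN\s+(.+)$"
)


def parse_conditional(line: str) -> Optional[Conditional]:
    """Return the parts of a one-line IF, or None if the line has another shape."""
    match = _IF_PATTERN.match(line.strip())
    if match is None:
        return None
    return Conditional(bool(match.group(1)), match.group(2), match.group(3).strip())


if __name__ == "__main__":
    print(parse_conditional("IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`"))
    print(parse_conditional("IF (NOT EXISTS `.user-info`) THEN login_user"))
```

Leaving the action unparsed keeps the sketch small; a fuller parser would hand that remainder back to the general command parser, since the action may be a command such as `CLICK` or a procedure name such as `login_user`.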


@@ -0,0 +1,304 @@
#!/usr/bin/env python3
"""
C4A-Script Tutorial Server
Serves the tutorial app and provides C4A compilation API
"""
import sys
import os
from pathlib import Path
from flask import Flask, request, jsonify, send_from_directory
from flask_cors import CORS
# Add parent directories to path to import crawl4ai
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent.parent))
try:
from crawl4ai.script import compile as c4a_compile
C4A_AVAILABLE = True
except ImportError:
print("⚠️ C4A compiler not available. Using mock compiler.")
C4A_AVAILABLE = False
app = Flask(__name__)
CORS(app)
# Serve static files
@app.route('/')
def index():
return send_from_directory('.', 'index.html')
@app.route('/assets/<path:path>')
def serve_assets(path):
return send_from_directory('assets', path)
@app.route('/playground/')
def playground():
return send_from_directory('playground', 'index.html')
@app.route('/playground/<path:path>')
def serve_playground(path):
return send_from_directory('playground', path)
# API endpoint for C4A compilation
@app.route('/api/compile', methods=['POST'])
def compile_endpoint():
try:
# silent=True returns None instead of raising on a missing or invalid JSON body
data = request.get_json(silent=True) or {}
script = data.get('script', '')
if not script:
return jsonify({
'success': False,
'error': {
'line': 1,
'column': 1,
'message': 'No script provided',
'suggestion': 'Write some C4A commands'
}
})
if C4A_AVAILABLE:
# Use real C4A compiler
result = c4a_compile(script)
if result.success:
return jsonify({
'success': True,
'jsCode': result.js_code,
'metadata': {
'lineCount': len(result.js_code),
'sourceLines': len(script.split('\n'))
}
})
else:
error = result.first_error
return jsonify({
'success': False,
'error': {
'line': error.line,
'column': error.column,
'message': error.message,
'suggestion': error.suggestions[0].message if error.suggestions else None,
'code': error.code,
'sourceLine': error.source_line
}
})
else:
# Use mock compiler for demo
result = mock_compile(script)
return jsonify(result)
except Exception as e:
return jsonify({
'success': False,
'error': {
'line': 1,
'column': 1,
'message': f'Server error: {str(e)}',
'suggestion': 'Check server logs'
}
}), 500
def mock_compile(script):
"""Simple mock compiler for demo when C4A is not available"""
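# Pair each command with its 1-based source line so errors point at the user's script
numbered = [(n, raw.strip()) for n, raw in enumerate(script.split('\n'), start=1)]
lines = [(n, text) for n, text in numbered if text and not text.startswith('#')]
js_code = []
for line_no, line in lines:
i = line_no - 1  # the error handlers below report i + 1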
try:
if line.startswith('GO '):
url = line[3:].strip()
# Handle relative URLs
if not url.startswith(('http://', 'https://')):
url = '/' + url.lstrip('/')
js_code.append(f"await page.goto('{url}');")
elif line.startswith('WAIT '):
parts = line[5:].strip().split()
if parts[0].startswith('`'):
selector = parts[0].strip('`')
# Convert seconds to milliseconds numerically so fractional values such as 0.5 work
timeout_ms = int(float(parts[1]) * 1000) if len(parts) > 1 else 5000
js_code.append(f"await page.waitForSelector('{selector}', {{ timeout: {timeout_ms} }});")
else:
wait_ms = int(float(parts[0]) * 1000)
js_code.append(f"await page.waitForTimeout({wait_ms});")
elif line.startswith('CLICK '):
selector = line[6:].strip().strip('`')
js_code.append(f"await page.click('{selector}');")
elif line.startswith('TYPE '):
text = line[5:].strip().strip('"')
js_code.append(f"await page.keyboard.type('{text}');")
elif line.startswith('SCROLL '):
parts = line[7:].strip().split(' ')
direction = parts[0]
amount = parts[1] if len(parts) > 1 else '500'
if direction == 'DOWN':
js_code.append(f"await page.evaluate(() => window.scrollBy(0, {amount}));")
elif direction == 'UP':
js_code.append(f"await page.evaluate(() => window.scrollBy(0, -{amount}));")
elif line.startswith('IF '):
if 'THEN' not in line:
return {
'success': False,
'error': {
'line': i + 1,
'column': len(line),
'message': "Missing 'THEN' keyword after IF condition",
'suggestion': "Add 'THEN' after the condition",
'sourceLine': line
}
}
condition = line[3:line.index('THEN')].strip()
action = line[line.index('THEN') + 4:].strip()
if 'EXISTS' in condition:
selector_match = condition.split('`')
if len(selector_match) >= 2:
selector = selector_match[1]
action_selector = action.split('`')[1] if '`' in action else ''
js_code.append(
# Parenthesize the await so .length is read from the resolved array, not the Promise
f"if ((await page.$$('{selector}')).length > 0) {{ "
f"await page.click('{action_selector}'); }}"
)
elif line.startswith('PRESS '):
key = line[6:].strip()
js_code.append(f"await page.keyboard.press('{key}');")
else:
# Unknown command
return {
'success': False,
'error': {
'line': i + 1,
'column': 1,
'message': f"Unknown command: {line.split()[0]}",
'suggestion': "Check command syntax",
'sourceLine': line
}
}
except Exception as e:
return {
'success': False,
'error': {
'line': i + 1,
'column': 1,
'message': f"Failed to parse: {str(e)}",
'suggestion': "Check syntax",
'sourceLine': line
}
}
return {
'success': True,
'jsCode': js_code,
'metadata': {
'lineCount': len(js_code),
'sourceLines': len(lines)
}
}
# Example scripts endpoint
@app.route('/api/examples')
def get_examples():
examples = [
{
'id': 'cookie-banner',
'name': 'Handle Cookie Banner',
'description': 'Accept cookies and close newsletter popup',
'script': '''# Handle cookie banner and newsletter
GO http://127.0.0.1:8080/playground/
WAIT `body` 2
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
IF (EXISTS `.newsletter-popup`) THEN CLICK `.close`'''
},
{
'id': 'login',
'name': 'Login Flow',
'description': 'Complete login with credentials',
'script': '''# Log in to the site
CLICK `#login-btn`
WAIT `.login-form` 2
CLICK `#email`
TYPE "demo@example.com"
CLICK `#password`
TYPE "demo123"
IF (EXISTS `#remember-me`) THEN CLICK `#remember-me`
CLICK `button[type="submit"]`
WAIT `.welcome-message` 5'''
},
{
'id': 'infinite-scroll',
'name': 'Infinite Scroll',
'description': 'Load products with scrolling',
'script': '''# Navigate to catalog and scroll
CLICK `#catalog-link`
WAIT `.product-grid` 3
# Scroll multiple times to load products
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000
WAIT 1
SCROLL DOWN 1000'''
},
{
'id': 'form-wizard',
'name': 'Multi-step Form',
'description': 'Complete a multi-step survey',
'script': '''# Navigate to forms
CLICK `a[href="#forms"]`
WAIT `#survey-form` 2
# Step 1: Basic info
CLICK `#full-name`
TYPE "John Doe"
CLICK `#survey-email`
TYPE "john@example.com"
CLICK `.next-step`
WAIT 1
# Step 2: Preferences
CLICK `#interests`
CLICK `option[value="tech"]`
CLICK `option[value="music"]`
CLICK `.next-step`
WAIT 1
# Step 3: Submit
CLICK `#submit-survey`
WAIT `.success-message` 5'''
}
]
return jsonify(examples)
if __name__ == '__main__':
port = int(os.environ.get('PORT', 8000))
print(f"""
╔══════════════════════════════════════════════════════════╗
║ C4A-Script Interactive Tutorial Server ║
╠══════════════════════════════════════════════════════════╣
║ ║
║ Server running at: http://localhost:{port:<6}
║ ║
║ Features: ║
║ • C4A-Script compilation API ║
║ • Interactive playground ║
║ • Real-time execution visualization ║
║ ║
║ C4A Compiler: {'✓ Available' if C4A_AVAILABLE else '✗ Using mock compiler':<30}
║ ║
╚══════════════════════════════════════════════════════════╝
""")
app.run(host='0.0.0.0', port=port, debug=True)
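A client of the `/api/compile` endpoint above receives one of two JSON shapes: a success object with `jsCode` and `metadata`, or a failure object with an `error` carrying `line`, `column`, `message`, and an optional `suggestion`. The following is a minimal sketch of turning either shape into a one-line summary. `summarize_compile_response` is a hypothetical helper, not part of the server, and it assumes `jsCode` is a list of statements, as it is in the mock compiler.

```python
from typing import Any, Dict


def summarize_compile_response(payload: Dict[str, Any]) -> str:
    """Summarize a decoded /api/compile response (hypothetical client helper)."""
    if payload.get("success"):
        # Assumption: jsCode is a list of JavaScript statements, as in mock_compile.
        js_lines = payload.get("jsCode") or []
        return f"OK: {len(js_lines)} JavaScript statement(s) generated"

    error = payload.get("error") or {}
    location = f"line {error.get('line', '?')}, column {error.get('column', '?')}"
    summary = f"Error at {location}: {error.get('message', 'Unknown error')}"
    suggestion = error.get("suggestion")
    if suggestion:
        # The suggestion is optional; the real-compiler path may send None.
        summary += f" (hint: {suggestion})"
    return summary


if __name__ == "__main__":
    print(summarize_compile_response({
        "success": False,
        "error": {"line": 1, "column": 1, "message": "No script provided",
                  "suggestion": "Write some C4A commands"},
    }))
```

The helper takes an already-decoded dictionary, so it can be exercised without a running server.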


@@ -0,0 +1,69 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Blockly Test</title>
<style>
body {
margin: 0;
padding: 20px;
background: #0e0e10;
color: #e0e0e0;
font-family: monospace;
}
#blocklyDiv {
height: 600px;
width: 100%;
border: 1px solid #2a2a2c;
}
#output {
margin-top: 20px;
padding: 15px;
background: #1a1a1b;
border: 1px solid #2a2a2c;
white-space: pre-wrap;
}
</style>
</head>
<body>
<h1>C4A-Script Blockly Test</h1>
<div id="blocklyDiv"></div>
<div id="output">
<h3>Generated C4A-Script:</h3>
<pre id="code-output"></pre>
</div>
<script src="https://unpkg.com/blockly/blockly.min.js"></script>
<script src="assets/c4a-blocks.js"></script>
<script>
// Simple test
const workspace = Blockly.inject('blocklyDiv', {
toolbox: `
<xml>
<category name="Test" colour="#1E88E5">
<block type="c4a_go"></block>
<block type="c4a_wait_time"></block>
<block type="c4a_click"></block>
</category>
</xml>
`,
theme: Blockly.Theme.defineTheme('dark', {
'base': Blockly.Themes.Classic,
'componentStyles': {
'workspaceBackgroundColour': '#0e0e10',
'toolboxBackgroundColour': '#1a1a1b',
'toolboxForegroundColour': '#e0e0e0',
'flyoutBackgroundColour': '#1a1a1b',
'flyoutForegroundColour': '#e0e0e0',
}
})
});
workspace.addChangeListener((event) => {
const code = Blockly.JavaScript.workspaceToCode(workspace);
document.getElementById('code-output').textContent = code;
});
</script>
</body>
</html>


@@ -12,7 +12,7 @@ import os
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai import LLMConfig
-from crawl4ai.extraction_strategy import (
+from crawl4ai import (
 LLMExtractionStrategy,
 JsonCssExtractionStrategy,
 JsonXPathExtractionStrategy,
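The hunk above moves the extraction-strategy imports from `crawl4ai.extraction_strategy` to the top-level `crawl4ai` package. Code that has to run against both layouts could try the candidate locations in order. The sketch below shows that idea with a generic, hypothetical helper `import_first`. It is not part of Crawl4AI, and whether the old module path remains importable in a given release is an assumption to verify.

```python
import importlib
from typing import Any, Sequence


def import_first(name: str, modules: Sequence[str]) -> Any:
    """Return attribute `name` from the first module in `modules` that provides it."""
    for module_name in modules:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or its parent package) is not installed; try the next one.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(modules)}")


# Intended use (left as a comment so the sketch runs without crawl4ai installed):
# LLMExtractionStrategy = import_first(
#     "LLMExtractionStrategy", ["crawl4ai", "crawl4ai.extraction_strategy"]
# )

if __name__ == "__main__":
    # Demonstration with the standard library only.
    print(import_first("OrderedDict", ["no_such_module_xyz", "collections"]))
```

Listing the new location first means the fallback is only consulted on older installations.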


@@ -0,0 +1,376 @@
#!/usr/bin/env python3
"""
Link Head Extraction & Scoring Example
This example demonstrates Crawl4AI's advanced link analysis capabilities:
1. Basic link head extraction
2. Three-layer scoring system (intrinsic, contextual, total)
3. Pattern-based filtering
4. Multiple practical use cases
Requirements:
- crawl4ai installed
- Internet connection
Usage:
python link_head_extraction_example.py
"""
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig
async def basic_link_head_extraction():
"""
Basic example: Extract head content from internal links with scoring
"""
print("🔗 Basic Link Head Extraction Example")
print("=" * 50)
config = CrawlerRunConfig(
# Enable link head extraction
link_preview_config=LinkPreviewConfig(
include_internal=True, # Process internal links
include_external=False, # Skip external links for this demo
max_links=5, # Limit to 5 links
concurrency=3, # Process 3 links simultaneously
timeout=10, # 10 second timeout per link
query="API documentation guide", # Query for relevance scoring
verbose=True # Show detailed progress
),
# Enable intrinsic link scoring
score_links=True,
only_text=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.python.org/3/", config=config)
if result.success:
print(f"\n✅ Successfully crawled: {result.url}")
internal_links = result.links.get("internal", [])
links_with_head = [link for link in internal_links
if link.get("head_data") is not None]
print(f"🧠 Links with head data: {len(links_with_head)}")
# Show detailed results
for i, link in enumerate(links_with_head[:3]):
print(f"\n📄 Link {i+1}: {link['href']}")
print(f" Text: '{link.get('text', 'No text')[:50]}...'")
# Show all three score types
intrinsic = link.get('intrinsic_score')
contextual = link.get('contextual_score')
total = link.get('total_score')
print(f" 📊 Scores:")
if intrinsic is not None:
print(f" • Intrinsic: {intrinsic:.2f}/10.0")
if contextual is not None:
print(f" • Contextual: {contextual:.3f}")
if total is not None:
print(f" • Total: {total:.3f}")
# Show head data
head_data = link.get("head_data", {})
if head_data:
title = head_data.get("title", "No title")
description = head_data.get("meta", {}).get("description", "")
print(f" 📰 Title: {title[:60]}...")
if description:
print(f" 📝 Description: {description[:80]}...")
else:
print(f"❌ Crawl failed: {result.error_message}")
async def research_assistant_example():
"""
Research Assistant: Find highly relevant documentation pages
"""
print("\n\n🔍 Research Assistant Example")
print("=" * 50)
config = CrawlerRunConfig(
link_preview_config=LinkPreviewConfig(
include_internal=True,
include_external=True,
include_patterns=["*/docs/*", "*/tutorial/*", "*/guide/*"],
exclude_patterns=["*/login*", "*/admin*"],
query="machine learning neural networks deep learning",
max_links=15,
score_threshold=0.4, # Only include high-relevance links
concurrency=8,
verbose=False # Clean output for this example
),
score_links=True
)
# Test with scikit-learn documentation
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://scikit-learn.org/stable/", config=config)
if result.success:
print(f"✅ Analyzed: {result.url}")
all_links = result.links.get("internal", []) + result.links.get("external", [])
# Filter for high-scoring links
high_scoring_links = [link for link in all_links
if (link.get("total_score") or 0) > 0.6]
# Sort by total score (highest first); missing or None scores count as 0
high_scoring_links.sort(key=lambda x: x.get("total_score") or 0, reverse=True)
print(f"\n🎯 Found {len(high_scoring_links)} highly relevant links:")
print(" (Showing top 5 by relevance score)")
for i, link in enumerate(high_scoring_links[:5]):
score = link.get("total_score") or 0
# head_data may be present but None, so fall back to an empty dict
title = (link.get("head_data") or {}).get("title") or "No title"
print(f"\n{i+1}. ⭐ {score:.3f} - {title[:70]}...")
print(f" 🔗 {link['href']}")
# Show score breakdown
intrinsic = link.get('intrinsic_score') or 0
contextual = link.get('contextual_score') or 0
print(f" 📊 Quality: {intrinsic:.1f}/10 | Relevance: {contextual:.3f}")
else:
print(f"❌ Research failed: {result.error_message}")
async def api_discovery_example():
"""
API Discovery: Find API endpoints and references
"""
print("\n\n🔧 API Discovery Example")
print("=" * 50)
config = CrawlerRunConfig(
link_preview_config=LinkPreviewConfig(
include_internal=True,
include_patterns=["*/api/*", "*/reference/*", "*/endpoint/*"],
exclude_patterns=["*/deprecated/*", "*/v1/*"], # Skip old versions
max_links=25,
concurrency=10,
timeout=8,
verbose=False
),
score_links=True
)
# Example with a documentation site that has API references
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/", config=config)
if result.success:
print(f"✅ Discovered APIs at: {result.url}")
api_links = result.links.get("internal", [])
# Categorize by detected content
endpoints = {"GET": [], "POST": [], "PUT": [], "DELETE": [], "OTHER": []}
for link in api_links:
if link.get("head_data"):
title = link.get("head_data", {}).get("title", "").upper()
text = link.get("text", "").upper()
# Simple categorization based on content
if "GET" in title or "GET" in text:
endpoints["GET"].append(link)
elif "POST" in title or "POST" in text:
endpoints["POST"].append(link)
elif "PUT" in title or "PUT" in text:
endpoints["PUT"].append(link)
elif "DELETE" in title or "DELETE" in text:
endpoints["DELETE"].append(link)
else:
endpoints["OTHER"].append(link)
# Display results
total_found = sum(len(links) for links in endpoints.values())
print(f"\n📡 Found {total_found} API-related links:")
for method, links in endpoints.items():
if links:
print(f"\n{method} Endpoints ({len(links)}):")
for link in links[:3]: # Show first 3 of each type
title = link.get("head_data", {}).get("title", "No title")
score = link.get("intrinsic_score") or 0  # treat a missing or None score as 0
print(f" • [{score:.1f}] {title[:50]}...")
print(f" {link['href']}")
else:
print(f"❌ API discovery failed: {result.error_message}")
async def link_quality_analysis():
"""
Link Quality Analysis: Analyze website structure and link quality
"""
print("\n\n📊 Link Quality Analysis Example")
print("=" * 50)
config = CrawlerRunConfig(
link_preview_config=LinkPreviewConfig(
include_internal=True,
max_links=30, # Analyze more links for better statistics
concurrency=15,
timeout=6,
verbose=False
),
score_links=True
)
async with AsyncWebCrawler() as crawler:
# Test with a content-rich site
result = await crawler.arun("https://docs.python.org/3/", config=config)
if result.success:
print(f"✅ Analyzed: {result.url}")
links = result.links.get("internal", [])
# Extract intrinsic scores for analysis
scores = [link.get('intrinsic_score', 0) for link in links if link.get('intrinsic_score') is not None]
if scores:
avg_score = sum(scores) / len(scores)
high_quality = len([s for s in scores if s >= 7.0])
medium_quality = len([s for s in scores if 4.0 <= s < 7.0])
low_quality = len([s for s in scores if s < 4.0])
print(f"\n📈 Quality Analysis Results:")
print(f" 📊 Average Score: {avg_score:.2f}/10.0")
print(f" 🟢 High Quality (≥7.0): {high_quality} links")
print(f" 🟡 Medium Quality (4.0-6.9): {medium_quality} links")
print(f" 🔴 Low Quality (<4.0): {low_quality} links")
# Show best and worst links
scored_links = [(link, link.get('intrinsic_score', 0)) for link in links
if link.get('intrinsic_score') is not None]
scored_links.sort(key=lambda x: x[1], reverse=True)
print(f"\n🏆 Top 3 Quality Links:")
for i, (link, score) in enumerate(scored_links[:3]):
text = link.get('text', 'No text')[:40]
print(f" {i+1}. [{score:.1f}] {text}...")
print(f" {link['href']}")
print(f"\n⚠️ Bottom 3 Quality Links:")
for i, (link, score) in enumerate(scored_links[-3:]):
text = link.get('text', 'No text')[:40]
print(f" {i+1}. [{score:.1f}] {text}...")
print(f" {link['href']}")
else:
print("❌ No scoring data available")
else:
print(f"❌ Analysis failed: {result.error_message}")
async def pattern_filtering_example():
"""
Pattern Filtering: Demonstrate advanced filtering capabilities
"""
print("\n\n🎯 Pattern Filtering Example")
print("=" * 50)
# Example with multiple filtering strategies
filters = [
{
"name": "Documentation Only",
"config": LinkPreviewConfig(
include_internal=True,
max_links=10,
concurrency=5,
verbose=False,
include_patterns=["*/docs/*", "*/documentation/*"],
exclude_patterns=["*/api/*"]
)
},
{
"name": "API References Only",
"config": LinkPreviewConfig(
include_internal=True,
max_links=10,
concurrency=5,
verbose=False,
include_patterns=["*/api/*", "*/reference/*"],
exclude_patterns=["*/tutorial/*"]
)
},
{
"name": "Exclude Admin Areas",
"config": LinkPreviewConfig(
include_internal=True,
max_links=10,
concurrency=5,
verbose=False,
exclude_patterns=["*/admin/*", "*/login/*", "*/dashboard/*"]
)
}
]
async with AsyncWebCrawler() as crawler:
for filter_example in filters:
print(f"\n🔍 Testing: {filter_example['name']}")
config = CrawlerRunConfig(
link_preview_config=filter_example['config'],
score_links=True
)
result = await crawler.arun("https://docs.python.org/3/", config=config)
if result.success:
links = result.links.get("internal", [])
links_with_head = [link for link in links if link.get("head_data")]
print(f" 📊 Found {len(links_with_head)} matching links")
if links_with_head:
# Show sample matches
for link in links_with_head[:2]:
title = link.get("head_data", {}).get("title", "No title")
print(f"{title[:50]}...")
print(f" {link['href']}")
else:
print(f" ❌ Failed: {result.error_message}")
async def main():
"""
Run all examples
"""
print("🚀 Crawl4AI Link Head Extraction Examples")
print("=" * 60)
print("This will demonstrate various link analysis capabilities.\n")
try:
# Run all examples
await basic_link_head_extraction()
await research_assistant_example()
await api_discovery_example()
await link_quality_analysis()
await pattern_filtering_example()
print("\n" + "=" * 60)
print("✨ All examples completed successfully!")
print("\nNext steps:")
print("1. Try modifying the queries and patterns above")
print("2. Test with your own websites")
print("3. Experiment with different score thresholds")
print("4. Check out the full documentation for more options")
except KeyboardInterrupt:
print("\n⏹️ Examples interrupted by user")
except Exception as e:
print(f"\n💥 Error running examples: {str(e)}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
asyncio.run(main())
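The examples above repeatedly filter and sort link dictionaries by `total_score` and read a title out of `head_data`. A minimal sketch of doing that while tolerating missing or `None` values follows. `rank_links` and `link_title` are hypothetical helpers, not part of Crawl4AI, and the dictionary keys are assumed to match those used in the example file.

```python
from typing import Any, Dict, List


def _score(link: Dict[str, Any]) -> float:
    """Read total_score, treating a missing or None value as 0.0."""
    value = link.get("total_score")
    return float(value) if value is not None else 0.0


def rank_links(links: List[Dict[str, Any]], min_score: float = 0.0) -> List[Dict[str, Any]]:
    """Keep links scoring at least min_score, highest score first."""
    kept = [link for link in links if _score(link) >= min_score]
    return sorted(kept, key=_score, reverse=True)


def link_title(link: Dict[str, Any], default: str = "No title") -> str:
    """Return the head title, falling back when head_data or title is absent or None."""
    head = link.get("head_data") or {}
    return head.get("title") or default


if __name__ == "__main__":
    sample = [
        {"href": "/a", "total_score": 0.2, "head_data": None},
        {"href": "/b", "total_score": None},
        {"href": "/c", "total_score": 0.9, "head_data": {"title": "C page"}},
    ]
    for link in rank_links(sample, min_score=0.1):
        print(link["href"], link_title(link))
```

Note that `dict.get(key, default)` only applies the default when the key is absent, which is why the helpers use an explicit `None` check and `or` fallbacks rather than relying on the default argument.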


@@ -518,7 +518,7 @@
 }
 ],
 "source": [
-"from crawl4ai.extraction_strategy import LLMExtractionStrategy\n",
+"from crawl4ai import LLMExtractionStrategy\n",
 "from pydantic import BaseModel, Field\n",
 "import os, json\n",
 "\n",
@@ -594,7 +594,7 @@
 }
 ],
 "source": [
-"from crawl4ai.extraction_strategy import CosineStrategy\n",
+"from crawl4ai import CosineStrategy\n",
 "\n",
 "async def cosine_similarity_extraction():\n",
 " async with AsyncWebCrawler() as crawler:\n",


@@ -16,7 +16,7 @@ from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.extraction_strategy import (
+from crawl4ai import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
)
@@ -416,7 +416,7 @@ async def crawl_dynamic_content_pages_method_2():
async def cosine_similarity_extraction():
-from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai import CosineStrategy
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy(


@@ -16,7 +16,7 @@ from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.extraction_strategy import (
+from crawl4ai import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
)
@@ -416,7 +416,7 @@ async def crawl_dynamic_content_pages_method_2():
async def cosine_similarity_extraction():
-from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai import CosineStrategy
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy(


@@ -2,7 +2,7 @@ import os
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
-from crawl4ai.extraction_strategy import *
+from crawl4ai import *
from crawl4ai.crawler_strategy import *
url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"


@@ -18,7 +18,7 @@ from crawl4ai import RoundRobinProxyStrategy
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai import DefaultMarkdownGenerator
from crawl4ai import LLMConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
from pprint import pprint


@@ -0,0 +1,367 @@
"""
Example of using the virtual scroll feature to capture content from pages
with virtualized scrolling (like Twitter, Instagram, or other infinite scroll feeds).
This example demonstrates virtual scroll with a local test server serving
different types of scrolling behaviors from HTML files in the assets directory.
"""
import asyncio
import os
import http.server
import socketserver
import threading
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, CacheMode, BrowserConfig
# Get the assets directory path
ASSETS_DIR = Path(__file__).parent / "assets"
class TestServer:
    """Simple HTTP server to serve our test HTML files"""
    def __init__(self, port=8080):
        self.port = port
        self.httpd = None
        self.server_thread = None
    async def start(self):
        """Start the test server"""
        Handler = http.server.SimpleHTTPRequestHandler
        # Save current directory and change to assets directory
        self.original_cwd = os.getcwd()
        os.chdir(ASSETS_DIR)
        # Try to find an available port
        for _ in range(10):
            try:
                self.httpd = socketserver.TCPServer(("", self.port), Handler)
                break
            except OSError:
                self.port += 1
        if self.httpd is None:
            raise RuntimeError("Could not find available port")
        self.server_thread = threading.Thread(target=self.httpd.serve_forever)
        self.server_thread.daemon = True
        self.server_thread.start()
        # Give server time to start
        await asyncio.sleep(0.5)
        print(f"Test server started on http://localhost:{self.port}")
        return self.port
    def stop(self):
        """Stop the test server"""
        if self.httpd:
            self.httpd.shutdown()
        # Restore original directory
        if hasattr(self, 'original_cwd'):
            os.chdir(self.original_cwd)
async def example_twitter_like_virtual_scroll():
    """
    Example 1: Twitter-like virtual scroll where content is REPLACED.
    This is the classic virtual scroll use case - only visible items exist in DOM.
    """
    print("\n" + "="*60)
    print("EXAMPLE 1: Twitter-like Virtual Scroll")
    print("="*60)
    server = TestServer()
    port = await server.start()
    try:
        # Configure virtual scroll for Twitter-like timeline
        virtual_config = VirtualScrollConfig(
            container_selector="#timeline",  # The scrollable container
            scroll_count=50,  # Scroll up to 50 times to get all content
            scroll_by="container_height",  # Scroll by container's height
            wait_after_scroll=0.3  # Wait 300ms after each scroll
        )
        config = CrawlerRunConfig(
            virtual_scroll_config=virtual_config,
            cache_mode=CacheMode.BYPASS
        )
        # TIP: Set headless=False to watch the scrolling happen!
        browser_config = BrowserConfig(
            headless=False,
            viewport={"width": 1280, "height": 800}
        )
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url=f"http://localhost:{port}/virtual_scroll_twitter_like.html",
                config=config
            )
            # Count tweets captured
            import re
            tweets = re.findall(r'data-tweet-id="(\d+)"', result.html)
            unique_tweets = sorted(set(int(tweet_id) for tweet_id in tweets))
            print(f"\n📊 Results:")
            print(f" Total HTML length: {len(result.html):,} characters")
            print(f" Tweets captured: {len(unique_tweets)} unique tweets")
            if unique_tweets:
                print(f" Tweet IDs range: {min(unique_tweets)} to {max(unique_tweets)}")
                print(f" Expected range: 0 to 499 (500 tweets total)")
                if len(unique_tweets) == 500:
                    print(f" ✅ SUCCESS! All tweets captured!")
                else:
                    print(f" ⚠️ Captured {len(unique_tweets)}/500 tweets")
    finally:
        server.stop()
async def example_traditional_append_scroll():
    """
    Example 2: Traditional infinite scroll where content is APPENDED.
    No virtual scroll needed - all content stays in DOM.
    """
    print("\n" + "="*60)
    print("EXAMPLE 2: Traditional Append-Only Scroll")
    print("="*60)
    server = TestServer()
    port = await server.start()
    try:
        # Configure virtual scroll
        virtual_config = VirtualScrollConfig(
            container_selector=".posts-container",
            scroll_count=15,  # Fewer scrolls needed since content accumulates
            scroll_by=500,  # Scroll by 500 pixels
            wait_after_scroll=0.4
        )
        config = CrawlerRunConfig(
            virtual_scroll_config=virtual_config,
            cache_mode=CacheMode.BYPASS
        )
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=f"http://localhost:{port}/virtual_scroll_append_only.html",
                config=config
            )
            # Count posts
            import re
            posts = re.findall(r'data-post-id="(\d+)"', result.html)
            unique_posts = sorted(set(int(post_id) for post_id in posts))
            print(f"\n📊 Results:")
            print(f" Total HTML length: {len(result.html):,} characters")
            print(f" Posts captured: {len(unique_posts)} unique posts")
            if unique_posts:
                print(f" Post IDs range: {min(unique_posts)} to {max(unique_posts)}")
            print(f" Note: This page appends content, so virtual scroll")
            print(f" just helps trigger more loads. All content stays in DOM.")
    finally:
        server.stop()
async def example_instagram_grid():
    """
    Example 3: Instagram-like grid with virtual scroll.
    Grid layout where only visible rows are rendered.
    """
    print("\n" + "="*60)
    print("EXAMPLE 3: Instagram Grid Virtual Scroll")
    print("="*60)
    server = TestServer()
    port = await server.start()
    try:
        # Configure for grid layout
        virtual_config = VirtualScrollConfig(
            container_selector=".feed-container",  # Container with the grid
            scroll_count=100,  # Many scrolls for 999 posts
            scroll_by="container_height",
            wait_after_scroll=0.2  # Faster scrolling for grid
        )
        config = CrawlerRunConfig(
            virtual_scroll_config=virtual_config,
            cache_mode=CacheMode.BYPASS,
            screenshot=True  # Take a screenshot of the final grid
        )
        # Show browser for this visual example
        browser_config = BrowserConfig(
            headless=False,
            viewport={"width": 1200, "height": 900}
        )
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url=f"http://localhost:{port}/virtual_scroll_instagram_grid.html",
                config=config
            )
            # Count posts in grid
            import re
            posts = re.findall(r'data-post-id="(\d+)"', result.html)
            unique_posts = sorted(set(int(post_id) for post_id in posts))
            print(f"\n📊 Results:")
            print(f" Posts in grid: {len(unique_posts)} unique posts")
            if unique_posts:
                print(f" Post IDs range: {min(unique_posts)} to {max(unique_posts)}")
                print(f" Expected: 0 to 998 (999 posts total)")
            # Save screenshot
            if result.screenshot:
                import base64
                with open("instagram_grid_result.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
                print(f" 📸 Screenshot saved as instagram_grid_result.png")
    finally:
        server.stop()
async def example_mixed_content():
    """
    Example 4: News feed with mixed behavior.
    Featured articles stay (no virtual scroll), regular articles are virtualized.
    """
    print("\n" + "="*60)
    print("EXAMPLE 4: News Feed with Mixed Behavior")
    print("="*60)
    server = TestServer()
    port = await server.start()
    try:
        # Configure virtual scroll
        virtual_config = VirtualScrollConfig(
            container_selector="#newsContainer",
            scroll_count=25,
            scroll_by="container_height",
            wait_after_scroll=0.3
        )
        config = CrawlerRunConfig(
            virtual_scroll_config=virtual_config,
            cache_mode=CacheMode.BYPASS
        )
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=f"http://localhost:{port}/virtual_scroll_news_feed.html",
                config=config
            )
            # Count different types of articles
            import re
            featured = re.findall(r'data-article-id="featured-\d+"', result.html)
            regular = re.findall(r'data-article-id="article-(\d+)"', result.html)
            print(f"\n📊 Results:")
            print(f" Featured articles: {len(set(featured))} (always visible)")
            print(f" Regular articles: {len(set(regular))} unique articles")
            if regular:
                regular_ids = sorted(set(int(article_id) for article_id in regular))
                print(f" Regular article IDs: {min(regular_ids)} to {max(regular_ids)}")
            print(f" Note: Featured articles stay in DOM, only regular")
            print(f" articles are replaced during virtual scroll")
    finally:
        server.stop()
async def compare_with_without_virtual_scroll():
    """
    Comparison: Show the difference between crawling with and without virtual scroll.
    """
    print("\n" + "="*60)
    print("COMPARISON: With vs Without Virtual Scroll")
    print("="*60)
    server = TestServer()
    port = await server.start()
    try:
        url = f"http://localhost:{port}/virtual_scroll_twitter_like.html"
        # First, crawl WITHOUT virtual scroll
        print("\n1⃣ Crawling WITHOUT virtual scroll...")
        async with AsyncWebCrawler() as crawler:
            config_normal = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            result_normal = await crawler.arun(url=url, config=config_normal)
            # Count items
            import re
            tweets_normal = len(set(re.findall(r'data-tweet-id="(\d+)"', result_normal.html)))
        # Then, crawl WITH virtual scroll
        print("2⃣ Crawling WITH virtual scroll...")
        virtual_config = VirtualScrollConfig(
            container_selector="#timeline",
            scroll_count=50,
            scroll_by="container_height",
            wait_after_scroll=0.2
        )
        config_virtual = CrawlerRunConfig(
            virtual_scroll_config=virtual_config,
            cache_mode=CacheMode.BYPASS
        )
        async with AsyncWebCrawler() as crawler:
            result_virtual = await crawler.arun(url=url, config=config_virtual)
            # Count items
            tweets_virtual = len(set(re.findall(r'data-tweet-id="(\d+)"', result_virtual.html)))
        # Compare results
        print(f"\n📊 Comparison Results:")
        print(f" Without virtual scroll: {tweets_normal} tweets (only initial visible)")
        print(f" With virtual scroll: {tweets_virtual} tweets (all content captured)")
        # Format the ratio only when it is a number; applying ':.1f' to the
        # string 'N/A' would raise ValueError.
        if tweets_normal > 0:
            print(f" Improvement: {tweets_virtual / tweets_normal:.1f}x more content!")
        else:
            print(" Improvement: N/A (no tweets captured without virtual scroll)")
        print(f"\n HTML size without: {len(result_normal.html):,} characters")
        print(f" HTML size with: {len(result_virtual.html):,} characters")
    finally:
        server.stop()
if __name__ == "__main__":
    print("""
╔════════════════════════════════════════════════════════════╗
║ Virtual Scroll Examples for Crawl4AI ║
╚════════════════════════════════════════════════════════════╝
These examples demonstrate different virtual scroll scenarios:
1. Twitter-like (content replaced) - Classic virtual scroll
2. Traditional append - Content accumulates
3. Instagram grid - Visual grid layout
4. Mixed behavior - Some content stays, some virtualizes
Starting examples...
""")
    # Run all examples
    asyncio.run(example_twitter_like_virtual_scroll())
    asyncio.run(example_traditional_append_scroll())
    asyncio.run(example_instagram_grid())
    asyncio.run(example_mixed_content())
    asyncio.run(compare_with_without_virtual_scroll())
    print("\n✅ All examples completed!")
    print("\nTIP: Set headless=False in BrowserConfig to watch the scrolling in action!")


@@ -6,7 +6,7 @@ Many websites now load images **lazily** as you scroll. If you need to ensure th
2. **`scan_full_page`** Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`** Add small delays between scroll steps.
-**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).
+**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md). For sites with virtual scrolling (Twitter/Instagram style), see the [Virtual Scroll docs](virtual-scroll.md).
### Example: Ensuring Lazy Images Appear


@@ -45,7 +45,7 @@ Here's an example of crawling GitHub commits across multiple pages while preserv
```python
from crawl4ai.async_configs import CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():


@@ -0,0 +1,310 @@
# Virtual Scroll
Modern websites increasingly use **virtual scrolling** (also called windowed rendering or viewport rendering) to handle large datasets efficiently. This technique only renders visible items in the DOM, replacing content as users scroll. Popular examples include Twitter's timeline, Instagram's feed, and many data tables.
Crawl4AI's Virtual Scroll feature automatically detects and handles these scenarios, ensuring you capture **all content**, not just what's initially visible.
## Understanding Virtual Scroll
### The Problem
Traditional infinite scroll **appends** new content to existing content. Virtual scroll **replaces** content to maintain performance:
```
Traditional Scroll:          Virtual Scroll:
┌─────────────┐              ┌─────────────┐
│ Item 1      │              │ Item 11     │  <- Items 1-10 removed
│ Item 2      │              │ Item 12     │  <- Only visible items
│ ...         │              │ Item 13     │     in DOM
│ Item 10     │              │ Item 14     │
│ Item 11 NEW │              │ Item 15     │
│ Item 12 NEW │              └─────────────┘
└─────────────┘
DOM keeps growing            DOM size stays constant
```
Without proper handling, crawlers only capture the currently visible items, missing the rest of the content.
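To make the problem concrete, here is a small pure-Python simulation. It is only a sketch: `visible_window` is a hypothetical stand-in for a virtualized list and does not use Crawl4AI or a browser.

```python
def visible_window(items, scroll_top, window=5):
    """Return the slice of items a virtualized list would keep in the DOM."""
    return items[scroll_top:scroll_top + window]


items = [f"Item {i}" for i in range(1, 21)]

# A single snapshot sees only the first window.
single_snapshot = visible_window(items, 0)

# Capturing at every scroll position and merging recovers the full list.
captured = []
for top in range(0, len(items), 5):
    for item in visible_window(items, top):
        if item not in captured:
            captured.append(item)

print(len(single_snapshot))  # 5
print(len(captured))         # 20
```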
### Three Scrolling Scenarios
Crawl4AI's Virtual Scroll detects and handles three scenarios:
1. **No Change** - Content doesn't update on scroll (static page or end reached)
2. **Content Appended** - New items added to existing ones (traditional infinite scroll)
3. **Content Replaced** - Items replaced with new ones (true virtual scroll)
Only scenario 3 requires special handling, which Virtual Scroll automates.
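The three scenarios can be told apart by comparing what is in the container before and after one scroll step. The sketch below works on sets of item IDs, which is a simplifying assumption: `classify_scroll` is a hypothetical helper, not Crawl4AI's detection code.

```python
def classify_scroll(before_ids, after_ids):
    """Classify one scroll step from the item IDs seen before and after it."""
    before, after = set(before_ids), set(after_ids)
    if after == before:
        return "no_change"  # scenario 1: nothing new appeared
    if before <= after:
        return "appended"   # scenario 2: old items kept, new ones added
    return "replaced"       # scenario 3: some old items left the DOM


print(classify_scroll([1, 2, 3], [1, 2, 3]))        # no_change
print(classify_scroll([1, 2, 3], [1, 2, 3, 4, 5]))  # appended
print(classify_scroll([1, 2, 3], [3, 4, 5]))        # replaced
```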
## Basic Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

# Configure virtual scroll
virtual_config = VirtualScrollConfig(
    container_selector="#feed",      # CSS selector for scrollable container
    scroll_count=20,                 # Number of scrolls to perform
    scroll_by="container_height",    # How much to scroll each time
    wait_after_scroll=0.5            # Wait time (seconds) after each scroll
)

# Use in crawler configuration
config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        # result.html contains ALL items from the virtual scroll

asyncio.run(main())
```
## Configuration Parameters
### VirtualScrollConfig
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `container_selector` | `str` | Required | CSS selector for the scrollable container |
| `scroll_count` | `int` | `10` | Maximum number of scrolls to perform |
| `scroll_by` | `str` or `int` | `"container_height"` | Scroll amount per step |
| `wait_after_scroll` | `float` | `0.5` | Seconds to wait after each scroll |
### Scroll By Options
- `"container_height"` - Scroll by the container's visible height
- `"page_height"` - Scroll by the viewport height
- `500` (integer) - Scroll by exact pixel amount
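As an illustration of how the three options differ, the sketch below maps a `scroll_by` value to a pixel distance. `resolve_scroll_pixels` and its height arguments are hypothetical names used only for this example; Crawl4AI resolves the value inside the browser.

```python
def resolve_scroll_pixels(scroll_by, container_height, viewport_height):
    """Translate a scroll_by setting into a number of pixels to scroll."""
    if isinstance(scroll_by, int) and not isinstance(scroll_by, bool):
        return scroll_by          # exact pixel amount
    if scroll_by == "container_height":
        return container_height   # one container's worth of content
    if scroll_by == "page_height":
        return viewport_height    # one viewport's worth of content
    raise ValueError(f"Unsupported scroll_by value: {scroll_by!r}")


print(resolve_scroll_pixels("container_height", 600, 800))  # 600
print(resolve_scroll_pixels("page_height", 600, 800))       # 800
print(resolve_scroll_pixels(500, 600, 800))                 # 500
```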
## Real-World Examples
### Twitter-like Timeline
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, BrowserConfig

async def crawl_twitter_timeline():
    # Twitter replaces tweets as you scroll
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",
        scroll_count=30,
        scroll_by="container_height",
        wait_after_scroll=1.0  # Twitter needs time to load
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )
    # Optional: pass config=BrowserConfig(headless=False) to AsyncWebCrawler
    # to watch it work
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # Extract tweet count
        import re
        tweets = re.findall(r'data-testid="tweet"', result.html)
        print(f"Captured {len(tweets)} tweets")
```
### Instagram Grid
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_instagram_grid():
    # Instagram uses virtualized grid for performance
    virtual_config = VirtualScrollConfig(
        container_selector="article",  # Main feed container
        scroll_count=50,               # More scrolls for grid layout
        scroll_by=800,                 # Fixed pixel scrolling
        wait_after_scroll=0.8
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config,
        screenshot=True  # Capture final state
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.instagram.com/explore/tags/photography/",
            config=config
        )
        # Count posts
        posts = result.html.count('class="post"')
        print(f"Captured {posts} posts from virtualized grid")
```
### Mixed Content (News Feed)
Some sites mix static and virtualized content:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_mixed_feed():
    # Featured articles stay, regular articles virtualize
    virtual_config = VirtualScrollConfig(
        container_selector=".main-feed",
        scroll_count=25,
        scroll_by="container_height",
        wait_after_scroll=0.5
    )
    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.example.com",
            config=config
        )
        # Featured articles remain throughout
        featured = result.html.count('class="featured-article"')
        regular = result.html.count('class="regular-article"')
        print(f"Featured (static): {featured}")
        print(f"Regular (virtualized): {regular}")
```
## Virtual Scroll vs scan_full_page
Both features handle dynamic content, but serve different purposes:
| Feature | Virtual Scroll | scan_full_page |
|---------|---------------|----------------|
| **Purpose** | Capture content that's replaced during scroll | Load content that's appended during scroll |
| **Use Case** | Twitter, Instagram, virtual tables | Traditional infinite scroll, lazy-loaded images |
| **DOM Behavior** | Replaces elements | Adds elements |
| **Memory Usage** | Efficient (merges content) | Can grow large |
| **Configuration** | Requires container selector | Works on full page |
### When to Use Which?
Use **Virtual Scroll** when:
- Content disappears as you scroll (Twitter timeline)
- DOM element count stays relatively constant
- You need ALL items from a virtualized list
- Container-based scrolling (not full page)
Use **scan_full_page** when:
- Content accumulates as you scroll
- Images load lazily
- Simple "load more" behavior
- Full page scrolling
## Combining with Extraction
Virtual Scroll works seamlessly with extraction strategies:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig
from crawl4ai import LLMExtractionStrategy

# Define extraction schema
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "author": {"type": "string"},
            "content": {"type": "string"},
            "timestamp": {"type": "string"}
        }
    }
}

# Configure both virtual scroll and extraction
config = CrawlerRunConfig(
    virtual_scroll_config=VirtualScrollConfig(
        container_selector="#timeline",
        scroll_count=20
    ),
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        schema=schema
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", config=config)
        # Extracted data from ALL scrolled content
        posts = json.loads(result.extracted_content)
        print(f"Extracted {len(posts)} posts from virtual scroll")

asyncio.run(main())
```
## Performance Tips
1. **Container Selection**: Be specific with selectors. Using the correct container improves performance.
2. **Scroll Count**: Start conservative and increase as needed:
   ```python
   # Start with fewer scrolls
   virtual_config = VirtualScrollConfig(
       container_selector="#feed",
       scroll_count=10  # Test with 10, increase if needed
   )
   ```
3. **Wait Times**: Adjust based on site speed:
   ```python
   # Fast sites
   wait_after_scroll=0.2
   # Slower sites or heavy content
   wait_after_scroll=1.5
   ```
4. **Debug Mode**: Set `headless=False` to watch scrolling:
   ```python
   browser_config = BrowserConfig(headless=False)
   async with AsyncWebCrawler(config=browser_config) as crawler:
       ...  # Watch the scrolling happen
   ```
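If you roughly know how many items the list holds and how many new items each scroll reveals, you can estimate a starting `scroll_count` instead of guessing. This is a back-of-the-envelope sketch: `estimate_scroll_count` and its `extra` safety margin are hypothetical, and real pages may reveal a varying number of items per scroll.

```python
import math


def estimate_scroll_count(total_items, items_per_scroll, extra=2):
    """Estimate how many scrolls are needed to reveal every item.

    `extra` adds a few spare scrolls in case some steps reveal fewer items.
    """
    if items_per_scroll <= 0:
        raise ValueError("items_per_scroll must be positive")
    return math.ceil(total_items / items_per_scroll) + extra


print(estimate_scroll_count(500, 12))  # 44
```

Treat the result as a first value to test with, then adjust it after checking how many items were actually captured.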
## How It Works Internally
1. **Detection Phase**: Scrolls and compares HTML to detect behavior
2. **Capture Phase**: For replaced content, stores HTML chunks at each position
3. **Merge Phase**: Combines all chunks, removing duplicates based on text content
4. **Result**: Complete HTML with all unique items
Deduplication compares normalized text (lowercased, with spaces and symbols removed), so an item captured at more than one scroll position is kept only once even if its whitespace or punctuation differs between captures.
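The merge step can be sketched in a few lines. This is an illustration of text-based deduplication under simplifying assumptions, not Crawl4AI's implementation: chunks are plain lists of item text rather than HTML, and `normalize` and `merge_chunks` are hypothetical helpers.

```python
def normalize(text):
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def merge_chunks(chunks):
    """Merge captured chunks in order, keeping the first copy of each item."""
    seen = set()
    merged = []
    for chunk in chunks:
        for item in chunk:
            key = normalize(item)
            if key and key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


chunks = [
    ["Item 1", "Item 2", "Item 3"],
    ["item 3", "Item 4"],    # "item 3" overlaps the previous capture
    ["Item  4!", "Item 5"],  # differs from "Item 4" only in spacing and symbols
]
print(merge_chunks(chunks))
# ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
```

One consequence of keying on text alone is that two genuinely distinct items with identical text collapse into one.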
## Error Handling
Virtual Scroll handles errors gracefully:
```python
# If container not found or scrolling fails
result = await crawler.arun(url="...", config=config)
if result.success:
    # Virtual scroll worked or wasn't needed
    print(f"Captured {len(result.html)} characters")
else:
    # Crawl failed entirely
    print(f"Error: {result.error_message}")
```
If the container isn't found, crawling continues normally without virtual scroll.
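Because a missing container does not fail the crawl, `result.success` alone cannot tell you whether virtual scroll captured anything extra. One way to check is to count unique items in the returned HTML and compare against a crawl without virtual scroll. The sketch below assumes items carry a numeric ID attribute; `unique_item_ids` and the `data-item-id` attribute are hypothetical.

```python
import re


def unique_item_ids(html, attribute="data-item-id"):
    """Return the sorted unique numeric IDs found in the given attribute."""
    pattern = re.escape(attribute) + r'="(\d+)"'
    return sorted({int(match) for match in re.findall(pattern, html)})


sample_html = (
    '<div data-item-id="0"></div>'
    '<div data-item-id="1"></div>'
    '<div data-item-id="1"></div>'
)
ids = unique_item_ids(sample_html)
print(len(ids), ids[0], ids[-1])  # 2 0 1
```

In practice you would pass `result.html` instead of `sample_html` and use whatever attribute the target page puts on its items.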
## Complete Example
See our [comprehensive example](/docs/examples/virtual_scroll_example.py) that demonstrates:
- Twitter-like feeds
- Instagram grids
- Traditional infinite scroll
- Mixed content scenarios
- Performance comparisons
```bash
# Run the examples
cd docs/examples
python virtual_scroll_example.py
```
The example includes a local test server with different scrolling behaviors for experimentation.


@@ -215,7 +215,7 @@ Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai import JsonCssExtractionStrategy
async def main():
# Example schema

Some files were not shown because too many files have changed in this diff.