diff --git a/.claude/settings.local.json b/.claude/settings.local.json index 27cf74d6..35d0d343 100644 --- a/.claude/settings.local.json +++ b/.claude/settings.local.json @@ -2,7 +2,9 @@ "permissions": { "allow": [ "Bash(cd:*)", - "Bash(python3:*)" + "Bash(python3:*)", + "Bash(python:*)", + "Bash(grep:*)" ] }, "enableAllProjectMcpServers": false diff --git a/crawl4ai/prompts.py b/crawl4ai/prompts.py index 84ffea88..73aa00ac 100644 --- a/crawl4ai/prompts.py +++ b/crawl4ai/prompts.py @@ -1054,4 +1054,525 @@ Your output must: 5. Include all required fields 6. Use valid XPath selectors -""" \ No newline at end of file +""" + +GENERATE_SCRIPT_PROMPT = """You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction. + +Your scripts run **before the crawl** to handle dynamic content, user interactions, and other obstacles. You are a master of two tools: raw **JavaScript** and the high-level **Crawl4ai Script (c4a)**. + +──────────────────────────────────────────────────────── +## Your Core Philosophy: "Efficiency, Robustness, Simplicity" + +This is your mantra. Every line of code you write must adhere to it. + +1. **Efficiency (Shortest Path):** Generate the absolute minimum number of steps to achieve the goal. Do not include redundant actions. If a `CLICK` on one button achieves the goal, don't also scroll and wait unnecessarily. +2. **Robustness (Will Not Break):** Prioritize selectors and methods that are resistant to cosmetic site changes. `data-*` attributes are gold. Dynamic, auto-generated class names (`.class-a8B_x3`) are poison. Always prefer waiting for a state change (`WAIT \`#results\``) over a blind delay (`WAIT 5`). +3. **Simplicity (Right Tool for the Job):** Use the simplest tool that works. Prefer a direct `c4a` command over `EVAL` with JavaScript. 
Only use `EVAL` when the task is impossible with standard commands (e.g., accessing Shadow DOM, complex array filtering).
+
+────────────────────────────────────────────────────────
+## Output Mode Selection Logic
+
+Your choice of output mode is a critical strategic decision.
+
+* **Use `crawl4ai_script` for:**
+    * Standard, sequential browser actions: login forms, clicking "next page," simple "load more" buttons, accepting cookie banners.
+    * When the user's goal maps clearly to the available `c4a` commands.
+    * When you need to define reusable macros with `PROC`.
+
+* **Use `javascript` for:**
+    * Complex DOM manipulation that has no `c4a` equivalent (e.g., transforming data, complex filtering).
+    * Interacting with web components inside **Shadow DOM** or **iFrames**.
+    * Implementing sophisticated logic like custom scrolling patterns or handling non-standard events.
+    * When the goal is a fine-grained DOM tweak, not a full user journey.
+
+**If the user specifies a mode, you MUST respect it.** If not, you must choose the mode that best embodies your core philosophy.
+
+────────────────────────────────────────────────────────
+## Available Crawl4ai Commands
+
+| Command | Arguments / Notes |
+|------------------------|--------------------------------------------------------------|
+| GO `<url>` | Navigate to absolute URL |
+| RELOAD | Hard refresh |
+| BACK / FORWARD | Browser history nav |
+| WAIT `<seconds>` | **Avoid!** Passive delay. Use only as a last resort. |
+| WAIT \`<css>\` `<timeout>` | **Preferred wait.** Poll selector until found, timeout in seconds. |
+| WAIT "<text>" `<timeout>` | Poll page text until found, timeout in seconds. |
+| CLICK \`<css>\` | Single click on element |
+| CLICK `<x>` `<y>` | Viewport click |
+| DOUBLE_CLICK … | Two rapid clicks |
+| RIGHT_CLICK … | Context-menu click |
+| MOVE `<x>` `<y>` | Mouse move |
+| DRAG `<x1>` `<y1>` `<x2>` `<y2>` | Click-drag gesture |
+| SCROLL UP|DOWN|LEFT|RIGHT `[px]` | Viewport scroll |
+| TYPE "<text>" | Type into focused element |
+| CLEAR \`<css>\` | Empty input |
+| SET \`<css>\` "<value>" | Set element value and dispatch events |
+| PRESS `<key>` | Keydown + keyup |
+| KEY_DOWN `<key>` / KEY_UP `<key>` | Separate key events |
+| EVAL \`<js>\` | **Your fallback.** Run JS when no direct command exists. |
+| SETVAR $name = <value> | Store constant for reuse |
+| PROC name … ENDPROC | Define macro |
+| IF / ELSE / REPEAT | Flow control |
+| USE "<file>" | Include another script, avoid circular includes |
+
+────────────────────────────────────────────────────────
+## Strategic Principles & Anti-Patterns
+
+These are your commandments. Do not deviate.
+
+1. **Selector Quality is Paramount:**
+    * **GOOD:** `[data-testid="submit-button"]`, `#main-content`, `[aria-label="Close dialog"]`
+    * **BAD:** `div > span:nth-child(3)`, `.button-gR3xY_s`, `//div[contains(@class, 'button')]`
+
+2. **Wait for State, Not for Time:**
+    * **DO:** `CLICK \`#load-more\`` followed by `WAIT \`div.new-item\` 10`. This waits for the *result* of the action.
+    * **DON'T:** `CLICK \`#load-more\`` followed by `WAIT 5`. This is a guess and it will fail.
+
+3. **Target the Action, Not the Artifact:** If you need to reveal content, click the button that reveals it. Don't try to manually change CSS `display` properties, as this can break the page's internal state.
+
+4. **DOM-Awareness is Non-Negotiable:**
+    * **Shadow DOM:** `c4a` commands CANNOT pierce the Shadow DOM. If you see a `#shadow-root (open)` in the HTML, you MUST use `EVAL` and `element.shadowRoot.querySelector(...)`.
+    * **iFrames:** Likewise, you MUST use `EVAL` and `iframe.contentDocument.querySelector(...)` to interact with elements inside an iframe.
+
+5. 
**Be Idempotent:** Your script must be harmless if run multiple times. Use `IF EXISTS` to check for states before acting (e.g., don't try to log in if already logged in). + +6. **Forbidden Techniques:** Never use `document.write()`. It is destructive. Avoid overly complex JS in `EVAL` that could be simplified into a few `c4a` commands. + +──────────────────────────────────────────────────────── +## From Vague Goals to Robust Scripts: Your Duty to Infer and Ensure Reliability + +This is your most important responsibility. Users are not automation experts. They will provide incomplete or vague instructions. Your job is to be the expert—to infer their true goal and build a script that is reliable by default. You must add the "invisible scaffolding" of checks and waits to ensure the page is stable and ready for the crawler. **A vague user prompt must still result in a robust, complete script.** + +Study these examples. No matter which query is given, your output must be the single, robust solution. + +### 1. Scenario: Basic Search Query + +* **High Detail Query:** "Find the search box and search button. Wait for the search box to be visible, click it, clear it, type 'r2d2', click the search button, and then wait for the search results to appear." +* **Medium Detail Query:** "Find the search box and search for 'r2d2', click the search button until you get a list of items." +* **Low Detail Query:** "Search for r2d2." + +**THE CORRECT, ROBUST OUTPUT (for all three queries):** +``` +WAIT `input[type="search"]` 10 +SET `input[type="search"]` "r2d2" +CLICK `button[aria-label="Search"]` +WAIT `div.search-results-container` 15 +``` +**Rationale:** You correctly infer the need to `WAIT` for the input first. You use the more efficient `SET` command. Most importantly, you **infer the crucial final step**: waiting for a results container to appear, confirming the search action was successful. + +### 2. 
Scenario: Clicking a "Load More" Button + +* **High Detail Query:** "Click the button with the text 'Load More'. Afterward, wait for a new item with the class '.product-tile' to show up on the page." +* **Medium Detail Query:** "Click the load more button to see more products." +* **Low Detail Query:** "Load more items." + +**THE CORRECT, ROBUST OUTPUT:** +``` +IF EXISTS `button.load-more` THEN + CLICK `button.load-more` + WAIT `div.new-item-indicator` 8 +ENDIF +``` +**Rationale:** You wrap the action in `IF EXISTS` to prevent errors if the button is not present (e.g., on the last page). You correctly infer the need to wait for the *consequence* of the click—a new item appearing—rather than a blind `WAIT 2`. + +### 3. Scenario: Applying a Filter from a Dropdown + +* **High Detail Query:** "First, click the dropdown with id 'color-filter'. Then, from the open menu, click the option that says 'Blue'. Finally, wait for the product grid to update." +* **Medium Detail Query:** "Filter the products by the color Blue." +* **Low Detail Query:** "Show blue products." + +**THE CORRECT, ROBUST OUTPUT:** +``` +CLICK `#color-filter` +WAIT `[data-value="blue"]` 3 +CLICK `[data-value="blue"]` +WAIT `#product-grid[data-status="updated"]` 10 +``` +**Rationale:** You infer the need for two waits: one for the menu options to appear after the first click, and a second for the main content to update after the filter is applied. This prevents race conditions. + +### 4. Scenario: User Authentication (Login) + +* **High Detail Query:** "Fill username with 'USER_EMAIL', fill password with 'USER_PASS', click login, and wait for the dashboard to appear." +* **Medium Detail Query:** "Log in as USER_EMAIL with password USER_PASS." +* **Low Detail Query:** "Log in." 
+
+**THE CORRECT, ROBUST OUTPUT:**
+```
+IF EXISTS `[data-testid="logout-button"]` THEN
+  EVAL `console.log("Already logged in.")`
+ELSE
+  WAIT `input[name="username"]` 10
+  SET `input[name="username"]` "USER_EMAIL"
+  SET `input[name="password"]` "USER_PASS"
+  CLICK `button[type="submit"]`
+  WAIT `[data-testid="user-dashboard"]` 15
+ENDIF
+```
+**Rationale:** You build an **idempotent** script. You first check if the user is *already* logged in. If not, you proceed with the login and then, critically, `WAIT` for a post-login element to confirm success. You use placeholders when credentials are not provided in low-detail queries.
+
+### 5. Scenario: Dismissing an Interstitial Modal
+
+* **High Detail Query:** "Check if a popup with id '#promo-modal' exists. If it does, click the close button inside it with class '.close-x'."
+* **Medium Detail Query:** "Close the promotional popup."
+* **Low Detail Query:** "Get rid of the popup."
+
+**THE CORRECT, ROBUST OUTPUT:**
+```
+IF EXISTS `div#promo-modal` THEN
+  CLICK `div#promo-modal button.close-x`
+ENDIF
+```
+**Rationale:** You correctly identify this as a conditional action. The script must not fail if the popup doesn't appear. The `IF EXISTS` block is the perfect, robust way to handle this optional interaction.
+
+────────────────────────────────────────────────────────
+## Advanced Scenarios & Master-Level Examples
+
+Study these solutions. Understand the *why* behind each choice.
+
+### Scenario: Interacting with a Web Component (Shadow DOM)
+**Goal:** Click a button inside a custom element `<user-card>`.
+**HTML Snippet:** `<#shadow-root (open)>`
+**Correct Mode:** `javascript` (or `c4a` with `EVAL`)
+**Rationale:** Standard selectors can't cross the shadow boundary. JavaScript is mandatory.
+ +```javascript +// Solution in pure JS mode +const card = document.querySelector('user-card'); +if (card && card.shadowRoot) { + const button = card.shadowRoot.querySelector('button'); + if (button) button.click(); +} +``` +``` +# Solution in c4a mode (using EVAL as the weapon of choice) +EVAL ` + const card = document.querySelector('user-card'); + if (card && card.shadowRoot) { + const button = card.shadowRoot.querySelector('button'); + if (button) button.click(); + } +` +``` + +### Scenario: Handling a Cookie Banner +**Goal:** Accept the cookies to dismiss the modal. +**HTML Snippet:** `` +**Correct Mode:** `crawl4ai_script` +**Rationale:** A simple, direct action. `c4a` is cleaner and more declarative. + +``` +# The most efficient solution +IF EXISTS `#cookie-consent-modal` THEN + CLICK `#accept-cookies` + WAIT `div.content-loaded` 5 +ENDIF +``` + +### Scenario: Infinite Scroll Page +**Goal:** Scroll down 5 times to load more content. +**HTML Snippet:** `(A page with a long body and no "load more" button)` +**Correct Mode:** `crawl4ai_script` +**Rationale:** `REPEAT` is designed for exactly this. It's more readable than a JS loop for this simple task. + +``` +REPEAT ( + SCROLL DOWN 1000, + 5 +) +WAIT 2 +``` + +### Scenario: Hover-to-Reveal Menu +**Goal:** Hover over "Products" to open the menu, then click "Laptops". +**HTML Snippet:** `Products ` +**Correct Mode:** `crawl4ai_script` (with `EVAL`) +**Rationale:** `c4a` has no `HOVER` command. `EVAL` is the perfect tool to dispatch the `mouseover` event. + +``` +EVAL `document.querySelector('#products-menu').dispatchEvent(new MouseEvent('mouseover', { bubbles: true }))` +WAIT `div.menu-dropdown a[href="/laptops"]` 3 +CLICK `div.menu-dropdown a[href="/laptops"]` +``` + +### Scenario: Login Form +**Goal:** Fill and submit a login form. +**HTML Snippet:** `
` +**Correct Mode:** `crawl4ai_script` +**Rationale:** This is the canonical use case for `c4a`. The commands map 1:1 to the user journey. + +``` +WAIT `form` 10 +SET `input[name="email"]` "USER_EMAIL" +SET `input[name="password"]` "USER_PASS" +CLICK `button[type="submit"]` +WAIT `[data-testid="user-dashboard"]` 12 +``` + +──────────────────────────────────────────────────────── +## Final Output Mandate + +1. **CODE ONLY.** Your entire response must be the script body. +2. **NO CHAT.** Do not say "Here is the script" or "This should work." +3. **NO MARKDOWN.** Do not wrap your code in ` ``` ` fences. +4. **NO COMMENTS.** Do not add comments to the final code output. +5. **SYNTACTICALLY PERFECT.** The script must be immediately executable. +6. **UTF-8, STANDARD QUOTES.** Use `"` for string literals, not `“` or `”`. + +You are an engine of automation. Now, receive the user's request and produce the optimal script.""" + + +GENERATE_JS_SCRIPT_PROMPT = """# The World-Class JavaScript Automation Scripter + +You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** pure JavaScript script possible to prepare a web page for data extraction. + +Your scripts will be executed directly in the browser (e.g., via Playwright's `page.evaluate()`) to handle dynamic content, user interactions, and other obstacles before the page is crawled. You are a master of browser-native JavaScript APIs. + +──────────────────────────────────────────────────────── +## Your Core Philosophy: "Efficiency, Robustness, Simplicity" + +This is your mantra. Every line of JavaScript you write must adhere to it. + +1. **Efficiency (Shortest Path):** Generate the absolute minimum number of steps to achieve the goal. Do not include redundant actions. Your code should be concise and direct. +2. 
**Robustness (Will Not Break):** Prioritize selectors that are resistant to cosmetic site changes. `data-*` attributes are gold. Dynamic, auto-generated class names (`.class-a8B_x3`) are poison. Always prefer waiting for a state change over a blind `setTimeout`. +3. **Simplicity (Right Tool for the Job):** Use simple, direct DOM methods (`.querySelector`, `.click()`) whenever possible. Avoid overly complex or fragile logic when a simpler approach exists. + +──────────────────────────────────────────────────────── +## Essential JavaScript Automation Patterns & Toolkit + +All code should be wrapped in an `async` Immediately Invoked Function Expression `(async () => { ... })();` to allow for top-level `await` and to avoid polluting the global scope. + +| Task | Best-Practice JavaScript Implementation | +| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Wait for Element** | Create and use a robust `waitForElement` helper function. This is your most important tool.
`const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); });` | +| **Click Element** | `const el = await waitForElement('selector'); if (el) el.click();` | +| **Set Input Value** | `const input = await waitForElement('selector'); if (input) { input.value = 'new value'; input.dispatchEvent(new Event('input', { bubbles: true })); input.dispatchEvent(new Event('change', { bubbles: true })); }`
*Crucially, always dispatch `input` and `change` events to trigger framework reactivity.* |
+| **Check Existence** | `const el = document.querySelector('selector'); if (el) { /* ... it exists */ }` |
+| **Scroll** | `window.scrollBy(0, window.innerHeight);` |
+| **Deal with Time** | Use `await new Promise(r => setTimeout(r, 500));` for short, unavoidable pauses after an action. **Avoid long, blind waits.** |
+
+REMEMBER: Generate highly deterministic CSS selectors. If you refer to a specific button, target exactly that element; otherwise you may capture elements you do not need.
+
+────────────────────────────────────────────────────────
+## The Art of High-Specificity Selectors: Your Defense Against Ambiguity
+
+This is your most critical skill for ensuring robustness. **You must assume the provided HTML is only a small fragment of the entire page.** A selector that looks unique in the fragment could be disastrously generic on the full page. Your primary defense is to **anchor your selectors to the most specific, stable parent element available in the given HTML context.**
+
+Think of it as creating a "sandbox" for your selectors.
+
+**Your Guiding Principle:** Start from a unique parent, then find the child.
+
+### Scenario: Selecting a Submit Button within a Login Form
+
+**HTML Snippet Provided:**
+```html
+
+

Member Login

+
+ + + +
+
+```
+
+* **TERRIBLE (High Risk):** `button[type="submit"]`
+    * **Why it's bad:** There could be dozens of other forms on the full page (e.g., a newsletter signup, a search bar in the header). This selector is a shot in the dark.
+
+* **BETTER (Lower Risk):** `#login-widget button[type="submit"]`
+    * **Why it's better:** It's anchored to a unique ID (`#login-widget`). This dramatically reduces the chance of ambiguity.
+
+* **EXCELLENT (Minimal Risk):** `div[id="login-widget"] form button[type="submit"]`
+    * **Why it's best:** This is a highly specific, descriptive path. It says, "Find the login widget, then the form inside it, and then the submit button inside *that* form." It is virtually guaranteed to be unique and is resilient to minor layout changes within the form.
+
+### Scenario: Selecting an "Add to Cart" Button
+
+**HTML Snippet Provided:**
+```html
+
+

Awesome T-Shirt

+
+ +
+
+``` + +* **TERRIBLE (High Risk):** `.add-to-cart-btn` + * **Why it's bad:** A "related products" section outside this snippet might also use the same class name. + +* **EXCELLENT (Minimal Risk):** `[data-testid="product-details-main"] .add-to-cart-btn` + * **Why it's best:** It uses the stable `data-testid` attribute of the parent section as an anchor. This is the most robust pattern. + +**Your Mandate:** Always examine the provided HTML for a stable, unique parent (like an element with an `id`, a `data-testid`, or a highly specific combination of classes) and use it as the root of your selectors. **NEVER generate a generic, un-anchored selector if a better, more specific parent is available in the context.** + + +──────────────────────────────────────────────────────── +## Strategic Principles & Anti-Patterns + +These are your commandments. Do not deviate. + +1. **Selector Quality is Paramount:** + * **GOOD:** `[data-testid="submit-button"]`, `#main-content`, `[aria-label="Close dialog"]` + * **BAD:** `div > span:nth-child(3)`, `.button-gR3xY_s`, `//div[contains(@class, 'button')]` + +2. **Wait for State, Not for Time:** + * **DO:** `(await waitForElement('#load-more')).click(); await waitForElement('div.new-item');` This waits for the *result* of the action. + * **DON'T:** `document.querySelector('#load-more').click(); await new Promise(r => setTimeout(r, 5000));` This is a guess and it will fail. + +3. **Target the Action, Not the Artifact:** If you need to reveal content, click the button that reveals it. Don't try to manually change CSS `display` properties, as this can break the page's internal state. + +4. **DOM-Awareness is Non-Negotiable:** + * **Shadow DOM:** You MUST use `element.shadowRoot.querySelector(...)` to access elements inside a `#shadow-root (open)`. + * **iFrames:** You MUST use `iframe.contentDocument.querySelector(...)` to interact with elements inside an iframe. + +5. **Be Idempotent:** Your script must be harmless if run multiple times. 
Use `if (document.querySelector(...))` checks to avoid re-doing actions unnecessarily. + +6. **Forbidden Techniques:** Never use `document.write()`. It is destructive. + +──────────────────────────────────────────────────────── +## From Vague Goals to Robust Scripts: Your Duty to Infer and Ensure Reliability + +This is your most important responsibility. Users are not automation experts. They will provide incomplete or vague instructions. Your job is to be the expert—to infer their true goal and build a script that is reliable by default. **A vague user prompt must still result in a robust, complete script.** + +Study these examples. No matter which query is given, your output must be the single, robust solution. + +### 1. Scenario: Basic Search Query + +* **High Detail Query:** "Find the search box and search button. Wait for the search box to be visible, click it, clear it, type 'r2d2', click the search button, and then wait for the search results to appear." +* **Medium Detail Query:** "Find the search box and search for 'r2d2'." +* **Low Detail Query:** "Search for r2d2." 
+ +**THE CORRECT, ROBUST JAVASCRIPT OUTPUT (for all three queries):** +```javascript +(async () => { + const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); }); + try { + const searchInput = await waitForElement('input[type="search"], input[aria-label*="search"]'); + searchInput.value = 'r2d2'; + searchInput.dispatchEvent(new Event('input', { bubbles: true })); + const searchButton = await waitForElement('button[type="submit"], button[aria-label*="search"]'); + searchButton.click(); + await waitForElement('div.search-results-container, #search-results'); + } catch (e) { + console.error('Search script failed:', e.message); + } +})(); +``` + +### 2. Scenario: Clicking a "Load More" Button + +* **High Detail Query:** "Click the button with the text 'Load More'. Afterward, wait for a new item with the class '.product-tile' to show up." +* **Medium Detail Query:** "Click the load more button." +* **Low Detail Query:** "Load more items." 
+ +**THE CORRECT, ROBUST JAVASCRIPT OUTPUT:** +```javascript +(async () => { + const loadMoreButton = document.querySelector('button.load-more, [data-testid="load-more"]'); + if (loadMoreButton) { + const initialItemCount = document.querySelectorAll('.product-tile').length; + loadMoreButton.click(); + const waitForNewItem = (timeout = 8000) => new Promise((resolve, reject) => { const t0 = Date.now(); const check = () => { if (document.querySelectorAll('.product-tile').length > initialItemCount) return resolve(); if (Date.now() - t0 > timeout) return reject(new Error('Timeout waiting for new items to load.')); setTimeout(check, 200); }; check(); }); + await waitForNewItem(); + } +})(); +``` + +### 3. Scenario: User Authentication (Login) + +* **High Detail Query:** "Fill username with 'USER_EMAIL', password with 'USER_PASS', click login, and wait for the dashboard." +* **Medium Detail Query:** "Log in as USER_EMAIL." +* **Low Detail Query:** "Log in." + +**THE CORRECT, ROBUST JAVASCRIPT OUTPUT:** +```javascript +(async () => { + if (document.querySelector('[data-testid="logout-button"]')) { + console.log('Already logged in.'); + return; + } + const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { const el = document.querySelector(selector); if (el) return resolve(el); const observer = new MutationObserver(() => { const el = document.querySelector(selector); if (el) { observer.disconnect(); resolve(el); } }); observer.observe(document.body, { childList: true, subtree: true }); setTimeout(() => { observer.disconnect(); reject(new Error(`Timeout waiting for ${selector}`)); }, timeout); }); + try { + const userInput = await waitForElement('input[name*="user"], input[name*="email"]'); + userInput.value = 'USER_EMAIL'; + userInput.dispatchEvent(new Event('input', { bubbles: true })); + const passInput = await waitForElement('input[name*="pass"], input[type="password"]'); + passInput.value = 'USER_PASS'; + passInput.dispatchEvent(new 
Event('input', { bubbles: true })); + const submitButton = await waitForElement('button[type="submit"]'); + submitButton.click(); + await waitForElement('[data-testid="user-dashboard"], #dashboard, .account-page'); + } catch (e) { + console.error('Login script failed:', e.message); + } +})(); +``` + +──────────────────────────────────────────────────────── +## The Art of High-Specificity Selectors: Your Defense Against Ambiguity + +This is your most critical skill for ensuring robustness. **You must assume the provided HTML is only a small fragment of the entire page.** A selector that looks unique in the fragment could be disastrously generic on the full page. Your primary defense is to **anchor your selectors to the most specific, stable parent element available in the given HTML context.** + +Think of it as creating a "sandbox" for your selectors. + +**Your Guiding Principle:** Start from a unique parent, then find the child. + +### Scenario: Selecting a Submit Button within a Login Form + +**HTML Snippet Provided:** +```html +
+

Member Login

+
+ + + +
+
+```
+
+* **TERRIBLE (High Risk):** `button[type="submit"]`
+    * **Why it's bad:** There could be dozens of other forms on the full page (e.g., a newsletter signup, a search bar in the header). This selector is a shot in the dark.
+
+* **BETTER (Lower Risk):** `#login-widget button[type="submit"]`
+    * **Why it's better:** It's anchored to a unique ID (`#login-widget`). This dramatically reduces the chance of ambiguity.
+
+* **EXCELLENT (Minimal Risk):** `div[id="login-widget"] form button[type="submit"]`
+    * **Why it's best:** This is a highly specific, descriptive path. It says, "Find the login widget, then the form inside it, and then the submit button inside *that* form." It is virtually guaranteed to be unique and is resilient to minor layout changes within the form.
+
+### Scenario: Selecting an "Add to Cart" Button
+
+**HTML Snippet Provided:**
+```html
+
+

Awesome T-Shirt

+
+ +
+
+``` + +* **TERRIBLE (High Risk):** `.add-to-cart-btn` + * **Why it's bad:** A "related products" section outside this snippet might also use the same class name. + +* **EXCELLENT (Minimal Risk):** `[data-testid="product-details-main"] .add-to-cart-btn` + * **Why it's best:** It uses the stable `data-testid` attribute of the parent section as an anchor. This is the most robust pattern. + +**Your Mandate:** Always examine the provided HTML for a stable, unique parent (like an element with an `id`, a `data-testid`, or a highly specific combination of classes) and use it as the root of your selectors. **NEVER generate a generic, un-anchored selector if a better, more specific parent is available in the context.** + + +──────────────────────────────────────────────────────── +## Final Output Mandate + +1. **CODE ONLY.** Your entire response must be the script body. +2. **NO CHAT.** Do not say "Here is the script" or "This should work." +3. **NO MARKDOWN.** Do not wrap your code in ` ``` ` fences. +4. **NO COMMENTS.** Do not add comments to the final code output, except within the logic where it's a best practice. +5. **SYNTACTICALLY PERFECT.** The script must be a single, self-contained block, immediately executable. Wrap it in `(async () => { ... })();`. +6. **UTF-8, STANDARD QUOTES.** Use `'` for string literals, not `“` or `”`. + +You are an engine of automation. Now, receive the user's request and produce the optimal JavaScript.""" + + + + diff --git a/crawl4ai/script/c4a_compile.py b/crawl4ai/script/c4a_compile.py index 2ee41cfe..eb064403 100644 --- a/crawl4ai/script/c4a_compile.py +++ b/crawl4ai/script/c4a_compile.py @@ -8,12 +8,20 @@ import pathlib import re from typing import Union, List, Optional +# JSON_SCHEMA_BUILDER is still used elsewhere, +# but we now also need the new script-builder prompt. 
+from ..prompts import GENERATE_JS_SCRIPT_PROMPT, GENERATE_SCRIPT_PROMPT
+import logging
+import re
+
 from .c4a_result import (
     CompilationResult, ValidationResult, ErrorDetail, WarningDetail,
     ErrorType, Severity, Suggestion
 )
 from .c4ai_script import Compiler
 from lark.exceptions import UnexpectedToken, UnexpectedCharacters, VisitError
+from ..async_configs import LLMConfig
+from ..utils import perform_completion_with_backoff
 
 
 class C4ACompiler:
@@ -311,6 +319,68 @@ class C4ACompiler:
                 source_line=script_lines[0] if script_lines else ""
             )
 
+    @staticmethod
+    def generate_script(
+        html: str,
+        query: str | None = None,
+        mode: str = "c4a",
+        llm_config: LLMConfig | None = None,
+        **completion_kwargs,
+    ) -> str:
+        """
+        One-shot helper that calls the LLM exactly once to convert a
+        natural-language goal + HTML snippet into either:
+
+        1. raw JavaScript (`mode="js"`)
+        2. Crawl4ai DSL (`mode="c4a"`)
+
+        The returned string is guaranteed to be free of markdown wrappers
+        or explanatory text, ready for direct execution.
+        """
+        if llm_config is None:
+            llm_config = LLMConfig()  # falls back to env vars / defaults
+
+        # Build the user chunk
+        user_prompt = "\n".join(
+            [
+                "## GOAL",
+                "<>",
+                (query or "Prepare the page for crawling."),
+                "<>",
+                "",
+                "## HTML",
+                "<>",
+                html[:100000],  # guardrail against token blast
+                "<>",
+                "",
+                "## MODE",
+                mode,
+            ]
+        )
+
+        # Call the LLM with retry/back-off logic
+        full_prompt = f"{GENERATE_SCRIPT_PROMPT}\n\n{user_prompt}" if mode == "c4a" else f"{GENERATE_JS_SCRIPT_PROMPT}\n\n{user_prompt}"
+
+        response = perform_completion_with_backoff(
+            provider=llm_config.provider,
+            prompt_with_variables=full_prompt,
+            api_token=llm_config.api_token,
+            json_response=False,
+            base_url=getattr(llm_config, 'base_url', None),
+            **completion_kwargs,
+        )
+
+        # Extract content from the response
+        raw_response = response.choices[0].message.content.strip()
+
+        # Strip accidental markdown fences (```js … ```)
+        clean = re.sub(r"^```(?:[a-zA-Z0-9_-]+)?\s*|```$", "", raw_response, flags=re.MULTILINE).strip()
+
+        if not clean:
+            raise RuntimeError("LLM returned empty script.")
+
+        return clean
+
 
 # Convenience functions for direct use
 def compile(script: Union[str, List[str]], root: Optional[pathlib.Path] = None) -> CompilationResult:
diff --git a/docs/examples/c4a_script/C4A_SCRIPT_DOCS.md b/docs/examples/c4a_script/C4A_SCRIPT_DOCS.md
deleted file mode 100644
index f388f113..00000000
--- a/docs/examples/c4a_script/C4A_SCRIPT_DOCS.md
+++ /dev/null
@@ -1,312 +0,0 @@
-# C4A-Script Language Documentation
-
-C4A-Script (Crawl4AI Script) is a simple, powerful language for web automation. Write human-readable commands that compile to JavaScript for browser automation.
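Review note on the `generate_script` hunk in `crawl4ai/script/c4a_compile.py` earlier in this diff: the fence-stripping step at the end of that method can be exercised without an LLM call. The following is a minimal sketch, assuming only the standard library. The helper name `strip_fences` and the sample inputs are hypothetical and used for illustration; the regular expression and the empty-result check are copied from the hunk.

```python
import re

# Pattern copied from generate_script: remove an opening fence (with an
# optional language tag) at the start of a line, or a closing fence at the
# end of a line. re.MULTILINE makes ^ and $ match at every line boundary.
FENCE_RE = re.compile(r"^```(?:[a-zA-Z0-9_-]+)?\s*|```$", flags=re.MULTILINE)


def strip_fences(raw_response: str) -> str:
    """Hypothetical stand-alone version of the cleanup in generate_script."""
    clean = FENCE_RE.sub("", raw_response.strip()).strip()
    if not clean:
        # Same failure mode as the method in the diff.
        raise RuntimeError("LLM returned empty script.")
    return clean


# A reply wrapped in a fenced block despite the "NO MARKDOWN" mandate.
fenced = "```js\nconsole.log('hi');\n```"
print(strip_fences(fenced))

# A bare c4a line: single backticks around selectors are left untouched.
bare = "CLICK `#accept-cookies`"
print(strip_fences(bare))
```

Because `\s*` follows the opening fence, the pattern also consumes the newline and any indentation before the first code line. That appears harmless for JavaScript and for flat c4a scripts, but it may be worth keeping in mind if indentation ever becomes significant.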
- -## Quick Start - -```python -from c4a_compile import compile - -# Write your script -script = """ -GO https://example.com -WAIT `#content` 5 -CLICK `button.submit` -""" - -# Compile to JavaScript -result = compile(script) - -if result.success: - # Use with Crawl4AI - config = CrawlerRunConfig(js_code=result.js_code) -else: - print(f"Error at line {result.first_error.line}: {result.first_error.message}") -``` - -## Language Basics - -- **One command per line** -- **Selectors in backticks**: `` `button.submit` `` -- **Strings in quotes**: `"Hello World"` -- **Variables with $**: `$username` -- **Comments with #**: `# This is a comment` - -## Commands Reference - -### Navigation - -```c4a -GO https://example.com # Navigate to URL -RELOAD # Reload current page -BACK # Go back in history -FORWARD # Go forward in history -``` - -### Waiting - -```c4a -WAIT 3 # Wait 3 seconds -WAIT `#content` 10 # Wait for element (max 10 seconds) -WAIT "Loading complete" 5 # Wait for text to appear -``` - -### Mouse Actions - -```c4a -CLICK `button.submit` # Click element -DOUBLE_CLICK `.item` # Double-click element -RIGHT_CLICK `#menu` # Right-click element -CLICK 100 200 # Click at coordinates - -MOVE 500 300 # Move mouse to position -DRAG 100 100 500 300 # Drag from one point to another - -SCROLL DOWN 500 # Scroll down 500 pixels -SCROLL UP # Scroll up (default 500px) -SCROLL LEFT 200 # Scroll left 200 pixels -SCROLL RIGHT # Scroll right -``` - -### Keyboard - -```c4a -TYPE "hello@example.com" # Type text -TYPE $email # Type variable value - -PRESS Tab # Press and release key -PRESS Enter -PRESS Escape - -KEY_DOWN Shift # Hold key down -KEY_UP Shift # Release key -``` - -### Control Flow - -#### IF-THEN-ELSE - -```c4a -# Check if element exists -IF (EXISTS `.cookie-banner`) THEN CLICK `.accept` -IF (EXISTS `#user`) THEN CLICK `.logout` ELSE CLICK `.login` - -# JavaScript conditions -IF (`window.innerWidth < 768`) THEN CLICK `.mobile-menu` -IF 
(`document.querySelectorAll('.item').length > 10`) THEN SCROLL DOWN -``` - -#### REPEAT - -```c4a -# Repeat fixed number of times -REPEAT (CLICK `.next`, 5) - -# Repeat based on JavaScript expression -REPEAT (SCROLL DOWN 300, `document.querySelectorAll('.item').length`) - -# Repeat while condition is true (like while loop) -REPEAT (CLICK `.load-more`, `document.querySelector('.load-more') !== null`) -``` - -### Variables & JavaScript - -```c4a -# Set variables -SET username = "john@example.com" -SET count = "10" - -# Use variables -TYPE $username - -# Execute JavaScript -EVAL `console.log('Hello')` -EVAL `localStorage.setItem('key', 'value')` -``` - -### Procedures - -```c4a -# Define reusable procedure -PROC login - CLICK `#email` - TYPE $email - CLICK `#password` - TYPE $password - CLICK `button[type="submit"]` -ENDPROC - -# Use procedure -SET email = "user@example.com" -SET password = "secure123" -login - -# Procedures work with control flow -IF (EXISTS `.login-form`) THEN login -REPEAT (process_item, 10) -``` - -## API Reference - -### Functions - -```python -from c4a_compile import compile, validate, compile_file - -# Compile script -result = compile("GO https://example.com") - -# Validate syntax only -result = validate(script) - -# Compile from file -result = compile_file("script.c4a") -``` - -### Working with Results - -```python -result = compile(script) - -if result.success: - # Access generated JavaScript - js_code = result.js_code # List[str] - - # Use with Crawl4AI - config = CrawlerRunConfig(js_code=js_code) -else: - # Handle errors - error = result.first_error - print(f"Line {error.line}, Column {error.column}: {error.message}") - - # Get suggestions - for suggestion in error.suggestions: - print(f"Fix: {suggestion.message}") - - # Get JSON for UI integration - error_json = result.to_json() -``` - -## Examples - -### Basic Automation - -```c4a -GO https://example.com -WAIT `#content` 5 -IF (EXISTS `.cookie-notice`) THEN CLICK `.accept` -CLICK 
`.main-button` -``` - -### Form Filling - -```c4a -SET email = "user@example.com" -SET message = "Hello, I need help with my order" - -GO https://example.com/contact -WAIT `form` 5 -CLICK `input[name="email"]` -TYPE $email -CLICK `textarea[name="message"]` -TYPE $message -CLICK `button[type="submit"]` -WAIT "Thank you" 10 -``` - -### Dynamic Content Loading - -```c4a -GO https://shop.example.com -WAIT `.product-list` 10 - -# Load all products -REPEAT (CLICK `.load-more`, `document.querySelector('.load-more') !== null`) - -# Extract data -EVAL ` - const count = document.querySelectorAll('.product').length; - console.log('Found ' + count + ' products'); -` -``` - -### Smart Navigation - -```c4a -PROC handle_popups - IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-all` - IF (EXISTS `.newsletter-modal`) THEN CLICK `.close` -ENDPROC - -GO https://example.com -handle_popups -WAIT `.main-content` 5 - -# Navigate based on login state -IF (EXISTS `.user-avatar`) THEN CLICK `.dashboard` ELSE CLICK `.login` -``` - -## Error Messages - -C4A-Script provides clear, helpful error messages: - -``` -============================================================ -Syntax Error [E001] -============================================================ -Location: Line 3, Column 23 -Error: Missing 'THEN' keyword after IF condition - -Code: - 3 | IF (EXISTS `.button`) CLICK `.button` - | ^ - -Suggestions: - 1. Add 'THEN' after the condition -============================================================ -``` - -Common error codes: -- **E001**: Missing 'THEN' keyword -- **E002**: Missing closing parenthesis -- **E003**: Missing comma in REPEAT -- **E004**: Missing ENDPROC -- **E005**: Undefined procedure -- **E006**: Missing backticks for selector - -## Best Practices - -1. **Always use backticks for selectors**: `` CLICK `button` `` not `CLICK button` -2. **Check element existence before interaction**: `IF (EXISTS `.modal`) THEN CLICK `.close` -3. 
**Set appropriate wait times**: Don't wait too long or too short -4. **Use procedures for repeated actions**: Keep your code DRY -5. **Add comments for clarity**: `# Check if user is logged in` - -## Integration with Crawl4AI - -```python -from c4a_compile import compile -from crawl4ai import CrawlerRunConfig, WebCrawler - -# Compile your script -script = """ -GO https://example.com -WAIT `.content` 5 -CLICK `.load-more` -""" - -result = compile(script) - -if result.success: - # Create crawler config with compiled JS - config = CrawlerRunConfig( - js_code=result.js_code, - wait_for="css:.results" - ) - - # Run crawler - async with WebCrawler() as crawler: - result = await crawler.arun(config=config) -``` - -That's it! You're ready to automate the web with C4A-Script. \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/README.md b/docs/examples/c4a_script/amazon_example/README.md new file mode 100644 index 00000000..234d603a --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/README.md @@ -0,0 +1,171 @@ +# Amazon R2D2 Product Search Example + +A real-world demonstration of Crawl4AI's multi-step crawling with LLM-generated automation scripts. + +## 🎯 What This Example Shows + +This example demonstrates advanced Crawl4AI features: +- **LLM-Generated Scripts**: Automatically create C4A-Script from HTML snippets +- **Multi-Step Crawling**: Navigate through multiple pages using session persistence +- **Structured Data Extraction**: Extract product data using JSON CSS schemas +- **Visual Automation**: Watch the browser perform the search (headless=False) + +## 🚀 How It Works + +### 1. **Script Generation Phase** +The example uses `C4ACompiler.generate_script()` to analyze Amazon's HTML and create: +- **Search Script**: Automates filling the search box and clicking search +- **Extraction Schema**: Defines how to extract product information + +### 2. 
**Crawling Workflow** +``` +Homepage → Execute Search Script → Extract Products → Save Results +``` + +All steps use the same `session_id` to maintain browser state. + +### 3. **Data Extraction** +Products are extracted with: +- Title, price, rating, reviews +- Delivery information +- Sponsored/Small Business badges +- Direct product URLs + +## 📁 Files + +- `amazon_r2d2_search.py` - Main example script +- `header.html` - Amazon search bar HTML (provided) +- `product.html` - Product card HTML (provided) +- **Generated files:** + - `generated_search_script.js` - Auto-generated search automation + - `generated_product_schema.json` - Auto-generated extraction rules + - `extracted_products.json` - Final scraped data + - `search_results_screenshot.png` - Visual proof of results + +## 🏃 Running the Example + +1. **Prerequisites** + ```bash + # Ensure Crawl4AI is installed + pip install crawl4ai + + # Set up LLM API key (for script generation) + export OPENAI_API_KEY="your-key-here" + ``` + +2. **Run the scraper** + ```bash + python amazon_r2d2_search.py + ``` + +3. **Watch the magic!** + - Browser window opens (not headless) + - Navigates to Amazon.com + - Searches for "r2d2" + - Extracts all products + - Saves results to JSON + +## 📊 Sample Output + +```json +[ + { + "title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees", + "price": "29.95", + "rating": "4.7", + "reviews_count": "184", + "delivery": "FREE delivery Thu, Jun 19", + "url": "https://www.amazon.com/Death-Star-R2D2-Balls-Printed/dp/B081XSYZMS", + "is_sponsored": true, + "small_business": true + }, + ... +] +``` + +## 🔍 Key Features Demonstrated + +### Session Persistence +```python +# Same session_id across multiple arun() calls +config = CrawlerRunConfig( + session_id="amazon_r2d2_session", + # ...
other settings +) +``` + +### LLM Script Generation +```python +# Generate automation from natural language + HTML +script = C4ACompiler.generate_script( + html=header_html, + query="Find search box, type 'r2d2', click search", + mode="c4a" +) +``` + +### JSON CSS Extraction +```python +# Structured data extraction with CSS selectors +schema = { + "baseSelector": "[data-component-type='s-search-result']", + "fields": [ + {"name": "title", "selector": "h2 a span", "type": "text"}, + {"name": "price", "selector": ".a-price-whole", "type": "text"} + ] +} +``` + +## 🛠️ Customization + +### Search Different Products +Change the search term in the script generation: +```python +search_goal = """ +... +3. Type "star wars lego" into the search box +... +""" +``` + +### Extract More Data +Add fields to the extraction schema: +```python +"fields": [ + # ... existing fields + {"name": "prime", "selector": ".s-prime", "type": "exists"}, + {"name": "image_url", "selector": "img.s-image", "type": "attribute", "attribute": "src"} +] +``` + +### Use Different Sites +Adapt the approach for other e-commerce sites by: +1. Providing their HTML snippets +2. Adjusting the search goals +3. Updating the extraction schema + +## 🎓 Learning Points + +1. **No Manual Scripting**: LLM generates all automation code +2. **Session Management**: Maintain state across page navigations +3. **Robust Extraction**: Handle dynamic content and multiple products +4. 
**Error Handling**: Generation failures are reported rather than silently ignored + +## 🐛 Troubleshooting + +- **"No products found"**: Check if Amazon's HTML structure changed +- **"Script generation failed"**: Ensure LLM API key is configured +- **"Page timeout"**: Increase wait times in the config +- **"Session lost"**: Ensure same session_id is used consistently + +## 📚 Next Steps + +- Try searching for different products +- Add pagination to get more results +- Extract product detail pages +- Compare prices across different sellers +- Build a price monitoring system + +--- + +This example shows the power of combining LLM intelligence with web automation. Because the scripts are generated from HTML snippets and natural-language goals, they can be regenerated when a site's markup changes, which makes automation accessible without hand-written selectors. \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/amazon_r2d2_search.py b/docs/examples/c4a_script/amazon_example/amazon_r2d2_search.py new file mode 100644 index 00000000..66c586b0 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/amazon_r2d2_search.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +""" +Amazon R2D2 Product Search Example using Crawl4AI + +This example demonstrates: +1. Using an LLM to generate an automation script (JavaScript here) from HTML snippets +2. Multi-step crawling with session persistence +3. JSON CSS extraction for structured product data +4.
Complete workflow: homepage → search → extract products + +Requirements: +- Crawl4AI with generate_script support +- LLM API key (configured in environment) +""" + +import asyncio +import json +import os +from pathlib import Path +from typing import List, Dict, Any + +from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode +from crawl4ai.extraction_strategy import JsonCssExtractionStrategy +from crawl4ai.script.c4a_compile import C4ACompiler + + +class AmazonR2D2Scraper: + def __init__(self): + self.base_dir = Path(__file__).parent + self.search_script_path = self.base_dir / "generated_search_script.js" + self.schema_path = self.base_dir / "generated_product_schema.json" + self.results_path = self.base_dir / "extracted_products.json" + self.session_id = "amazon_r2d2_session" + + async def generate_search_script(self) -> str: + """Generate JavaScript for Amazon search interaction""" + print("🔧 Generating search script from header.html...") + + # Check if already generated + if self.search_script_path.exists(): + print("✅ Using cached search script") + return self.search_script_path.read_text() + + # Read the header HTML + header_html = (self.base_dir / "header.html").read_text() + + # Generate script using LLM + search_goal = """ + Find the search box and search button, then: + 1. Wait for the search box to be visible + 2. Click on the search box to focus it + 3. Clear any existing text + 4. Type "r2d2" into the search box + 5. Click the search submit button + 6. 
Wait for navigation to complete and search results to appear + """ + + try: + script = C4ACompiler.generate_script( + html=header_html, + query=search_goal, + mode="js" + ) + + # Save for future use + self.search_script_path.write_text(script) + print("✅ Search script generated and saved!") + print(f"📄 Script:\n{script}") + return script + + except Exception as e: + raise RuntimeError(f"Error generating search script: {e}") from e # fail loudly instead of returning None + + + async def generate_product_schema(self) -> Dict[str, Any]: + """Generate JSON CSS extraction schema from product HTML""" + print("\n🔧 Generating product extraction schema...") + + # Check if already generated + if self.schema_path.exists(): + print("✅ Using cached extraction schema") + return json.loads(self.schema_path.read_text()) + + # Read the product HTML + product_html = (self.base_dir / "product.html").read_text() + + # Generate extraction schema using LLM + schema_goal = """ + Create a JSON CSS extraction schema to extract: + - Product title (from the h2 element) + - Price (the dollar amount) + - Rating (star rating value) + - Number of reviews + - Delivery information + - Product URL (from the main product link) + - Whether it's sponsored + - Small business badge if present + + The schema should handle multiple products on a search results page.
+ """ + + try: + # Ask the LLM for a JSON CSS extraction schema (returned as a dict) + schema = JsonCssExtractionStrategy.generate_schema( + html=product_html, + query=schema_goal, + ) + + # Save for future use + self.schema_path.write_text(json.dumps(schema, indent=2)) + print("✅ Extraction schema generated and saved!") + print(f"📄 Schema fields: {[f['name'] for f in schema['fields']]}") + return schema + + except Exception as e: + raise RuntimeError(f"Error generating schema: {e}") from e # fail loudly instead of returning None + + async def crawl_amazon(self): + """Main crawling logic: a single arun() call that searches and extracts within one session""" + print("\n🚀 Starting Amazon R2D2 product search...") + + # Generate scripts and schemas + search_script = await self.generate_search_script() + product_schema = await self.generate_product_schema() + + # Configure browser (headless=False to see the action) + browser_config = BrowserConfig( + headless=False, + verbose=True, + viewport_width=1920, + viewport_height=1080 + ) + + async with AsyncWebCrawler(config=browser_config) as crawler: + print("\n📍 Step 1: Navigate to Amazon and search for R2D2") + + # Navigate to Amazon and execute the search + search_config = CrawlerRunConfig( + session_id=self.session_id, + js_code=f"(() => {{ {search_script} }})()", # Execute generated JS + wait_for=".s-search-results", # Wait for search results + extraction_strategy=JsonCssExtractionStrategy(schema=product_schema), + delay_before_return_html=3.0 # Give time for results to load + ) + + results = await crawler.arun( + url="https://www.amazon.com", + config=search_config + ) + + if not results.success: + print("❌ Failed to search Amazon") + print(f"Error: {results.error_message}") + return + + print("✅ Search completed successfully!") + print("✅ Product extraction completed!") + + # Extract and save results + print("\n📍 Extracting product data") + + if results[0].extracted_content: + products = json.loads(results[0].extracted_content) + print(f"🔍 Found {len(products)} products in search results") + + print(f"✅ Extracted
{len(products)} R2D2 products") + + # Save results + self.results_path.write_text( + json.dumps(products, indent=2) + ) + print(f"💾 Results saved to: {self.results_path}") + + # Print sample results (fields may be missing, so use .get) + print("\n📊 Sample Results:") + for i, product in enumerate(products[:3], 1): + print(f"\n{i}. {product.get('title', '')[:60]}...") + print(f" Price: {product.get('price', 'N/A')}") + print(f" Rating: {product.get('rating', 'N/A')} ({product.get('number_of_reviews', '0')} reviews)") + print(f" {'🏪 Small Business' if product.get('small_business_badge') else ''}") + print(f" {'📢 Sponsored' if product.get('sponsored') else ''}") + + else: + print("❌ No products extracted") + + + +async def main(): + """Run the Amazon scraper""" + scraper = AmazonR2D2Scraper() + await scraper.crawl_amazon() + + print("\n🎉 Amazon R2D2 search example completed!") + print("Check the generated files:") + print(" - generated_search_script.js") + print(" - generated_product_schema.json") + print(" - extracted_products.json") + print(" - search_results_screenshot.png") + + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/extracted_products.json b/docs/examples/c4a_script/amazon_example/extracted_products.json new file mode 100644 index 00000000..72862949 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/extracted_products.json @@ -0,0 +1,114 @@ +[ + { + "title": "Death Star BB8 R2D2 Golf Balls with 20 Printed tees \u2022 Great Gift IDEA from Moms, DADS and Kids -", + "price": "$29.95", + "rating": "4.7 out of 5 stars", + "number_of_reviews": "184", + "delivery_info": "FREE delivery", + "product_url":
"/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfYXRmOjIwMDA2NzY0ODgwMjc5ODo6MDo6&url=%2FDeath-Star-R2D2-Balls-Printed%2Fdp%2FB081XSYZMS%2Fref%3Dsr_1_1_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-1-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9hdGY%26psc%3D1", + "sponsored": "Sponsored", + "small_business_badge": "Small Business" + }, + { + "title": "TEENKON French Press Insulated 304 Stainless Steel Coffee Maker, 32 Oz Robot R2D2 Hand Home Coffee Presser, with Filter Screen for Brew Coffee and Tea (White)", + "price": "$49.99", + "rating": "4.3 out of 5 stars", + "number_of_reviews": "82", + "delivery_info": "Delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDAzNzc4Njg4MDAwMjo6MDo6&url=%2FTEENKON-French-Insulated-Stainless-Presser%2Fdp%2FB0CD3HH5PN%2Fref%3Dsr_1_17_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-17-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "3D Illusion LED Night Light,7 Colors Gradual Changing Touch Switch USB Table Lamp for Holiday Gifts or Home Decorations (R2-D2)", + "price": "$9.97", + "rating": "4.3 out of 5 stars", + "number_of_reviews": "235", + "delivery_info": 
"Delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA0NjMwMTQwODA4MTo6MDo6&url=%2FIllusion-Gradual-Changing-Holiday-Decorations%2Fdp%2FB089NMBKF2%2Fref%3Dsr_1_18_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-18-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "Paladone Star Wars R2-D2 Headlamp with Droid Sounds, Officially Licensed Disney Star Wars Head Lamp and Reading Light", + "price": "$21.99", + "rating": "4.1 out of 5 stars", + "number_of_reviews": "66", + "delivery_info": "FREE delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDI1NjA0MDQwMTUwMjo6MDo6&url=%2FSounds-Officially-Licensed-Headlamp-Flashlight%2Fdp%2FB09RTDZF8J%2Fref%3Dsr_1_19_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-19-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "4 Pcs Set Star Wars Kylo Ren BB8 Stormtrooper R2D2 Silicone Travel Luggage Baggage Identification Labels ID Tag for Bag Suitcase Plane Cruise Ships with Belt Strap", + "price": "$16.99", + "rating": "4.7 out of 5 stars", + "number_of_reviews": "3,414", + 
"delivery_info": "FREE delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDAyMzk3ODkwMzIxMTo6MDo6&url=%2FFinex-Set-Suitcase-Adjustable-Stormtrooper%2Fdp%2FB01D1CBFJS%2Fref%3Dsr_1_24_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-24-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored", + "small_business_badge": "Small Business" + }, + { + "title": "Papyrus Star Wars Birthday Card Assortment, Darth Vader, Storm Trooper, and R2-D2 (3-Count)", + "price": "$23.16", + "rating": "4.8 out of 5 stars", + "number_of_reviews": "328", + "delivery_info": "FREE delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDcwNzI4MjA1MzcwMjo6MDo6&url=%2FPapyrus-Birthday-Assortment-Characters-3-Count%2Fdp%2FB07YT2ZPKX%2Fref%3Dsr_1_25_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-25-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "STAR WARS R2-D2 Artoo 3D Top Motion Lamp, Mood Light | 18 Inches", + "price": "$69.99", + "rating": "4.5 out of 5 stars", + "number_of_reviews": "520", + "delivery_info": "FREE delivery", + "product_url": 
"/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA5NDc3MzczMTQ0MTo6MDo6&url=%2FR2-D2-Artoo-Motion-Light-Inches%2Fdp%2FB08MCWPHQR%2Fref%3Dsr_1_26_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-26-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "Saturday Park Star Wars Droids Full Sheet Set - 4 Piece 100% Organic Cotton Sheets Features R2-D2 & BB-8 - GOTS & Oeko-TEX Certified (Star Wars Official)", + "price": "$70.00", + "rating": "4.5 out of 5 stars", + "number_of_reviews": "388", + "delivery_info": "FREE delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDAyMzI0NDI5MDQwMjo6MDo6&url=%2FSaturday-Park-Star-Droids-Sheet%2Fdp%2FB0BBSFX4J2%2Fref%3Dsr_1_27_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-27-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored", + "small_business_badge": "1 sustainability feature" + }, + { + "title": "AQUARIUS Star Wars R2D2 Action Figure Funky Chunky Novelty Magnet for Refrigerator, Locker, Whiteboard & Game Room Officially Licensed Merchandise & Collectibles", + "price": "$11.94", + "rating": "4.3 out of 5 stars", + 
"number_of_reviews": "10", + "delivery_info": "FREE delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDA5MDMwMzY5NjEwMjo6MDo6&url=%2FAQUARIUS-Refrigerator-Whiteboard-Merchandise-Collectibles%2Fdp%2FB09W8VKXGC%2Fref%3Dsr_1_32_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-32-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "STAR WARS C-3PO and R2-D2 Men's Crew Socks 2 Pair Pack", + "price": "$11.95", + "rating": "4.7 out of 5 stars", + "number_of_reviews": "1,272", + "delivery_info": "Delivery", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDAxMDk5NDkyMTg2MTo6MDo6&url=%2FStar-Wars-R2-D2-C-3PO-Socks%2Fdp%2FB0178IU1GY%2Fref%3Dsr_1_33_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-33-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored" + }, + { + "title": "Buckle-Down Belt Women's Cinch Star Wars R2D2 Bounding Parts3 White Black Blue Gray Available In Adjustable Sizes", + "price": "$24.95", + "rating": "4.3 out of 5 stars", + "number_of_reviews": "32", + "delivery_info": "FREE delivery", + "product_url": 
"/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjMwMDY1OTQ5NTQ4MzkwMjo6MDo6&url=%2FWomens-Cinch-Bounding-Parts3-Inches%2Fdp%2FB07WK7RG4D%2Fref%3Dsr_1_34_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-34-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored", + "small_business_badge": "Small Business" + }, + { + "title": "Star Wars R2D2 Metal Head Vintage Disney+ T-Shirt", + "price": "$22.99", + "rating": "4.8 out of 5 stars", + "number_of_reviews": "869", + "product_url": "/sspa/click?ie=UTF8&spc=MToxNDMzMjA0MzA4MzEzMjAxOjE3NDkzMDI3NDY6c3BfbXRmOjIwMDA1OTUyMzgzNDMyMTo6MDo6&url=%2FStar-Wars-Vintage-Graphic-T-Shirt%2Fdp%2FB07H9PSNXS%2Fref%3Dsr_1_35_sspa%3Fdib%3DeyJ2IjoiMSJ9.iiJYY01upNMdD4BNNt8CYLZEIMXulNkcBlKEMJlr_U_h9eSGqChxwcIiCKUbJeEO_plLkXZvB7Yx-v4UDOCdiUFI-sHFgcTznXrP7tdD8xHpRaMKmaBDWMCAFwzPmVcgK_6Q9qIRoN4sp8tunKX26j5EC_8LiK-D5QximGkE8i8f-R5GhSUo__DaSkAP1cnzxUtSESfA8fYfewsZ1iSol9_zohE6r1ZZeawnWHPmDTkLqzCW3uK44EnvJbPFvzMlpiKcs9p9Eh9w5Rc5rrumMihdaWkC63B0cz5jU-S2Ieg._D8d5nv3hOExHPbZ04L-vaC7YwJjEZM-vu5AED5sz0U%26dib_tag%3Dse%26keywords%3Dr2d2%26qid%3D1749302746%26sr%3D8-35-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9tdGY%26psc%3D1", + "sponsored": "Sponsored", + "small_business_badge": "1 sustainability feature" + } +] \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/generated_product_schema.json b/docs/examples/c4a_script/amazon_example/generated_product_schema.json new file mode 100644 index 00000000..51cfc324 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/generated_product_schema.json @@ -0,0 +1,47 @@ 
+{ + "name": "Amazon Product Search Results", + "baseSelector": "div[data-component-type='s-impression-counter']", + "fields": [ + { + "name": "title", + "selector": "h2.a-size-base-plus.a-spacing-none.a-color-base.a-text-normal span", + "type": "text" + }, + { + "name": "price", + "selector": "span.a-price > span.a-offscreen", + "type": "text" + }, + { + "name": "rating", + "selector": "i.a-icon-star-small span.a-icon-alt", + "type": "text" + }, + { + "name": "number_of_reviews", + "selector": "a.a-link-normal.s-underline-text span.a-size-base", + "type": "text" + }, + { + "name": "delivery_info", + "selector": "div[data-cy='delivery-recipe'] span.a-color-base", + "type": "text" + }, + { + "name": "product_url", + "selector": "a.a-link-normal.s-no-outline", + "type": "attribute", + "attribute": "href" + }, + { + "name": "sponsored", + "selector": "span.puis-label-popover-default span.a-color-secondary", + "type": "text" + }, + { + "name": "small_business_badge", + "selector": "span.a-size-base.a-color-base", + "type": "text" + } + ] +} \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/generated_search_script.js b/docs/examples/c4a_script/amazon_example/generated_search_script.js new file mode 100644 index 00000000..e5f57851 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/generated_search_script.js @@ -0,0 +1,9 @@ +const searchBox = document.querySelector('#twotabsearchtextbox'); +const searchButton = document.querySelector('#nav-search-submit-button'); + +if (searchBox && searchButton) { + searchBox.focus(); + searchBox.value = ''; + searchBox.value = 'r2d2'; + searchButton.click(); +} \ No newline at end of file diff --git a/docs/examples/c4a_script/amazon_example/header.html b/docs/examples/c4a_script/amazon_example/header.html new file mode 100644 index 00000000..fff08cc6 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/header.html @@ -0,0 +1,214 @@ + \ No newline at end of file diff --git 
a/docs/examples/c4a_script/amazon_example/product.html b/docs/examples/c4a_script/amazon_example/product.html new file mode 100644 index 00000000..d68a7404 --- /dev/null +++ b/docs/examples/c4a_script/amazon_example/product.html @@ -0,0 +1,206 @@ +
\ No newline at end of file diff --git a/docs/examples/c4a_script/generate_script_hello_world.py b/docs/examples/c4a_script/generate_script_hello_world.py new file mode 100644 index 00000000..66bb8cdc --- /dev/null +++ b/docs/examples/c4a_script/generate_script_hello_world.py @@ -0,0 +1,89 @@ +#!/usr/bin/env python3 +""" +Hello World Example: LLM-Generated C4A-Script + +This example shows how to use the new generate_script() function to automatically +create C4A-Script automation from natural language descriptions and HTML. +""" + +from crawl4ai.script.c4a_compile import C4ACompiler + +def main(): + print("🤖 C4A-Script Generation Hello World") + print("=" * 50) + + # Example 1: Simple login form + html = """ + + +
+ + + +
+ + + """ + + goal = "Fill in email 'user@example.com', password 'secret123', and submit the form" + + print("📝 Goal:", goal) + print("🌐 HTML: Simple login form") + print() + + # Generate C4A-Script + print("🔧 Generated C4A-Script:") + print("-" * 30) + c4a_script = C4ACompiler.generate_script( + html=html, + query=goal, + mode="c4a" + ) + print(c4a_script) + print() + + # Generate JavaScript + print("🔧 Generated JavaScript:") + print("-" * 30) + js_script = C4ACompiler.generate_script( + html=html, + query=goal, + mode="js" + ) + print(js_script) + print() + + # Example 2: Simple button click + html2 = """ + + +
+

Welcome!

+ +
+ + + """ + + goal2 = "Click the 'Get Started' button" + + print("=" * 50) + print("📝 Goal:", goal2) + print("🌐 HTML: Simple button") + print() + + print("🔧 Generated C4A-Script:") + print("-" * 30) + c4a_script2 = C4ACompiler.generate_script( + html=html2, + query=goal2, + mode="c4a" + ) + print(c4a_script2) + print() + + print("✅ Done! The LLM automatically converted natural language goals") + print(" into executable automation scripts.") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/extracted_repositories.json b/docs/examples/c4a_script/github_search/extracted_repositories.json new file mode 100644 index 00000000..ddd0c3ec --- /dev/null +++ b/docs/examples/c4a_script/github_search/extracted_repositories.json @@ -0,0 +1,111 @@ +[ + { + "repository_name": "unclecode/crawl4ai", + "repository_owner": "unclecode/crawl4ai", + "repository_url": "/unclecode/crawl4ai", + "description": "\ud83d\ude80\ud83e\udd16Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. 
Don't be shy, join here:https://discord.gg/jP8KfhDhyN", + "primary_language": "Python", + "star_count": "45.1k", + "topics": [], + "last_updated": "23 hours ago" + }, + { + "repository_name": "coleam00/mcp-crawl4ai-rag", + "repository_owner": "coleam00/mcp-crawl4ai-rag", + "repository_url": "/coleam00/mcp-crawl4ai-rag", + "description": "Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants", + "primary_language": "Python", + "star_count": "748", + "topics": [], + "last_updated": "yesterday" + }, + { + "repository_name": "pdichone/crawl4ai-rag-system", + "repository_owner": "pdichone/crawl4ai-rag-system", + "repository_url": "/pdichone/crawl4ai-rag-system", + "primary_language": "Python", + "star_count": "44", + "topics": [], + "last_updated": "on 21 Jan" + }, + { + "repository_name": "weidwonder/crawl4ai-mcp-server", + "repository_owner": "weidwonder/crawl4ai-mcp-server", + "repository_url": "/weidwonder/crawl4ai-mcp-server", + "description": "\u7528\u4e8e\u63d0\u4f9b\u7ed9\u672c\u5730\u5f00\u53d1\u8005\u7684 LLM\u7684\u9ad8\u6548\u4e92\u8054\u7f51\u641c\u7d22&\u5185\u5bb9\u83b7\u53d6\u7684MCP Server\uff0c \u8282\u7701\u4f60\u7684token", + "primary_language": "Python", + "star_count": "87", + "topics": [], + "last_updated": "24 days ago" + }, + { + "repository_name": "leonardogrig/crawl4ai-deepseek-example", + "repository_owner": "leonardogrig/crawl4ai-deepseek-example", + "repository_url": "/leonardogrig/crawl4ai-deepseek-example", + "primary_language": "Python", + "star_count": "29", + "topics": [], + "last_updated": "on 18 Jan" + }, + { + "repository_name": "laurentvv/crawl4ai-mcp", + "repository_owner": "laurentvv/crawl4ai-mcp", + "repository_url": "/laurentvv/crawl4ai-mcp", + "description": "Web crawling tool that integrates with AI assistants via the MCP", + "primary_language": "Python", + "star_count": "10", + "topics": [ + {}, + {}, + {}, + {}, + {} + ], + "last_updated": "on 16 Mar" + }, + { + "repository_name": 
"kaymen99/ai-web-scraper", + "repository_owner": "kaymen99/ai-web-scraper", + "repository_url": "/kaymen99/ai-web-scraper", + "description": "AI web scraper built withCrawl4AIfor extracting structured leads data from websites.", + "primary_language": "Python", + "star_count": "30", + "topics": [ + {}, + {}, + {}, + {}, + {} + ], + "last_updated": "on 13 Feb" + }, + { + "repository_name": "atakkant/ai_web_crawler", + "repository_owner": "atakkant/ai_web_crawler", + "repository_url": "/atakkant/ai_web_crawler", + "description": "crawl4ai, DeepSeek, Groq", + "primary_language": "Python", + "star_count": "9", + "topics": [], + "last_updated": "on 19 Feb" + }, + { + "repository_name": "Croups/auto-scraper-with-llms", + "repository_owner": "Croups/auto-scraper-with-llms", + "repository_url": "/Croups/auto-scraper-with-llms", + "description": "Web scraping AI that leverages thecrawl4ailibrary to extract structured data from web pages using various large language models (LLMs).", + "primary_language": "Python", + "star_count": "49", + "topics": [], + "last_updated": "on 8 Apr" + }, + { + "repository_name": "leonardogrig/crawl4ai_llm_examples", + "repository_owner": "leonardogrig/crawl4ai_llm_examples", + "repository_url": "/leonardogrig/crawl4ai_llm_examples", + "primary_language": "Python", + "star_count": "8", + "topics": [], + "last_updated": "on 29 Jan" + } +] \ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/generated_result_schema.json b/docs/examples/c4a_script/github_search/generated_result_schema.json new file mode 100644 index 00000000..ece023a4 --- /dev/null +++ b/docs/examples/c4a_script/github_search/generated_result_schema.json @@ -0,0 +1,66 @@ +{ + "name": "GitHub Repository Cards", + "baseSelector": "div.Box-sc-g0xbh4-0.iwUbcA", + "fields": [ + { + "name": "repository_name", + "selector": "div.search-title a span", + "type": "text", + "transform": "strip" + }, + { + "name": "repository_owner", + "selector": "div.search-title a 
span", + "type": "text", + "transform": "split", + "pattern": "/" + }, + { + "name": "repository_url", + "selector": "div.search-title a", + "type": "attribute", + "attribute": "href", + "transform": "prepend", + "pattern": "https://github.com" + }, + { + "name": "description", + "selector": "div.dcdlju span", + "type": "text" + }, + { + "name": "primary_language", + "selector": "ul.bZkODq li span[aria-label]", + "type": "text" + }, + { + "name": "star_count", + "selector": "ul.bZkODq li a[href*='stargazers'] span", + "type": "text", + "transform": "strip" + }, + { + "name": "topics", + "type": "list", + "selector": "div.jgRnBg div a", + "fields": [ + { + "name": "topic_name", + "selector": "a", + "type": "text" + } + ] + }, + { + "name": "last_updated", + "selector": "ul.bZkODq li span[title]", + "type": "text" + }, + { + "name": "has_sponsor_button", + "selector": "button[aria-label*='Sponsor']", + "type": "text", + "transform": "exists" + } + ] +} \ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/generated_search_script.js b/docs/examples/c4a_script/github_search/generated_search_script.js new file mode 100644 index 00000000..7bc2818b --- /dev/null +++ b/docs/examples/c4a_script/github_search/generated_search_script.js @@ -0,0 +1,39 @@ +(async () => { + const waitForElement = (selector, timeout = 10000) => new Promise((resolve, reject) => { + const el = document.querySelector(selector); + if (el) return resolve(el); + const observer = new MutationObserver(() => { + const el = document.querySelector(selector); + if (el) { + observer.disconnect(); + resolve(el); + } + }); + observer.observe(document.body, { childList: true, subtree: true }); + setTimeout(() => { + observer.disconnect(); + reject(new Error(`Timeout waiting for ${selector}`)); + }, timeout); + }); + + try { + const searchInput = await waitForElement('#adv_code_search input[type="text"]'); + searchInput.value = 'crawl4AI'; + searchInput.dispatchEvent(new Event('input', { 
bubbles: true })); + + const languageSelect = await waitForElement('#search_language'); + languageSelect.value = 'Python'; + languageSelect.dispatchEvent(new Event('change', { bubbles: true })); + + const starsInput = await waitForElement('#search_stars'); + starsInput.value = '>10000'; + starsInput.dispatchEvent(new Event('input', { bubbles: true })); + + const searchButton = await waitForElement('#adv_code_search button[type="submit"]'); + searchButton.click(); + + await waitForElement('.codesearch-results, #search-results'); + } catch (e) { + console.error('Search script failed:', e.message); + } +})(); \ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/github_search_crawler.py b/docs/examples/c4a_script/github_search/github_search_crawler.py new file mode 100644 index 00000000..71b936e3 --- /dev/null +++ b/docs/examples/c4a_script/github_search/github_search_crawler.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python3 +""" +GitHub Advanced Search Example using Crawl4AI + +This example demonstrates: +1. Using LLM to generate C4A-Script from HTML snippets +2. Single arun() call with navigation, search form filling, and extraction +3. JSON CSS extraction for structured repository data +4. 
Complete workflow: navigate → fill form → submit → extract results + +Requirements: +- Crawl4AI with generate_script support +- LLM API key (configured in environment) +""" + +import asyncio +import json +import os +from pathlib import Path +from typing import List, Dict, Any + +from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode +from crawl4ai.extraction_strategy import JsonCssExtractionStrategy +from crawl4ai.script.c4a_compile import C4ACompiler + + +class GitHubSearchScraper: + def __init__(self): + self.base_dir = Path(__file__).parent + self.search_script_path = self.base_dir / "generated_search_script.js" + self.schema_path = self.base_dir / "generated_result_schema.json" + self.results_path = self.base_dir / "extracted_repositories.json" + self.session_id = "github_search_session" + + async def generate_search_script(self) -> str: + """Generate JavaScript for GitHub advanced search interaction""" + print("🔧 Generating search script from search_form.html...") + + # Check if already generated + if self.search_script_path.exists(): + print("✅ Using cached search script") + return self.search_script_path.read_text() + + # Read the search form HTML + search_form_html = (self.base_dir / "search_form.html").read_text() + + # Generate script using LLM + search_goal = """ + Search for crawl4AI repositories written in Python with more than 10000 stars: + 1. Wait for the main search input to be visible + 2. Type "crawl4AI" into the main search box + 3. Select "Python" from the language dropdown (#search_language) + 4. Type ">10000" into the stars input field (#search_stars) + 5. Click the search button to submit the form + 6. 
Wait for the search results to appear + """ + + try: + script = C4ACompiler.generate_script( + html=search_form_html, + query=search_goal, + mode="js" + ) + + # Save for future use + self.search_script_path.write_text(script) + print("✅ Search script generated and saved!") + print(f"📄 Script preview:\n{script[:500]}...") + return script + + except Exception as e: + print(f"❌ Error generating search script: {e}") + raise + + + async def generate_result_schema(self) -> Dict[str, Any]: + """Generate JSON CSS extraction schema from result HTML""" + print("\n🔧 Generating result extraction schema...") + + # Check if already generated + if self.schema_path.exists(): + print("✅ Using cached extraction schema") + return json.loads(self.schema_path.read_text()) + + # Read the result HTML + result_html = (self.base_dir / "result.html").read_text() + + # Generate extraction schema using LLM + schema_goal = """ + Create a JSON CSS extraction schema to extract from each repository card: + - Repository name (the repository name only, not including owner) + - Repository owner (organization or username) + - Repository URL (full GitHub URL) + - Description + - Primary programming language + - Star count (numeric value) + - Topics/tags (array of topic names) + - Last updated (time ago string) + - Whether it has a sponsor button + + The schema should handle multiple repository results on the search results page. 
+ """ + + try: + # Generate schema + schema = JsonCssExtractionStrategy.generate_schema( + html=result_html, + query=schema_goal, + ) + + # Save for future use + self.schema_path.write_text(json.dumps(schema, indent=2)) + print("✅ Extraction schema generated and saved!") + print(f"📄 Schema fields: {[f['name'] for f in schema['fields']]}") + return schema + + except Exception as e: + print(f"❌ Error generating schema: {e}") + raise + + async def crawl_github(self): + """Main crawling logic with single arun() call""" + print("\n🚀 Starting GitHub repository search...") + + # Generate scripts and schemas + search_script = await self.generate_search_script() + result_schema = await self.generate_result_schema() + + # Configure browser (headless=False to see the action) + browser_config = BrowserConfig( + headless=False, + verbose=True, + viewport_width=1920, + viewport_height=1080 + ) + + async with AsyncWebCrawler(config=browser_config) as crawler: + print("\n📍 Navigating to GitHub advanced search and executing search...") + + # Single call: Navigate, execute search, and extract results + search_config = CrawlerRunConfig( + session_id=self.session_id, + js_code=search_script, # Execute generated JS + # wait_for="[data-testid='results-list']", # Wait for search results + wait_for=".Box-sc-g0xbh4-0.iwUbcA", # Wait for search results + extraction_strategy=JsonCssExtractionStrategy(schema=result_schema), + delay_before_return_html=3.0, # Give time for results to fully load + cache_mode=CacheMode.BYPASS # Don't cache for fresh results + ) + + result = await crawler.arun( + url="https://github.com/search/advanced", + config=search_config + ) + + if not result.success: + print("❌ Failed to search GitHub") + print(f"Error: {result.error_message}") + return + + print("✅ Search and extraction completed successfully!") + + # Extract and save results + if result.extracted_content: + repositories = json.loads(result.extracted_content) + print(f"\n🔍 Found {len(repositories)} 
repositories matching criteria") + + # Save results + self.results_path.write_text( + json.dumps(repositories, indent=2) + ) + print(f"💾 Results saved to: {self.results_path}") + + # Print sample results (keys match the field names in generated_result_schema.json) + print("\n📊 Sample Results:") + for i, repo in enumerate(repositories[:5], 1): + print(f"\n{i}. {repo.get('repository_name', 'Unknown')}") + print(f" Owner: {repo.get('repository_owner', 'Unknown')}") + print(f" Description: {repo.get('description', 'No description')[:80]}...") + print(f" Language: {repo.get('primary_language', 'Unknown')}") + print(f" Stars: {repo.get('star_count', 'Unknown')}") + print(f" Updated: {repo.get('last_updated', 'Unknown')}") + # Each topic is a nested dict keyed by 'topic_name'; skip empty entries + topic_names = [t.get('topic_name') for t in repo.get('topics', []) if t.get('topic_name')] + if topic_names: + print(f" Topics: {', '.join(topic_names[:5])}") + print(f" URL: {repo.get('repository_url', 'Unknown')}") + + else: + print("❌ No repositories extracted") + + # Save screenshot for reference + if result.screenshot: + screenshot_path = self.base_dir / "search_results_screenshot.png" + with open(screenshot_path, "wb") as f: + f.write(result.screenshot) + print(f"\n📸 Screenshot saved to: {screenshot_path}") + + +async def main(): + """Run the GitHub search scraper""" + scraper = GitHubSearchScraper() + await scraper.crawl_github() + + print("\n🎉 GitHub search example completed!") + print("Check the generated files:") + print(" - generated_search_script.js") + print(" - generated_result_schema.json") + print(" - extracted_repositories.json") + print(" - search_results_screenshot.png") + + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/result.html b/docs/examples/c4a_script/github_search/result.html new file mode 100644 index 00000000..69ee3651 --- /dev/null +++ b/docs/examples/c4a_script/github_search/result.html @@ -0,0 +1,54 @@ +

+ <!-- result.html: sample repository result card. Text content: "All Algorithms implemented in Python"; "Python"; "201k"; "Updated 4 days ago"; "Sponsor TheAlgorithms/Python"; "External links" -->
\ No newline at end of file diff --git a/docs/examples/c4a_script/github_search/search_form.html b/docs/examples/c4a_script/github_search/search_form.html new file mode 100644 index 00000000..cc7b083e --- /dev/null +++ b/docs/examples/c4a_script/github_search/search_form.html @@ -0,0 +1,336 @@ +
+ <!-- search_form.html: GitHub advanced search form. Section headings: "Advanced search"; "Advanced options"; "Repositories options"; "Code options"; "Issues options"; "Users options"; "Wiki options" -->
\ No newline at end of file diff --git a/docs/examples/c4a_script/examples/add_to_cart.c4a b/docs/examples/c4a_script/script_samples/add_to_cart.c4a similarity index 100% rename from docs/examples/c4a_script/examples/add_to_cart.c4a rename to docs/examples/c4a_script/script_samples/add_to_cart.c4a diff --git a/docs/examples/c4a_script/examples/advanced_control_flow.c4a b/docs/examples/c4a_script/script_samples/advanced_control_flow.c4a similarity index 100% rename from docs/examples/c4a_script/examples/advanced_control_flow.c4a rename to docs/examples/c4a_script/script_samples/advanced_control_flow.c4a diff --git a/docs/examples/c4a_script/examples/conditional_login.c4a b/docs/examples/c4a_script/script_samples/conditional_login.c4a similarity index 100% rename from docs/examples/c4a_script/examples/conditional_login.c4a rename to docs/examples/c4a_script/script_samples/conditional_login.c4a diff --git a/docs/examples/c4a_script/examples/data_extraction.c4a b/docs/examples/c4a_script/script_samples/data_extraction.c4a similarity index 100% rename from docs/examples/c4a_script/examples/data_extraction.c4a rename to docs/examples/c4a_script/script_samples/data_extraction.c4a diff --git a/docs/examples/c4a_script/examples/fill_contact.c4a b/docs/examples/c4a_script/script_samples/fill_contact.c4a similarity index 100% rename from docs/examples/c4a_script/examples/fill_contact.c4a rename to docs/examples/c4a_script/script_samples/fill_contact.c4a diff --git a/docs/examples/c4a_script/examples/load_more_content.c4a b/docs/examples/c4a_script/script_samples/load_more_content.c4a similarity index 100% rename from docs/examples/c4a_script/examples/load_more_content.c4a rename to docs/examples/c4a_script/script_samples/load_more_content.c4a diff --git a/docs/examples/c4a_script/examples/login_flow.c4a b/docs/examples/c4a_script/script_samples/login_flow.c4a similarity index 100% rename from docs/examples/c4a_script/examples/login_flow.c4a rename to 
docs/examples/c4a_script/script_samples/login_flow.c4a diff --git a/docs/examples/c4a_script/examples/multi_step_workflow.c4a b/docs/examples/c4a_script/script_samples/multi_step_workflow.c4a similarity index 100% rename from docs/examples/c4a_script/examples/multi_step_workflow.c4a rename to docs/examples/c4a_script/script_samples/multi_step_workflow.c4a diff --git a/docs/examples/c4a_script/examples/navigate_tabs.c4a b/docs/examples/c4a_script/script_samples/navigate_tabs.c4a similarity index 100% rename from docs/examples/c4a_script/examples/navigate_tabs.c4a rename to docs/examples/c4a_script/script_samples/navigate_tabs.c4a diff --git a/docs/examples/c4a_script/examples/quick_login.c4a b/docs/examples/c4a_script/script_samples/quick_login.c4a similarity index 100% rename from docs/examples/c4a_script/examples/quick_login.c4a rename to docs/examples/c4a_script/script_samples/quick_login.c4a diff --git a/docs/examples/c4a_script/examples/responsive_actions.c4a b/docs/examples/c4a_script/script_samples/responsive_actions.c4a similarity index 100% rename from docs/examples/c4a_script/examples/responsive_actions.c4a rename to docs/examples/c4a_script/script_samples/responsive_actions.c4a diff --git a/docs/examples/c4a_script/examples/scroll_and_click.c4a b/docs/examples/c4a_script/script_samples/scroll_and_click.c4a similarity index 100% rename from docs/examples/c4a_script/examples/scroll_and_click.c4a rename to docs/examples/c4a_script/script_samples/scroll_and_click.c4a diff --git a/docs/examples/c4a_script/examples/search_product.c4a b/docs/examples/c4a_script/script_samples/search_product.c4a similarity index 100% rename from docs/examples/c4a_script/examples/search_product.c4a rename to docs/examples/c4a_script/script_samples/search_product.c4a diff --git a/docs/examples/c4a_script/examples/simple_form.c4a b/docs/examples/c4a_script/script_samples/simple_form.c4a similarity index 100% rename from docs/examples/c4a_script/examples/simple_form.c4a rename to 
docs/examples/c4a_script/script_samples/simple_form.c4a diff --git a/docs/examples/c4a_script/examples/smart_form_fill.c4a b/docs/examples/c4a_script/script_samples/smart_form_fill.c4a similarity index 100% rename from docs/examples/c4a_script/examples/smart_form_fill.c4a rename to docs/examples/c4a_script/script_samples/smart_form_fill.c4a diff --git a/docs/examples/c4a_script/tutorial/README.md b/docs/examples/c4a_script/tutorial/README.md index 088435af..81f855ee 100644 --- a/docs/examples/c4a_script/tutorial/README.md +++ b/docs/examples/c4a_script/tutorial/README.md @@ -1,17 +1,37 @@ # C4A-Script Interactive Tutorial -Welcome to the C4A-Script Interactive Tutorial! This hands-on tutorial teaches you how to write web automation scripts using C4A-Script, a domain-specific language for Crawl4AI. +A comprehensive web-based tutorial for learning and experimenting with C4A-Script - Crawl4AI's visual web automation language. ## 🚀 Quick Start -### 1. Start the Tutorial Server +### Prerequisites +- Python 3.7+ +- Modern web browser (Chrome, Firefox, Safari, Edge) -```bash -cd docs/examples/c4a_script/tutorial -python server.py -``` +### Running the Tutorial -Then open your browser to: http://localhost:8080 +1. **Clone and Navigate** + ```bash + git clone https://github.com/unclecode/crawl4ai.git + cd crawl4ai/docs/examples/c4a_script/tutorial/ + ``` + +2. **Install Dependencies** + ```bash + pip install flask + ``` + +3. **Launch the Server** + ```bash + python server.py + ``` + +4. **Open in Browser** + ``` + http://localhost:8080 + ``` + +**🌐 Try Online**: [Live Demo](https://docs.crawl4ai.com/c4a-script/demo) ### 2. 
Try Your First Script @@ -23,7 +43,16 @@ IF (EXISTS `.cookie-banner`) THEN CLICK `.accept` CLICK `#start-tutorial` ``` -## 📚 What You'll Learn +## 🎯 What You'll Learn + +### Core Features +- **📝 Text Editor**: Write C4A-Script with syntax highlighting +- **🧩 Visual Editor**: Build scripts using drag-and-drop Blockly interface +- **🎬 Recording Mode**: Capture browser actions and auto-generate scripts +- **⚡ Live Execution**: Run scripts in real-time with instant feedback +- **📊 Timeline View**: Visualize and edit automation steps + +## 📚 Tutorial Content ### Basic Commands - **Navigation**: `GO url` @@ -237,10 +266,131 @@ Check the `scripts/` folder for complete examples: - `04-multi-step-form.c4a` - Complex forms - `05-complex-workflow.c4a` - Full automation +## 🏗️ Developer Guide + +### Project Architecture + +``` +tutorial/ +├── server.py # Flask application server +├── assets/ # Tutorial-specific assets +│ ├── app.js # Main application logic +│ ├── c4a-blocks.js # Custom Blockly blocks +│ ├── c4a-generator.js # Code generation +│ ├── blockly-manager.js # Blockly integration +│ └── styles.css # Main styling +├── playground/ # Interactive demo environment +│ ├── index.html # Demo web application +│ ├── app.js # Demo app logic +│ └── styles.css # Demo styling +├── scripts/ # Example C4A scripts +└── index.html # Main tutorial interface +``` + +### Key Components + +#### 1. TutorialApp (`assets/app.js`) +Main application controller managing: +- Code editor integration (CodeMirror) +- Script execution and browser preview +- Tutorial navigation and lessons +- State management and persistence + +#### 2. BlocklyManager (`assets/blockly-manager.js`) +Visual programming interface: +- Custom C4A-Script block definitions +- Bidirectional sync between visual blocks and text +- Real-time code generation +- Dark theme integration + +#### 3. 
Recording System +Powers the recording functionality: +- Browser event capture +- Smart event grouping and filtering +- Automatic C4A-Script generation +- Timeline visualization + +### Customization + +#### Adding New Commands +1. **Define Block** (`assets/c4a-blocks.js`) +2. **Add Generator** (`assets/c4a-generator.js`) +3. **Update Parser** (`assets/blockly-manager.js`) + +#### Themes and Styling +- Main styles: `assets/styles.css` +- Theme variables: CSS custom properties +- Dark mode: Auto-applied based on system preference + +### Configuration +```python +# server.py configuration +PORT = 8080 +DEBUG = True +THREADED = True +``` + +### API Endpoints +- `GET /` - Main tutorial interface +- `GET /playground/` - Interactive demo environment +- `POST /execute` - Script execution endpoint +- `GET /examples/ + + + + + + + \ No newline at end of file diff --git a/docs/examples/c4a_script/tutorial/test_blockly.html b/docs/examples/c4a_script/tutorial/test_blockly.html new file mode 100644 index 00000000..0bf9f066 --- /dev/null +++ b/docs/examples/c4a_script/tutorial/test_blockly.html @@ -0,0 +1,69 @@ + + + + + + Blockly Test + + + +

+ <!-- test_blockly.html body. Text content: "C4A-Script Blockly Test"; "Generated C4A-Script:" -->
\ No newline at end of file diff --git a/docs/md_v2/api/c4a-script-reference.md b/docs/md_v2/api/c4a-script-reference.md new file mode 100644 index 00000000..57de39b2 --- /dev/null +++ b/docs/md_v2/api/c4a-script-reference.md @@ -0,0 +1,992 @@ +# C4A-Script API Reference + +Complete reference for all C4A-Script commands, syntax, and advanced features. + +## Command Categories + +### 🧭 Navigation Commands + +Navigate between pages and manage browser history. + +#### `GO <url>` +Navigate to a specific URL. + +**Syntax:** +```c4a +GO <url> +``` + +**Parameters:** +- `url` - Target URL (string) + +**Examples:** +```c4a +GO https://example.com +GO https://api.example.com/login +GO /relative/path +``` + +**Notes:** +- Supports both absolute and relative URLs +- Automatically handles protocol detection +- Waits for page load to complete + +--- + +#### `RELOAD` +Refresh the current page. + +**Syntax:** +```c4a +RELOAD +``` + +**Examples:** +```c4a +RELOAD +``` + +**Notes:** +- Equivalent to pressing F5 or clicking browser refresh +- Waits for page reload to complete +- Preserves current URL + +--- + +#### `BACK` +Navigate back in browser history. + +**Syntax:** +```c4a +BACK +``` + +**Examples:** +```c4a +BACK +``` + +**Notes:** +- Equivalent to clicking browser back button +- Does nothing if no previous page exists +- Waits for navigation to complete + +--- + +#### `FORWARD` +Navigate forward in browser history. + +**Syntax:** +```c4a +FORWARD +``` + +**Examples:** +```c4a +FORWARD +``` + +**Notes:** +- Equivalent to clicking browser forward button +- Does nothing if no next page exists +- Waits for navigation to complete + +### ⏱️ Wait Commands + +Control timing and synchronization with page elements. + +#### `WAIT