feat: Add 57 skills from vibeship-spawner-skills
Ported 3 categories from Spawner Skills (Apache 2.0): - AI Agents (21 skills): langfuse, langgraph, crewai, rag-engineer, etc. - Integrations (25 skills): stripe, firebase, vercel, supabase, etc. - Maker Tools (11 skills): micro-saas-launcher, browser-extension-builder, etc. All skills converted from 4-file YAML to SKILL.md format. Source: https://github.com/vibeforge1111/vibeship-spawner-skills
This commit is contained in:
315
skills/computer-use-agents/SKILL.md
Normal file
315
skills/computer-use-agents/SKILL.md
Normal file
@@ -0,0 +1,315 @@
|
||||
---
|
||||
name: computer-use-agents
|
||||
description: "Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control. Use when: computer use, desktop automation agent, screen control AI, vision-based agent, GUI automation."
|
||||
source: vibeship-spawner-skills (Apache 2.0)
|
||||
---
|
||||
|
||||
# Computer Use Agents
|
||||
|
||||
## Patterns
|
||||
|
||||
### Perception-Reasoning-Action Loop
|
||||
|
||||
The fundamental architecture of computer use agents: observe screen,
|
||||
reason about next action, execute action, repeat. This loop integrates
|
||||
vision models with action execution through an iterative pipeline.
|
||||
|
||||
Key components:
|
||||
1. PERCEPTION: Screenshot captures current screen state
|
||||
2. REASONING: Vision-language model analyzes and plans
|
||||
3. ACTION: Execute mouse/keyboard operations
|
||||
4. FEEDBACK: Observe result, continue or correct
|
||||
|
||||
Critical insight: Vision agents are completely still during "thinking"
|
||||
phase (1-5 seconds), creating a detectable pause pattern.
|
||||
|
||||
|
||||
**When to use**: ['Building any computer use agent from scratch', 'Integrating vision models with desktop control', 'Understanding agent behavior patterns']
|
||||
|
||||
```python
|
||||
from anthropic import Anthropic
|
||||
from PIL import Image
|
||||
import base64
|
||||
import pyautogui
|
||||
import time
|
||||
|
||||
class ComputerUseAgent:
|
||||
"""
|
||||
Perception-Reasoning-Action loop implementation.
|
||||
Based on Anthropic Computer Use patterns.
|
||||
"""
|
||||
|
||||
def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
|
||||
self.client = client
|
||||
self.model = model
|
||||
self.max_steps = 50 # Prevent runaway loops
|
||||
self.action_delay = 0.5 # Seconds between actions
|
||||
|
||||
def capture_screenshot(self) -> str:
|
||||
"""Capture screen and return base64 encoded image."""
|
||||
screenshot = pyautogui.screenshot()
|
||||
# Resize for token efficiency (1280x800 is good balance)
|
||||
screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
|
||||
|
||||
import io
|
||||
buffer = io.BytesIO()
|
||||
screenshot.save(buffer, format="PNG")
|
||||
return base64.b64encode(buffer.getvalue()).decode()
|
||||
|
||||
def execute_action(self, action: dict) -> dict:
|
||||
"""Execute mouse/keyboard action on the computer."""
|
||||
action_type = action.get("type")
|
||||
|
||||
if action_type == "click":
|
||||
x, y = action["x"], action["y"]
|
||||
button = action.get("button", "left")
|
||||
pyautogui.click(x, y, button=button)
|
||||
return {"success": True, "action": f"clicked at ({x}, {y})"}
|
||||
|
||||
elif action_type == "type":
|
||||
text = action["text"]
|
||||
pyautogui.typewrite(text, interval=0.02)
|
||||
return {"success": True, "action": f"typed {len(text)} chars"}
|
||||
|
||||
elif action_type == "key":
|
||||
key = action["key"]
|
||||
pyautogui.press(key)
|
||||
return {"success": True, "action": f"pressed {key}"}
|
||||
|
||||
elif action_type == "scroll":
|
||||
direction = action.get("direction", "down")
|
||||
amount = action.get("amount", 3)
|
||||
scroll = -amount if direction == "down" else amount
|
||||
pyautogui.scroll(scroll)
|
||||
return {"success": True, "action": f"scrolled {dir
|
||||
```
|
||||
|
||||
### Sandboxed Environment Pattern
|
||||
|
||||
Computer use agents MUST run in isolated, sandboxed environments.
|
||||
Never give agents direct access to your main system - the security
|
||||
risks are too high. Use Docker containers with virtual desktops.
|
||||
|
||||
Key isolation requirements:
|
||||
1. NETWORK: Restrict to necessary endpoints only
|
||||
2. FILESYSTEM: Read-only or scoped to temp directories
|
||||
3. CREDENTIALS: No access to host credentials
|
||||
4. SYSCALLS: Filter dangerous system calls
|
||||
5. RESOURCES: Limit CPU, memory, time
|
||||
|
||||
The goal is "blast radius minimization" - if the agent goes wrong,
|
||||
damage is contained to the sandbox.
|
||||
|
||||
|
||||
**When to use**: ['Deploying any computer use agent', 'Testing agent behavior safely', 'Running untrusted automation tasks']
|
||||
|
||||
```python
|
||||
# Dockerfile for sandboxed computer use environment
|
||||
# Based on Anthropic's reference implementation pattern
|
||||
|
||||
FROM ubuntu:22.04
|
||||
|
||||
# Install desktop environment
|
||||
RUN apt-get update && apt-get install -y \
|
||||
xvfb \
|
||||
x11vnc \
|
||||
fluxbox \
|
||||
xterm \
|
||||
firefox \
|
||||
python3 \
|
||||
python3-pip \
|
||||
supervisor
|
||||
|
||||
# Security: Create non-root user
|
||||
RUN useradd -m -s /bin/bash agent && \
|
||||
mkdir -p /home/agent/.vnc
|
||||
|
||||
# Install Python dependencies
|
||||
COPY requirements.txt /tmp/
|
||||
RUN pip3 install -r /tmp/requirements.txt
|
||||
|
||||
# Security: Drop capabilities
|
||||
RUN apt-get install -y --no-install-recommends libcap2-bin && \
|
||||
setcap -r /usr/bin/python3 || true
|
||||
|
||||
# Copy agent code
|
||||
COPY --chown=agent:agent . /app
|
||||
WORKDIR /app
|
||||
|
||||
# Supervisor config for virtual display + VNC
|
||||
COPY supervisord.conf /etc/supervisor/conf.d/
|
||||
|
||||
# Expose VNC port only (not desktop directly)
|
||||
EXPOSE 5900
|
||||
|
||||
# Run as non-root
|
||||
USER agent
|
||||
|
||||
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
|
||||
|
||||
---
|
||||
|
||||
# docker-compose.yml with security constraints
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
computer-use-agent:
|
||||
build: .
|
||||
ports:
|
||||
- "5900:5900" # VNC for observation
|
||||
- "8080:8080" # API for control
|
||||
|
||||
# Security constraints
|
||||
security_opt:
|
||||
- no-new-privileges:true
|
||||
- seccomp:seccomp-profile.json
|
||||
|
||||
# Resource limits
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
cpus: '2'
|
||||
memory: 4G
|
||||
reservations:
|
||||
cpus: '0.5'
|
||||
memory: 1G
|
||||
|
||||
# Network isolation
|
||||
networks:
|
||||
- agent-network
|
||||
|
||||
# No access to host filesystem
|
||||
volumes:
|
||||
- agent-tmp:/tmp
|
||||
|
||||
# Read-only root filesystem
|
||||
read_only: true
|
||||
tmpfs:
|
||||
- /run
|
||||
- /var/run
|
||||
|
||||
# Environment
|
||||
environment:
|
||||
- DISPLAY=:99
|
||||
- NO_PROXY=localhost
|
||||
|
||||
networks:
|
||||
agent-network:
|
||||
driver: bridge
|
||||
internal: true # No internet by default
|
||||
|
||||
volumes:
|
||||
agent-tmp:
|
||||
|
||||
---
|
||||
|
||||
# Python wrapper with additional runtime sandboxing
|
||||
import subprocess
|
||||
import os
|
||||
from dataclasses im
|
||||
```
|
||||
|
||||
### Anthropic Computer Use Implementation
|
||||
|
||||
Official implementation pattern using Claude's computer use capability.
|
||||
Claude 3.5 Sonnet was the first frontier model to offer computer use.
|
||||
Claude Opus 4.5 is now the "best model in the world for computer use."
|
||||
|
||||
Key capabilities:
|
||||
- screenshot: Capture current screen state
|
||||
- mouse: Click, move, drag operations
|
||||
- keyboard: Type text, press keys
|
||||
- bash: Run shell commands
|
||||
- text_editor: View and edit files
|
||||
|
||||
Tool versions:
|
||||
- computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
|
||||
- computer_20250124 (All other models): Standard capabilities
|
||||
|
||||
Critical limitation: "Some UI elements (like dropdowns and scrollbars)
|
||||
might be tricky for Claude to manipulate" - Anthropic docs
|
||||
|
||||
|
||||
**When to use**: ['Building production computer use agents', 'Need highest quality vision understanding', 'Full desktop control (not just browser)']
|
||||
|
||||
```python
|
||||
from anthropic import Anthropic
|
||||
from anthropic.types.beta import (
|
||||
BetaToolComputerUse20241022,
|
||||
BetaToolBash20241022,
|
||||
BetaToolTextEditor20241022,
|
||||
)
|
||||
import subprocess
|
||||
import base64
|
||||
from PIL import Image
|
||||
import io
|
||||
|
||||
class AnthropicComputerUse:
|
||||
"""
|
||||
Official Anthropic Computer Use implementation.
|
||||
|
||||
Requires:
|
||||
- Docker container with virtual display
|
||||
- VNC for viewing agent actions
|
||||
- Proper tool implementations
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self.client = Anthropic()
|
||||
self.model = "claude-sonnet-4-20250514" # Best for computer use
|
||||
self.screen_size = (1280, 800)
|
||||
|
||||
def get_tools(self) -> list:
|
||||
"""Define computer use tools."""
|
||||
return [
|
||||
BetaToolComputerUse20241022(
|
||||
type="computer_20241022",
|
||||
name="computer",
|
||||
display_width_px=self.screen_size[0],
|
||||
display_height_px=self.screen_size[1],
|
||||
),
|
||||
BetaToolBash20241022(
|
||||
type="bash_20241022",
|
||||
name="bash",
|
||||
),
|
||||
BetaToolTextEditor20241022(
|
||||
type="text_editor_20241022",
|
||||
name="str_replace_editor",
|
||||
),
|
||||
]
|
||||
|
||||
def execute_tool(self, name: str, input: dict) -> dict:
|
||||
"""Execute a tool and return result."""
|
||||
|
||||
if name == "computer":
|
||||
return self._handle_computer_action(input)
|
||||
elif name == "bash":
|
||||
return self._handle_bash(input)
|
||||
elif name == "str_replace_editor":
|
||||
return self._handle_editor(input)
|
||||
else:
|
||||
return {"error": f"Unknown tool: {name}"}
|
||||
|
||||
def _handle_computer_action(self, input: dict) -> dict:
|
||||
"""Handle computer control actions."""
|
||||
action = input.get("action")
|
||||
|
||||
if action == "screenshot":
|
||||
# Capture via xdotool/scrot
|
||||
subprocess.run(["scrot", "/tmp/screenshot.png"])
|
||||
|
||||
with open("/tmp/screenshot.png", "rb") as f:
|
||||
|
||||
```
|
||||
|
||||
## ⚠️ Sharp Edges
|
||||
|
||||
| Issue | Severity | Solution |
|
||||
|-------|----------|----------|
|
||||
| Issue | critical | ## Defense in depth - no single solution works |
|
||||
| Issue | medium | ## Add human-like variance to actions |
|
||||
| Issue | high | ## Use keyboard alternatives when possible |
|
||||
| Issue | medium | ## Accept the tradeoff |
|
||||
| Issue | high | ## Implement context management |
|
||||
| Issue | high | ## Monitor and limit costs |
|
||||
| Issue | critical | ## ALWAYS use sandboxing |
|
||||
Reference in New Issue
Block a user