Claude Computer Use: Setup, Capabilities, and Practical Limitations
Claude's computer use capability lets Claude control a desktop environment — take screenshots, move the mouse, click, type, and execute commands — to complete tasks that require a graphical interface. As of 2026, it works reliably for structured, well-defined tasks (form filling, data entry, file management) but remains unreliable for tasks requiring complex visual reasoning or multi-step decisions under ambiguity. If you need browser automation with stable HTML selectors, conventional tools like Playwright are faster and more reliable. Computer use is the right choice when no stable API or selector-based approach exists.
What computer use actually is
Computer use is a tool set that Anthropic provides via the API. When enabled, Claude can:
- Take a screenshot to see the current state of the screen
- Move the mouse to a specific (x, y) coordinate
- Click (left, right, double)
- Type text
- Press keyboard shortcuts
- Run terminal commands
Claude observes the screen through screenshots, decides what to do, and calls these tools in sequence. It's not pre-programmed automation — Claude is reasoning about what it sees and determining each action.
When to use computer use vs alternatives
Use computer use when:
- The target application has no API
- Web scraping/Playwright can't reliably identify elements (rendered canvas, legacy Flash-style apps, proprietary desktop software)
- The task requires contextual visual judgement (filling out a form where the fields are dynamic based on previous answers)
- You need to automate a desktop application (Excel, Photoshop, legacy enterprise software)
Use Playwright/Selenium instead when:
- The task is web-based with stable HTML structure
- Speed matters (each screenshot-and-reasoning cycle is slow, roughly 2–8 seconds per action)
- The task is highly repetitive at scale (computer use costs much more per action than Playwright)
Use a direct API instead when:
- The target service has an API (use it — always faster, cheaper, more reliable)
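The checklist above can be read as a short decision procedure. The sketch below is one way to encode it; `TaskProfile` and `choose_approach` are illustrative names, not part of any library, and the three boolean fields are a simplification of the criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of an automation task."""
    has_api: bool       # the target service exposes an API
    is_web: bool        # the task runs in a browser
    stable_html: bool   # selectors are stable enough for Playwright/Selenium


def choose_approach(task: TaskProfile) -> str:
    """Return 'api', 'playwright', or 'computer_use', checked in that order."""
    if task.has_api:
        return "api"            # an API wins whenever one exists
    if task.is_web and task.stable_html:
        return "playwright"     # selector-based automation is faster and cheaper
    return "computer_use"       # last resort: no API, no stable selectors


if __name__ == "__main__":
    legacy_desktop_app = TaskProfile(has_api=False, is_web=False, stable_html=False)
    print(choose_approach(legacy_desktop_app))
```

The order of the checks matters: computer use is the fallback, chosen only after the cheaper options have been ruled out.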
Setup: running computer use with Docker
Anthropic provides a reference implementation using Docker:
# Clone the Anthropic quickstarts repo
git clone https://github.com/anthropics/anthropic-quickstarts
cd anthropic-quickstarts/computer-use-demo
# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# Build the image (includes VNC, a desktop environment, and a browser)
docker build -t computer-use-demo .
# Run the container. Port mappings:
#   5900 = VNC, 8501 = Streamlit UI, 6080 = noVNC (browser-based VNC)
# Note: comments cannot follow a trailing backslash, so they are kept up here.
docker run \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -v "$HOME/.anthropic:/home/computeruse/.anthropic" \
  -p 5900:5900 \
  -p 8501:8501 \
  -p 6080:6080 \
  computer-use-demo
Access the interface at http://localhost:8501 (Streamlit UI) or http://localhost:6080 (browser-based desktop view).
The computer use API call
The core API pattern is simple: include the computer use tools in your tools list and handle tool_use blocks:
import anthropic
import base64  # for your take_screenshot() implementation, which must return base64 PNG data

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_computer_task(task: str) -> str:
    """
    Run a computer use task. Returns the final response text.
    """
    messages = [{"role": "user", "content": task}]
    while True:
        # Computer use is a beta feature, so the request goes through client.beta
        # with a beta flag. Tool versions are tied to the model generation;
        # check the current API docs for the pair that matches your model.
        response = client.beta.messages.create(
            model="claude-opus-4-0",  # Opus is recommended; Sonnet also works (see FAQ)
            max_tokens=4096,
            betas=["computer-use-2025-01-24"],
            tools=[
                {
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": 1366,
                    "display_height_px": 768,
                    "display_number": 1,
                },
                {
                    "type": "bash_20250124",
                    "name": "bash",
                },
                {
                    "type": "text_editor_20250429",
                    "name": "str_replace_based_edit_tool",
                },
            ],
            messages=messages,
        )
        # Anything other than a tool request (end_turn, max_tokens, ...) ends the loop
        if response.stop_reason != "tool_use":
            return next(
                (b.text for b in response.content if b.type == "text"),
                "Task completed.",
            )
        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        # Add assistant message and tool results to continue
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
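def execute_tool(tool_name: str, tool_input: dict) -> str | list[dict]:
    """Execute a tool call and return the tool_result content (text or image blocks)."""
    if tool_name == "computer":
        action = tool_input["action"]
        if action == "screenshot":
            # take_screenshot() is your own implementation; it must return a base64-encoded PNG string
            screenshot_data = take_screenshot()
            return [{
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": screenshot_data},
            }]
        elif action == "left_click":
            click(tool_input["coordinate"])  # your implementation; coordinate is [x, y]
            return "Clicked"
        elif action == "type":
            type_text(tool_input["text"])  # your implementation
            return "Typed"
        # ... handle other actions
    # Report unhandled tools and actions honestly so Claude does not assume they succeeded
    return f"Unsupported tool or action: {tool_name}"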
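The "handle other actions" placeholder can be filled in with a dispatch table. The following is a minimal sketch, assuming each click action carries a `coordinate` and that key presses arrive in a `text` field; `Desktop`, `RecordingDesktop`, and `handle_computer_action` are hypothetical names, and a real backend (xdotool, pyautogui, or similar) would replace the recording fake.

```python
from typing import Callable, Protocol


class Desktop(Protocol):
    """Hypothetical interface to whatever drives the real screen."""

    def move(self, x: int, y: int) -> None: ...
    def click(self, x: int, y: int, button: str = "left", count: int = 1) -> None: ...
    def type_text(self, text: str) -> None: ...
    def press(self, keys: str) -> None: ...


class RecordingDesktop:
    """Fake backend that records calls instead of touching a screen (for dry runs)."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def move(self, x: int, y: int) -> None:
        self.calls.append(("move", x, y))

    def click(self, x: int, y: int, button: str = "left", count: int = 1) -> None:
        self.calls.append(("click", x, y, button, count))

    def type_text(self, text: str) -> None:
        self.calls.append(("type", text))

    def press(self, keys: str) -> None:
        self.calls.append(("press", keys))


def handle_computer_action(desktop: Desktop, tool_input: dict) -> str:
    """Route one computer-tool action to the backend and return a status string."""
    action = tool_input["action"]
    handlers: dict[str, Callable[[], None]] = {
        "mouse_move": lambda: desktop.move(*tool_input["coordinate"]),
        "left_click": lambda: desktop.click(*tool_input["coordinate"]),
        "right_click": lambda: desktop.click(*tool_input["coordinate"], button="right"),
        "double_click": lambda: desktop.click(*tool_input["coordinate"], count=2),
        "type": lambda: desktop.type_text(tool_input["text"]),
        "key": lambda: desktop.press(tool_input["text"]),
    }
    handler = handlers.get(action)
    if handler is None:
        # Tell Claude the truth so it can pick another action
        return f"Unsupported action: {action}"
    handler()
    return f"OK: {action}"


if __name__ == "__main__":
    desktop = RecordingDesktop()
    print(handle_computer_action(desktop, {"action": "left_click", "coordinate": [100, 200]}))
    print(desktop.calls)
```

Keeping the backend behind an interface also makes the loop testable without a display.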
Practical reliability patterns
Computer use requires careful prompting to be reliable. These patterns improve completion rates:
1. Specify the exact success state
task = """
Navigate to https://example.com/forms/contact.
Fill in:
- Name: John Smith
- Email: john@example.com
- Subject: Demo request
- Message: I'd like to schedule a demo.
Click Submit.
STOP when you see a confirmation message on screen.
If you see an error, report it and stop.
"""
Without an explicit stop condition, Claude may keep clicking around after completing the task.
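A prompt-level stop condition is worth backing with a hard limit in code, since a prompt is only a request. The sketch below assumes a hypothetical `step` callable that wraps one request-and-tool-execution cycle and returns `None` while Claude is still asking for tools; `run_with_step_budget` and `StepBudgetExceeded` are illustrative names.

```python
from typing import Callable, Optional


class StepBudgetExceeded(RuntimeError):
    """Raised when the task has not finished within the allowed number of steps."""


def run_with_step_budget(step: Callable[[], Optional[str]], max_steps: int = 30) -> str:
    """Call step() until it returns a final answer or the budget runs out.

    step() is assumed to perform one API call plus tool execution and to return
    the final text when Claude stops requesting tools, or None otherwise.
    """
    for _ in range(max_steps):
        result = step()
        if result is not None:
            return result
    raise StepBudgetExceeded(f"No final answer after {max_steps} steps")


if __name__ == "__main__":
    # Simulated task that finishes on the third cycle
    answers = iter([None, None, "Confirmation message visible."])
    print(run_with_step_budget(lambda: next(answers), max_steps=5))
```

A raised exception is easier to alert on than a run that quietly keeps spending tokens.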
2. Break large tasks into subtasks
Instead of "log in and download all invoices from the last 3 months," break it into:
- "Log in to https://billing.example.com with username/password X/Y"
- "Navigate to the invoices section and identify invoices from January–March 2026"
- "Download each invoice by clicking the Download button for each"
Each subtask is verifiable. If one fails, you know where it broke.
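One way to run such a sequence is sketched below. It assumes a hypothetical `run_task` callable (for example, a function that sends one prompt and returns the final report) and a caller-supplied `is_success` check; neither name comes from an SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubtaskResult:
    prompt: str
    report: str
    ok: bool


def run_subtasks(
    prompts: list[str],
    run_task: Callable[[str], str],
    is_success: Callable[[str], bool],
) -> list[SubtaskResult]:
    """Run subtasks in order and stop at the first one that fails verification."""
    results: list[SubtaskResult] = []
    for prompt in prompts:
        report = run_task(prompt)
        ok = is_success(report)
        results.append(SubtaskResult(prompt, report, ok))
        if not ok:
            break  # later subtasks depend on this one, so do not continue
    return results


if __name__ == "__main__":
    # Stand-in for a real computer use call: the second subtask "fails"
    fake_reports = {"log in": "DONE", "find invoices": "ERROR: page not found", "download": "DONE"}
    results = run_subtasks(
        ["log in", "find invoices", "download"],
        run_task=lambda p: fake_reports[p],
        is_success=lambda r: r.startswith("DONE"),
    )
    for r in results:
        print(r.prompt, "->", "ok" if r.ok else "failed")
```

The last entry in the returned list shows exactly which subtask broke and what Claude reported.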
3. Add verification steps
task = """
Fill out the form at https://example.com/survey.
After clicking Submit:
1. Take a screenshot
2. Confirm the confirmation message is visible
3. Report the exact text of the confirmation message
"""
This produces an audit trail and catches silent failures (the form appeared to submit but actually didn't).
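The reported confirmation text can then be checked in code and stored. This is a minimal sketch under the assumption that you know roughly what the confirmation should say; `check_confirmation` and the record layout are illustrative.

```python
import json
import time


def check_confirmation(report: str, expected_phrases: list[str]) -> dict:
    """Build an audit record stating which expected phrases appear in Claude's report."""
    lowered = report.lower()
    matched = [p for p in expected_phrases if p.lower() in lowered]
    return {
        "timestamp": time.time(),
        "report": report,
        "matched": matched,
        "verified": bool(matched),  # True if at least one expected phrase was found
    }


if __name__ == "__main__":
    record = check_confirmation(
        'The page shows: "Thank you, your response has been recorded."',
        expected_phrases=["thank you", "has been recorded"],
    )
    # One JSON object per line is easy to append to a log file
    print(json.dumps(record))
```

A record with `verified` set to false is the signal to retry or escalate to a person.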
Current limitations (2026)
Latency: Each action cycle (screenshot → decision → action) takes 2–8 seconds. A task requiring 20 actions takes 1–3 minutes. Not suitable for real-time use cases.
OCR reliability: Claude reads screen text from screenshots. Small fonts, low-contrast text, and complex layouts reduce reliability. Standard UI components work well; custom-rendered interfaces are unpredictable.
Multi-monitor support: Computer use works reliably on a single monitor. Multi-monitor setups require careful coordinate mapping.
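The coordinate mapping might look like the sketch below. It assumes Claude sees a screenshot of one monitor, possibly downscaled, and that clicks must be issued in the coordinates of the whole virtual desktop; `Monitor` and `to_desktop_coords` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    left: int    # x offset of this monitor within the virtual desktop
    top: int     # y offset of this monitor within the virtual desktop
    width: int   # monitor width in physical pixels
    height: int  # monitor height in physical pixels


def to_desktop_coords(
    x: int, y: int, monitor: Monitor, shot_width: int, shot_height: int
) -> tuple[int, int]:
    """Map a point in a screenshot of `monitor` to virtual-desktop coordinates."""
    if not (0 <= x < shot_width and 0 <= y < shot_height):
        raise ValueError(f"({x}, {y}) is outside the {shot_width}x{shot_height} screenshot")
    scale_x = monitor.width / shot_width    # undo any downscaling of the screenshot
    scale_y = monitor.height / shot_height
    return monitor.left + round(x * scale_x), monitor.top + round(y * scale_y)


if __name__ == "__main__":
    # Second monitor sits to the right of a 1920-pixel-wide primary display
    second = Monitor(left=1920, top=0, width=2732, height=1536)
    print(to_desktop_coords(683, 384, second, shot_width=1366, shot_height=768))
    # (3286, 768)
```

Rejecting out-of-range points early catches cases where the declared display size and the actual screenshot size disagree.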
Dynamic content: JavaScript-rendered content that changes after load (infinite scroll, lazy loading) requires explicit waiting instructions. Add "wait for the page to fully load before proceeding" when necessary.
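Waiting can also be enforced in the tool layer rather than only in the prompt. The sketch below polls until two consecutive screenshots are identical; `capture` is a hypothetical function returning raw screenshot bytes, and `sleep` and `clock` are injectable so the logic can be tested without real delays.

```python
import time
from typing import Callable


def wait_until_stable(
    capture: Callable[[], bytes],
    interval: float = 0.5,
    timeout: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll capture() until two consecutive frames match. Returns False on timeout."""
    deadline = clock() + timeout
    previous = capture()
    while clock() < deadline:
        sleep(interval)
        current = capture()
        if current == previous:
            return True  # nothing changed between two polls
        previous = current
    return False


if __name__ == "__main__":
    # Simulated page that stops changing after the second frame
    frames = iter([b"loading", b"loaded", b"loaded"])
    print(wait_until_stable(lambda: next(frames), sleep=lambda _: None))
    # True
```

Exact byte equality is a strict test: a blinking cursor or an on-screen clock will keep it from ever passing, so a real implementation may need to compare a cropped region or tolerate small differences.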
File system limitations: Computer use can interact with files visible in the desktop GUI. For programmatic file operations, use the bash tool directly.
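A handler for the bash tool could be as small as the sketch below. It assumes the tool input carries the shell command in a `command` field and runs each command in a fresh process, which is a simplification: shell state such as the working directory does not persist between calls. `run_bash` is a hypothetical name.

```python
import subprocess


def run_bash(tool_input: dict, timeout: float = 30.0) -> str:
    """Run one shell command for the bash tool and return its combined output."""
    command = tool_input.get("command")
    if not command:
        return "Error: no command provided"
    try:
        completed = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout} seconds"
    output = completed.stdout + completed.stderr
    # An empty string is ambiguous, so report the exit code instead
    return output if output else f"(no output, exit code {completed.returncode})"


if __name__ == "__main__":
    print(run_bash({"command": "ls /tmp | head -n 3"}))
```

This executes whatever Claude asks for, so it belongs inside a sandbox such as the Docker container described earlier, never on a machine holding credentials or data you care about.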
Cost: Computer use tasks are expensive. Each screenshot + decision cycle uses ~1,000–3,000 tokens. A 20-step task at Opus pricing costs $0.30–$1.00. For tasks over 50 steps, cost can exceed $5 per run.
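The arithmetic behind these figures can be sketched as a small estimator. The price per million tokens is a parameter you must fill in from the current pricing page (the $15 used in the example is an assumption for illustration), and `estimate_run` is a hypothetical helper, not an SDK function.

```python
def estimate_run(
    steps: int,
    tokens_per_step: int,
    usd_per_million_tokens: float,
    seconds_per_step: float,
    resend_history: bool = False,
) -> dict:
    """Rough token, cost, and duration estimate for a computer use run.

    With resend_history=True, step k is assumed to re-send all earlier cycles
    plus its own, so total tokens grow roughly quadratically with step count.
    """
    if resend_history:
        total_tokens = tokens_per_step * steps * (steps + 1) // 2
    else:
        total_tokens = tokens_per_step * steps
    return {
        "tokens": total_tokens,
        "usd": total_tokens * usd_per_million_tokens / 1_000_000,
        "seconds": steps * seconds_per_step,
    }


if __name__ == "__main__":
    # 20 steps at the low and high ends of the ranges quoted above,
    # with an ASSUMED price of $15 per million tokens
    low = estimate_run(20, 1_000, 15.0, seconds_per_step=2)
    high = estimate_run(20, 3_000, 15.0, seconds_per_step=8)
    print(low)   # 20,000 tokens, about $0.30, 40 seconds
    print(high)  # 60,000 tokens, about $0.90, 160 seconds
```

Under that assumed price the flat model lands close to the 20-step range quoted above. Setting `resend_history=True` shows why long runs get expensive quickly: every extra step pays again for the conversation so far, which is a strong argument for splitting work into short subtasks.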
Frequently asked questions
Does computer use require Claude Opus, or can I use Sonnet? Anthropic recommends Claude Opus for computer use due to its superior visual reasoning. Sonnet can be used but produces less reliable results on complex interfaces.
Can computer use access any website, including authenticated ones? Yes, if you configure the browser session with the correct cookies or credentials. You can import browser cookies into the Docker environment or use the bash tool to log in before the main task.
Is computer use available on the free tier? Computer use requires API access; it is not available in claude.ai. There's no separate computer use fee: you pay standard token rates for the model you choose (Opus rates if you follow the recommendation above).
Can I run computer use without Docker? Yes, if you provide your own screenshot/click/type infrastructure and implement the tool execution handlers. The Docker setup is a reference implementation. For production use, you'll likely build custom tooling.
What's the difference between computer use and Claude Code? Claude Code is a CLI agent that operates on your local filesystem and runs commands via a terminal. Computer use controls a full graphical desktop. Use Claude Code for software development tasks; use computer use for GUI applications and web automation without APIs.