Evaluate AI Agents at Every Level
Annotate agent traces from any framework. Rate entire trajectories, individual steps, or specific reasoning errors. Watch agents work in real-time. Compare approaches side by side.
Five Purpose-Built Display Types
Each display type is optimized for a different agent modality — tool-use, web browsing, coding, chat, or live observation.
Agent Trace Display
Color-coded step cards for tool-using agents. Thought, action, observation, and error steps with collapsible sections and JSON pretty-printing.
Web Agent Trace Viewer
Screenshots with SVG overlays showing click locations, bounding boxes, and scroll paths. Filmstrip thumbnail bar for quick navigation.
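As a rough sketch of how a click overlay of this kind can be composed: an SVG whose viewBox matches the screenshot dimensions is stacked on top of the image, with a marker at the click point. The function and styling below are illustrative, not Potato's actual markup.

```python
def click_overlay_svg(x, y, width, height, radius=12):
    """Build an SVG overlay (illustrative) marking a click at (x, y).

    The viewBox matches the screenshot dimensions, so the overlay can be
    absolutely positioned over the image and coordinates line up 1:1.
    """
    return (
        f'<svg viewBox="0 0 {width} {height}" '
        f'xmlns="http://www.w3.org/2000/svg">'
        f'<circle cx="{x}" cy="{y}" r="{radius}" '
        f'fill="none" stroke="red" stroke-width="3"/>'
        f'</svg>'
    )
```

Bounding boxes and scroll paths follow the same pattern with `<rect>` and `<path>` elements.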
Interactive Chat Display
Chat live with AI agents or review recorded conversations. Per-turn ratings appear inline below each message.
Live Agent Viewer
Real-time observation with pause, resume, send-instruction, and take-over controls. Streams agent actions via SSE as the agent works.
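For a sense of what consuming such a stream involves, here is a minimal SSE parser. The wire format (one `event:`/`data:` block per action, terminated by a blank line) is standard SSE; the `agent_action` event name below is an assumption, not Potato's documented schema.

```python
def parse_sse_stream(lines):
    """Parse Server-Sent Events from an iterable of text lines.

    Yields (event, data) tuples. A blank line dispatches the pending
    event, per the SSE wire format; multi-line data fields are joined
    with newlines.
    """
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line = dispatch the accumulated event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

A client would feed this the line iterator of a streaming HTTP response and react to each `(event, data)` pair as it arrives.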
Coding Trace Display
Purpose-built for coding agents. Unified diff view, dark terminal blocks, line-numbered file reads, and a file tree sidebar.
Annotation Schemas for Agents
Purpose-built schemas for structured agent evaluation at trace, step, and comparison levels.
Trajectory Evaluation
Per-step error localization with hierarchical error taxonomies, severity scoring, and a running score tracker that decrements based on severity.
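The running-score mechanic is simple to sketch: each annotated error subtracts a severity-dependent penalty from the trajectory's score. The severity names and penalty values below are assumptions for illustration, not Potato's shipped taxonomy.

```python
SEVERITY_PENALTY = {"minor": 0.5, "major": 1.0, "critical": 2.0}  # assumed scale

def running_score(errors, start=10.0, floor=0.0):
    """Decrement a running trajectory score by each error's severity.

    `errors` is a list of (step_index, severity) annotations; returns
    the score after each annotated error, never dropping below `floor`.
    """
    score, trace = start, []
    for step, severity in errors:
        score = max(floor, score - SEVERITY_PENALTY[severity])
        trace.append((step, score))
    return trace
```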
Rubric Evaluation
MT-Bench-style multi-criteria grid. Define custom criteria and rating scales. Annotators rate each dimension independently.
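Once annotators have rated each dimension independently, the grid reduces to a per-criterion summary. A minimal aggregation sketch, assuming ratings arrive as criterion-to-scores mappings (the field names are illustrative):

```python
from statistics import mean

def rubric_summary(ratings):
    """Aggregate annotator ratings per criterion (illustrative).

    `ratings` maps criterion name -> list of scores from annotators;
    returns the mean per criterion plus an unweighted overall average.
    """
    per = {criterion: mean(scores) for criterion, scores in ratings.items()}
    per["overall"] = mean(per.values())
    return per
```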
Pairwise Comparison
Compare two agent traces side by side. Three modes: binary preference, continuous scale, and per-dimension multi-criteria judgment.
Per-Turn Ratings
Attach rating schemes directly to conversation turns. Configure which speaker types get rated. Ratings appear inline below each turn.
Process Reward
Click the first wrong step and all subsequent steps are auto-marked, or rate each step independently. Export directly to PRM training format.
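The click-first-wrong semantics follow the common process-reward convention that an error invalidates the rest of the trajectory. A sketch of the labeling rule (binary labels are an assumption; real PRM exports may use richer schemas):

```python
def prm_labels(num_steps, first_wrong=None):
    """Label steps for PRM training under click-first-wrong semantics.

    Steps before `first_wrong` are correct (1); the first wrong step and
    every subsequent step are marked incorrect (0). If no step is wrong,
    all steps are correct.
    """
    if first_wrong is None:
        return [1] * num_steps
    return [1 if i < first_wrong else 0 for i in range(num_steps)]
```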
Code Review
GitHub PR-style annotation with inline diff comments, file-level quality ratings, and approve/reject verdicts for coding agent output.
Import Traces from Any Framework
Potato converts traces from 13 agent frameworks into a universal format. Use the CLI converter or real-time webhook ingestion.
| Converter | Source | Key Features |
|---|---|---|
| LangChain / LangSmith | LangChain ecosystem | Hierarchical runs, tool calls |
| Langfuse | Langfuse observability | Observation spans, scores |
| OpenAI | OpenAI API | Function calling, assistants |
| Anthropic Claude | Anthropic API | Tool use, thinking blocks |
| MCP | Model Context Protocol | Tool + resource calls |
| OpenTelemetry | Distributed systems | Span hierarchy, attributes |
| ATIF | Academic format | Standard interchange |
| WebArena | Web benchmarks | Screenshots, element targeting |
| Raw Browser | Browser recordings | HAR + screenshots |
| Claude Code | Anthropic Messages API | Tool use blocks, code diffs |
| Aider | Aider chat sessions | Markdown edit blocks |
| SWE-Agent | Coding benchmarks | Thought/action/observation |
| ReAct | Generic agents | Thought/action/observation |
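The common thread across all 13 converters is that every source format is flattened into a sequence of typed steps. As a minimal sketch of that idea, here is a toy converter for ReAct-style transcripts; the `type`/`content` field names are illustrative, not Potato's actual universal schema.

```python
import re

STEP_LINE = re.compile(r"^(Thought|Action|Observation):\s*(.*)$")

def react_to_steps(text):
    """Convert a ReAct-style transcript into a flat list of step dicts.

    Each Thought/Action/Observation line becomes one typed step, the
    shape a downstream display can render as color-coded step cards.
    """
    steps = []
    for line in text.splitlines():
        m = STEP_LINE.match(line.strip())
        if m:
            steps.append({"type": m.group(1).lower(), "content": m.group(2)})
    return steps
```

Richer sources (hierarchical LangChain runs, OpenTelemetry span trees) need flattening logic on top, but target the same step-list shape.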
Evaluate Coding Agents
Purpose-built rendering for Claude Code, Aider, SWE-Agent, and other AI coding assistants with diff, terminal, and file displays.
Watch Agents Work in Real Time
Observe coding agents as they read files, edit code, and run tests. Intervene when they go wrong.
Ollama
Fully local, no API key required. Any Ollama-compatible model.
Anthropic API
Claude with tool use for coding agent sessions.
Claude Agent SDK
Full Claude Code capabilities with tool use and file operations.
How Potato Compares
The only free, self-hosted tool with coding agent diff rendering, PRM annotation, live observation, and 13-format trace conversion.
| Feature | Potato | LangSmith | Langfuse | Label Studio | Argilla | Scale AI |
|---|---|---|---|---|---|---|
| Trace format support | 13 formats | LangChain only | Langfuse only | Generic | Generic | Custom |
| Per-step annotation | trajectory_eval + PRM | Limited | Limited | Yes | No | Yes |
| Real-time agent observation | Yes | No | No | No | No | No |
| Agent pause/resume/takeover | Yes | No | No | No | No | No |
| Code diff rendering | Yes | No | No | No | No | No |
| Terminal output rendering | Yes | No | No | No | No | No |
| PRM data collection | Yes | No | No | No | No | No |
| Code review with inline comments | Yes | No | No | No | No | No |
| Pairwise agent comparison | 3 modes | No | No | No | No | Yes |
| Multi-criteria rubric | Yes | No | No | No | No | Yes |
| Self-hosted | Yes | No | Yes | Yes | Yes | No |
| Free | Yes | No | Partial | Partial | Yes | No |
14 Ready-to-Run Example Projects
Each example ships with configuration, sample data, and documentation. Run any example in under a minute.
Agent Trace Evaluation
Task success, MAST error taxonomy, per-turn ratings, and span annotation
Web Agent Review
Web browsing traces with screenshots, SVG overlays, and filmstrip
Web Agent Creation
Annotators browse the web; their interactions become trace data
Live Agent Evaluation
Watch an AI agent browse in real-time with pause and instruct controls
Interactive VLM Evaluation
Vision-language model observation with trajectory_eval scoring
SWE-bench Evaluation
Coding agent patch evaluation with diff rendering and PRM
Anthropic Evaluation
Claude tool-use trace evaluation with per-step correctness
OpenAI Evaluation
OpenAI function calling trace evaluation and error taxonomy
LangChain Integration
Real-time trace ingestion from LangSmith via webhook
Multi-Agent Evaluation
CrewAI, AutoGen, and LangGraph multi-agent coordination traces
Agent Comparison
Side-by-side A/B agent comparison with binary preference
Multi-Dimension Comparison
Per-dimension pairwise judgment with required justification
RAG Evaluation
RAG pipeline: retrieval relevance, faithfulness, and citations
Visual Agent Evaluation
GUI agent grounding accuracy and navigation scoring
Start Evaluating Agents Today
Install Potato and run your first agent evaluation in under five minutes. Free, open-source, and self-hosted.