New in v2.4

Evaluate AI Agents at Every Level

Annotate agent traces from any framework. Rate entire trajectories, individual steps, or specific reasoning errors. Watch agents work in real-time. Compare approaches side by side.

5 Display Types
13 Trace Formats
14 Example Projects
3 Live Backends

Five Purpose-Built Display Types

Each display type is optimized for a different agent modality — tool-use, web browsing, coding, chat, or live observation.

🔄

Agent Trace Display

Color-coded step cards for tool-using agents. Thought, action, observation, and error steps with collapsible sections and JSON pretty-printing.

🌐

Web Agent Trace Viewer

Screenshots with SVG overlays showing click locations, bounding boxes, and scroll paths. Filmstrip thumbnail bar for quick navigation.

💬

Interactive Chat Display

Live conversation with AI agents or review of recorded conversations. Per-turn ratings appear inline below each message.

📡

Live Agent Viewer

Real-time observation with pause, resume, send-instruction, and take-over controls. Streams agent actions over SSE as the agent works.
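On the client side, a live stream like this arrives as Server-Sent Events. The sketch below is illustrative only: it shows a minimal SSE parser and a hypothetical `agent_step` event with a made-up payload shape, not Potato's actual event schema.

```python
import json

def parse_sse(stream_text):
    """Parse Server-Sent Events text into (event, data) pairs.

    Minimal parser for the SSE wire format: events are separated by
    blank lines, with "event:" and "data:" fields.
    """
    events = []
    event_type, data_lines = "message", []
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = "message", []
    if data_lines:  # stream ended without a trailing blank line
        events.append((event_type, "\n".join(data_lines)))
    return events

# Hypothetical event as an agent edits a file
raw = (
    "event: agent_step\n"
    'data: {"type": "action", "tool": "edit_file"}\n'
    "\n"
)
for name, payload in parse_sse(raw):
    step = json.loads(payload)
    print(name, step["tool"])  # → agent_step edit_file
```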

💻

Coding Trace Display

Purpose-built for coding agents. Unified diff view, dark terminal blocks, line-numbered file reads, and a file tree sidebar.

Annotation Schemas for Agents

Purpose-built schemas for structured agent evaluation at trace, step, and comparison levels.

trajectory_eval

Trajectory Evaluation

Per-step error localization with hierarchical error taxonomies, severity scoring, and a running score tracker that decrements based on severity.
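The running-score mechanic can be sketched as a fold over per-step severities. The penalty weights and starting score below are illustrative assumptions, not Potato's actual defaults.

```python
# Illustrative severity weights; Potato's taxonomy and values may differ.
SEVERITY_PENALTY = {"minor": 0.5, "major": 1.0, "critical": 2.0}

def running_score(step_errors, start=10.0):
    """Yield the score after each step.

    step_errors is one entry per step: None for a correct step,
    or a severity label for a localized error.
    """
    score = start
    for severity in step_errors:
        if severity is not None:
            score -= SEVERITY_PENALTY[severity]
        yield score

print(list(running_score([None, "minor", None, "critical"])))
# → [10.0, 9.5, 9.5, 7.5]
```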

rubric_eval

Rubric Evaluation

MT-Bench-style multi-criteria grid. Define custom criteria and rating scales. Annotators rate each dimension independently.

pairwise

Pairwise Comparison

Compare two agent traces side by side. Three modes: binary preference, continuous scale, and per-dimension multi-criteria judgment.

per_turn_rating

Per-Turn Ratings

Attach rating schemes directly to conversation turns. Configure which speaker types get rated. Ratings appear inline below each turn.

process_reward

Process Reward

Click the first wrong step and all subsequent steps are auto-marked, or rate each step independently. Export directly to the PRM training format.
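The first-error workflow reduces to a simple labeling rule: steps before the clicked step are correct, the clicked step and everything after are wrong. The function below sketches only that logic; the actual PRM export schema is Potato's.

```python
def prm_labels(num_steps, first_wrong=None):
    """First-error process-reward labels.

    Steps before the first wrong step get 1 (correct); the wrong
    step and all subsequent steps get 0. With no error, all steps
    are labeled correct.
    """
    if first_wrong is None:
        return [1] * num_steps
    return [1 if i < first_wrong else 0 for i in range(num_steps)]

print(prm_labels(5, first_wrong=2))  # → [1, 1, 0, 0, 0]
```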

code_review

Code Review

GitHub PR-style annotation with inline diff comments, file-level quality ratings, and approve/reject verdicts for coding agent output.

Import Traces from Any Framework

Potato converts traces from 13 agent frameworks into a universal format. Use the CLI converter or real-time webhook ingestion.

Converter | Source | Key Features
LangChain / LangSmith | LangChain ecosystem | Hierarchical runs, tool calls
Langfuse | Langfuse observability | Observation spans, scores
OpenAI | OpenAI API | Function calling, assistants
Anthropic Claude | Anthropic API | Tool use, thinking blocks
MCP | Model Context Protocol | Tool + resource calls
OpenTelemetry | Distributed systems | Span hierarchy, attributes
ATIF | Academic format | Standard interchange
WebArena | Web benchmarks | Screenshots, element targeting
Raw Browser | Browser recordings | HAR + screenshots
Claude Code | Anthropic Messages API | Tool use blocks, code diffs
Aider | Aider chat sessions | Markdown edit blocks
SWE-Agent | Coding benchmarks | Thought/action/observation
ReAct | Generic agents | Thought/action/observation
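As an illustration of what such a conversion does, the sketch below maps a ReAct-style transcript to a flat list of typed steps. The output schema here is invented for the example; Potato's universal trace format may differ.

```python
import re

def react_to_steps(transcript):
    """Split 'Thought:/Action:/Observation:' lines into typed steps."""
    steps = []
    pattern = re.compile(r"^(Thought|Action|Observation):\s*(.*)$")
    for line in transcript.splitlines():
        m = pattern.match(line.strip())
        if m:
            steps.append({"type": m.group(1).lower(), "content": m.group(2)})
    return steps

trace = """Thought: I need the file contents.
Action: read_file(path="app.py")
Observation: 120 lines returned."""

print([s["type"] for s in react_to_steps(trace)])
# → ['thought', 'action', 'observation']
```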

Evaluate Coding Agents

Purpose-built rendering for Claude Code, Aider, SWE-Agent, and other AI coding assistants with diff, terminal, and file displays.

Unified diff view with red/green highlighting
Dark terminal blocks for bash output
File tree sidebar showing all files touched
Process reward annotation for PRM training
GitHub PR-style inline code review
Converters for Claude Code, Aider, SWE-Agent
# Quick start
pip install potato-annotation
potato start examples/agent-traces/coding-agent-eval/config.yaml -p 8000

Watch Agents Work in Real Time

Observe coding agents as they read files, edit code, and run tests. Intervene when they go wrong.

Ollama

Fully local, no API key required. Any Ollama-compatible model.

Anthropic API

Claude with tool use for coding agent sessions.

Claude Agent SDK

Full Claude Code capabilities with tool use and file operations.

Pause / Resume
Send Instructions
Rollback to Checkpoint
Branch & Replay

How Potato Compares

The only free, self-hosted tool with coding agent diff rendering, PRM annotation, live observation, and 13-format trace conversion.

Feature | Potato | LangSmith | Langfuse | Label Studio | Argilla | Scale AI
Trace format support | 13 formats | LangChain only | Langfuse only | Generic | Generic | Custom
Per-step annotation | trajectory_eval + PRM | Limited | Limited | Yes | No | Yes
Real-time agent observation | Yes | No | No | No | No | No
Agent pause/resume/takeover | Yes | No | No | No | No | No
Code diff rendering | Yes | No | No | No | No | No
Terminal output rendering | Yes | No | No | No | No | No
PRM data collection | Yes | No | No | No | No | No
Code review with inline comments | Yes | No | No | No | No | No
Pairwise agent comparison | 3 modes | No | No | No | No | Yes
Multi-criteria rubric | Yes | No | No | No | No | Yes
Self-hosted | Yes | No | Yes | Yes | Yes | No
Free | Yes | No | Partial | Partial | Yes | No

14 Ready-to-Run Example Projects

Each example ships with configuration, sample data, and documentation. Run any example in under a minute.

Agent Trace Evaluation

Task success, MAST error taxonomy, per-turn ratings, and span annotation

Web Agent Review

Web browsing traces with screenshots, SVG overlays, and filmstrip

Web Agent Creation

Annotators browse the web; their interactions become trace data

Live Agent Evaluation

Watch an AI agent browse in real-time with pause and instruct controls

Interactive VLM Evaluation

Vision-language model observation with trajectory_eval scoring

SWE-bench Evaluation

Coding agent patch evaluation with diff rendering and PRM

Anthropic Evaluation

Claude tool-use trace evaluation with per-step correctness

OpenAI Evaluation

OpenAI function calling trace evaluation and error taxonomy

LangChain Integration

Real-time trace ingestion from LangSmith via webhook

Multi-Agent Evaluation

CrewAI, AutoGen, and LangGraph multi-agent coordination traces

Agent Comparison

Side-by-side A/B agent comparison with binary preference

Multi-Dimension Comparison

Per-dimension pairwise judgment with required justification

RAG Evaluation

RAG pipeline: retrieval relevance, faithfulness, and citations

Visual Agent Evaluation

GUI agent grounding accuracy and navigation scoring

Start Evaluating Agents Today

Install Potato and run your first agent evaluation in under five minutes. Free, open-source, and self-hosted.