Evaluate AI Agents at Every Level
Annotate agent traces from any framework. Rate entire trajectories, individual steps, or specific reasoning errors. Watch agents work in real-time. Compare approaches side by side.
Five Purpose-Built Display Types
Each display type is optimized for a different agent modality — tool-use, web browsing, coding, chat, or live observation.
Agent Trace Display
Color-coded step cards for tool-using agents. Thought, action, observation, and error steps with collapsible sections and JSON pretty-printing.
Web Agent Trace Viewer
Screenshots with SVG overlays showing click locations, bounding boxes, and scroll paths. Filmstrip thumbnail bar for quick navigation.
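As a rough sketch of how a click overlay of this kind can be composed: an SVG whose viewBox matches the screenshot dimensions is stacked on top of the image, with a marker at the click point. The function and styling below are illustrative, not Potato's actual markup.

```python
def click_overlay_svg(x, y, width, height, radius=12):
    """Build an SVG overlay (illustrative) marking a click at (x, y).

    The viewBox matches the screenshot dimensions, so the overlay can be
    absolutely positioned over the image and coordinates line up 1:1.
    """
    return (
        f'<svg viewBox="0 0 {width} {height}" '
        f'xmlns="http://www.w3.org/2000/svg">'
        f'<circle cx="{x}" cy="{y}" r="{radius}" '
        f'fill="none" stroke="red" stroke-width="3"/>'
        f'</svg>'
    )
```

Bounding boxes and scroll paths follow the same pattern with `<rect>` and `<path>` elements.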
Interactive Chat Display
Chat live with AI agents or review recorded conversations. Per-turn ratings appear inline below each message.
Live Agent Viewer
Real-time observation with pause, resume, send-instruction, and take-over controls. Streams agent actions via SSE as the agent works.
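For a sense of what consuming such a stream involves, here is a minimal SSE parser. The wire format (one `event:`/`data:` block per action, terminated by a blank line) is standard SSE; the `agent_action` event name below is an assumption, not Potato's documented schema.

```python
def parse_sse_stream(lines):
    """Parse Server-Sent Events from an iterable of text lines.

    Yields (event, data) tuples. A blank line dispatches the pending
    event, per the SSE wire format; multi-line data fields are joined
    with newlines.
    """
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line = dispatch the accumulated event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

A client would feed this the line iterator of a streaming HTTP response and react to each `(event, data)` pair as it arrives.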
Coding Trace Display
Purpose-built for coding agents. Unified diff view, dark terminal blocks, line-numbered file reads, and a file tree sidebar.
Annotation Schemas for Agents
Purpose-built schemas for structured agent evaluation at trace, step, and comparison levels.
Trajectory Evaluation
Per-step error localization with hierarchical error taxonomies, severity scoring, and a running score tracker that decrements based on severity.
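The running-score mechanic is simple to sketch: each annotated error subtracts a severity-dependent penalty from the trajectory's score. The severity names and penalty values below are assumptions for illustration, not Potato's shipped taxonomy.

```python
SEVERITY_PENALTY = {"minor": 0.5, "major": 1.0, "critical": 2.0}  # assumed scale

def running_score(errors, start=10.0, floor=0.0):
    """Decrement a running trajectory score by each error's severity.

    `errors` is a list of (step_index, severity) annotations; returns
    the score after each annotated error, never dropping below `floor`.
    """
    score, trace = start, []
    for step, severity in errors:
        score = max(floor, score - SEVERITY_PENALTY[severity])
        trace.append((step, score))
    return trace
```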
Rubric Evaluation
MT-Bench-style multi-criteria grid. Define custom criteria and rating scales. Annotators rate each dimension independently.
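Once annotators have rated each dimension independently, the grid reduces to a per-criterion summary. A minimal aggregation sketch, assuming ratings arrive as criterion-to-scores mappings (the field names are illustrative):

```python
from statistics import mean

def rubric_summary(ratings):
    """Aggregate annotator ratings per criterion (illustrative).

    `ratings` maps criterion name -> list of scores from annotators;
    returns the mean per criterion plus an unweighted overall average.
    """
    per = {criterion: mean(scores) for criterion, scores in ratings.items()}
    per["overall"] = mean(per.values())
    return per
```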
Pairwise Comparison
Compare two agent traces side by side. Three modes: binary preference, continuous scale, and per-dimension multi-criteria judgment.
Per-Turn Ratings
Attach rating schemes directly to conversation turns. Configure which speaker types get rated. Ratings appear inline below each turn.
Process Reward
Click the first wrong step and all subsequent steps are auto-marked, or rate each step independently. Export directly to PRM training format.
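The click-first-wrong semantics follow the common process-reward convention that an error invalidates the rest of the trajectory. A sketch of the labeling rule (binary labels are an assumption; real PRM exports may use richer schemas):

```python
def prm_labels(num_steps, first_wrong=None):
    """Label steps for PRM training under click-first-wrong semantics.

    Steps before `first_wrong` are correct (1); the first wrong step and
    every subsequent step are marked incorrect (0). If no step is wrong,
    all steps are correct.
    """
    if first_wrong is None:
        return [1] * num_steps
    return [1 if i < first_wrong else 0 for i in range(num_steps)]
```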
Code Review
GitHub PR-style annotation with inline diff comments, file-level quality ratings, and approve/reject verdicts for coding agent output.
Import Traces from Any Framework
Potato converts traces from 13 agent frameworks into a universal format. Use the CLI converter or real-time webhook ingestion.
| Converter | Source | Key Features |
|---|---|---|
| LangChain / LangSmith | LangChain ecosystem | Hierarchical runs, tool calls |
| Langfuse | Langfuse observability | Observation spans, scores |
| OpenAI | OpenAI API | Function calling, assistants |
| Anthropic Claude | Anthropic API | Tool use, thinking blocks |
| MCP | Model Context Protocol | Tool + resource calls |
| OpenTelemetry | Distributed systems | Span hierarchy, attributes |
| ATIF | Academic format | Standard interchange |
| WebArena | Web benchmarks | Screenshots, element targeting |
| Raw Browser | Browser recordings | HAR + screenshots |
| Claude Code | Anthropic Messages API | Tool use blocks, code diffs |
| Aider | Aider chat sessions | Markdown edit blocks |
| SWE-Agent | Coding benchmarks | Thought/action/observation |
| ReAct | Generic agents | Thought/action/observation |
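The common thread across all 13 converters is that every source format is flattened into a sequence of typed steps. As a minimal sketch of that idea, here is a toy converter for ReAct-style transcripts; the `type`/`content` field names are illustrative, not Potato's actual universal schema.

```python
import re

STEP_LINE = re.compile(r"^(Thought|Action|Observation):\s*(.*)$")

def react_to_steps(text):
    """Convert a ReAct-style transcript into a flat list of step dicts.

    Each Thought/Action/Observation line becomes one typed step, the
    shape a downstream display can render as color-coded step cards.
    """
    steps = []
    for line in text.splitlines():
        m = STEP_LINE.match(line.strip())
        if m:
            steps.append({"type": m.group(1).lower(), "content": m.group(2)})
    return steps
```

Richer sources (hierarchical LangChain runs, OpenTelemetry span trees) need flattening logic on top, but target the same step-list shape.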
Evaluate Coding Agents
Purpose-built rendering for Claude Code, Aider, SWE-Agent, and other AI coding assistants with diff, terminal, and file displays.
Watch Agents Work in Real Time
Observe coding agents as they read files, edit code, and run tests. Intervene when they go wrong.
Ollama
Fully local, no API key required. Any Ollama-compatible model.
Anthropic API
Claude with tool use for coding agent sessions.
Claude Agent SDK
Full Claude Code capabilities with tool use and file operations.
How Potato Compares
The only free, self-hosted tool with coding agent diff rendering, PRM annotation, live observation, and 13-format trace conversion.
| Feature | Potato | LangSmith | Langfuse | Label Studio | Argilla | Scale AI |
|---|---|---|---|---|---|---|
| Trace format support | 13 formats | LangChain only | Langfuse only | Generic | Generic | Custom |
| Per-step annotation | trajectory_eval + PRM | Limited | Limited | Yes | No | Yes |
| Real-time agent observation | Yes | No | No | No | No | No |
| Agent pause/resume/takeover | Yes | No | No | No | No | No |
| Code diff rendering | Yes | No | No | No | No | No |
| Terminal output rendering | Yes | No | No | No | No | No |
| PRM data collection | Yes | No | No | No | No | No |
| Code review with inline comments | Yes | No | No | No | No | No |
| Pairwise agent comparison | 3 modes | No | No | No | No | Yes |
| Multi-criteria rubric | Yes | No | No | No | No | Yes |
| Self-hosted | Yes | No | Yes | Yes | Yes | No |
| Free | Yes | No | Partial | Partial | Yes | No |
14 Ready-to-Run Example Projects
Each example ships with configuration, sample data, and documentation. Run any example in under a minute.
Agent Trace Evaluation
Task success, MAST error taxonomy, per-turn ratings, and span annotation
Web Agent Review
Web browsing traces with screenshots, SVG overlays, and filmstrip
Web Agent Creation
Annotators browse the web; their interactions become trace data
Live Agent Evaluation
Watch an AI agent browse in real-time with pause and instruct controls
Interactive VLM Evaluation
Vision-language model observation with trajectory_eval scoring
SWE-bench Evaluation
Coding agent patch evaluation with diff rendering and PRM
Anthropic Evaluation
Claude tool-use trace evaluation with per-step correctness
OpenAI Evaluation
OpenAI function calling trace evaluation and error taxonomy
LangChain Integration
Real-time trace ingestion from LangSmith via webhook
Multi-Agent Evaluation
CrewAI, AutoGen, and LangGraph multi-agent coordination traces
Agent Comparison
Side-by-side A/B agent comparison with binary preference
Multi-Dimension Comparison
Per-dimension pairwise judgment with required justification
RAG Evaluation
RAG pipeline: retrieval relevance, faithfulness, and citations
Visual Agent Evaluation
GUI agent grounding accuracy and navigation scoring
Start Evaluating Agents Today
Install Potato and run your first agent evaluation in under five minutes. Free, open-source, and self-hosted.