Datasets & comparisons

Curate runs into datasets and compare versions side by side.

Datasets turn ad-hoc runs into a fixed set of inputs you can evaluate repeatedly and check for regressions; comparisons put two versions next to each other over those same inputs.

Datasets

A dataset is a named collection of items (inputs, and optionally reference outputs):

POST /v1/datasets creates one; GET /v1/datasets lists them.
GET|POST /v1/datasets/{dataset_id}/items reads and adds items.
POST /v1/runs/_actions/add-to-dataset promotes real runs straight from /runs into a dataset — the fastest way to build a regression set from production traffic you've already seen.
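
As a rough illustration of how the three endpoints above fit together, the sketch below builds (but does not send) the corresponding requests with Python's standard library. Only the HTTP methods and paths come from this page. The base URL, the dataset ID, the run IDs, and every request-body field name (name, input, reference_output, run_ids, dataset_id) are assumptions made for illustration; check the API Reference for the real request shapes and authentication.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the address of your deployment.
BASE_URL = "https://example.invalid"


def build_request(method, path, payload=None):
    """Build (without sending) a JSON request for one of the endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        url=BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# All body field names below are assumed, not taken from the API Reference.
create_dataset = build_request(
    "POST", "/v1/datasets", {"name": "checkout-regressions"}
)

dataset_id = "ds_123"  # hypothetical ID, as if returned by the create call

# Add a hand-written item: an input plus an optional reference output.
add_item = build_request(
    "POST",
    f"/v1/datasets/{dataset_id}/items",
    {"input": "Where is my order?", "reference_output": "Order status reply"},
)

# Promote runs you have already seen into the same dataset.
promote_runs = build_request(
    "POST",
    "/v1/runs/_actions/add-to-dataset",
    {"dataset_id": dataset_id, "run_ids": ["run_a", "run_b"]},
)

for req in (create_dataset, add_item, promote_runs):
    print(req.get_method(), req.full_url)
```

To actually send a request, pass it to urllib.request.urlopen once the base URL and any required credentials are in place.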

Point an eval-run at a dataset to score a whole set with a judge panel and reliability metrics.

Comparisons

A comparison holds the results of running two configurations over the same inputs, so you can read the delta rather than eyeball two tabs:

POST /v1/comparisons creates a comparison; GET /v1/comparisons/{id} and .../items read it back.

Paired with replay, a comparison answers "did this change actually help, across the set?" — not just "did it help on this one run?"
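
To make "reading the delta across the set" concrete, here is a minimal sketch that summarizes paired scores for a baseline and a candidate configuration. The item shape (input_id, baseline_score, candidate_score) is hypothetical and is not the documented response format of the comparison items endpoint; the sample numbers are made up purely to exercise the function.

```python
from statistics import mean

# Hypothetical comparison items: one score per configuration per input.
items = [
    {"input_id": "a", "baseline_score": 0.60, "candidate_score": 0.80},
    {"input_id": "b", "baseline_score": 0.90, "candidate_score": 0.70},
    {"input_id": "c", "baseline_score": 0.50, "candidate_score": 0.75},
]


def summarize(comparison_items):
    """Return the mean per-input delta and counts of wins, losses, and ties."""
    deltas = [
        item["candidate_score"] - item["baseline_score"]
        for item in comparison_items
    ]
    return {
        "mean_delta": mean(deltas),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
        "unchanged": sum(d == 0 for d in deltas),
    }


summary = summarize(items)
print(summary)
```

Reporting the win and loss counts alongside the mean matters because a positive average can hide regressions on individual inputs, which is exactly what a single-run check would miss.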

See the API Reference for the full datasets and comparisons endpoints.

