Datasets & comparisons
Curate runs into datasets and compare versions side by side.
Datasets turn ad-hoc runs into a fixed set you can evaluate and regress against; comparisons put two versions next to each other.
Datasets
A dataset is a named collection of items (inputs, and optionally reference outputs):
- POST /v1/datasets creates one; GET /v1/datasets lists them.
- GET|POST /v1/datasets/{dataset_id}/items reads and adds items.
- POST /v1/runs/_actions/add-to-dataset promotes real runs straight from /runs into a dataset, the fastest way to build a regression set from production traffic you've already seen.
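As a rough illustration of the endpoints above, the following Python sketch builds the three write requests without sending them. Only the HTTP methods and paths come from this guide. The base URL, the bearer-token header, the dataset ID `ds_123`, the run IDs, and every JSON field name (`name`, `input`, `reference_output`, `dataset_id`, `run_ids`) are assumptions made for the example; check the API Reference for the real request shapes.

```python
import json
import urllib.request

# Hypothetical base URL; substitute your own deployment.
BASE_URL = "https://api.example.com"


def build_request(method, path, body=None, token="YOUR_API_KEY"):
    """Build (but do not send) a JSON request for the given endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


# Create a dataset. The "name" field is an assumed payload shape.
create_dataset = build_request(
    "POST", "/v1/datasets", {"name": "checkout-regressions"}
)

# Add one item to an existing dataset. "ds_123" and the item fields are placeholders.
add_item = build_request(
    "POST",
    "/v1/datasets/ds_123/items",
    {"input": "Where is my order?", "reference_output": "Ask for the order ID."},
)

# Promote existing runs into the dataset. "dataset_id" and "run_ids" are assumed names.
promote_runs = build_request(
    "POST",
    "/v1/runs/_actions/add-to-dataset",
    {"dataset_id": "ds_123", "run_ids": ["run_a", "run_b"]},
)

for req in (create_dataset, add_item, promote_runs):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it. Building the request separately from sending it keeps the sketch runnable offline and makes the payloads easy to inspect.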
Point an eval-run at a dataset to score a whole set with a judge panel and reliability metrics.
Comparisons
A comparison holds the results of running two configurations over the same inputs, so you can read the delta rather than eyeball two tabs:
- POST /v1/comparisons creates a comparison; GET /v1/comparisons/{id} and .../items read it back.
Paired with replay, a comparison answers "did this change actually help, across the set?" — not just "did it help on this one run?"
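To make the "across the set" point concrete, here is a small Python sketch that summarizes paired scores for two configurations. The item shape (`input_id`, `baseline_score`, `candidate_score`) and the scores themselves are invented for illustration; this guide does not specify what comparison items actually contain.

```python
from statistics import mean

# Hypothetical per-item results, shaped as they might look after reading
# a comparison's items. Field names and scores are illustrative only.
items = [
    {"input_id": "a", "baseline_score": 0.60, "candidate_score": 0.80},
    {"input_id": "b", "baseline_score": 0.70, "candidate_score": 0.65},
    {"input_id": "c", "baseline_score": 0.50, "candidate_score": 0.75},
]


def summarize(items):
    """Return the mean delta and win/loss/tie counts for candidate vs. baseline."""
    deltas = [i["candidate_score"] - i["baseline_score"] for i in items]
    return {
        "mean_delta": mean(deltas),
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }


summary = summarize(items)
print(summary)
```

With these made-up numbers the candidate wins on two inputs and loses on one, with a mean delta of roughly 0.13. Looking only at input `b` would suggest a regression, and looking only at input `a` would overstate the gain, which is why the paired, whole-set view matters.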
See the API Reference for the full datasets and comparisons endpoints.