Retrieve the complete results of a finished evaluation.
This endpoint returns detailed evaluation metrics and scores for a completed evaluation. It works for both single-model and comparison-mode evaluations.
Path Parameters:
- evaluation_id: Unique evaluation identifier

Returns:
- evaluation_id: The evaluation identifier
- model_id: Primary model that was evaluated
- benchmark_id: Benchmark that was used
- status: Evaluation status (should be completed)
- raw_answers_count: Number of raw answers generated during evaluation
- completed_at: ISO timestamp when evaluation finished
- method (optional): Evaluation method (eval for single model, eval-compare for comparison)
- metrics (optional): Evaluation metrics and scores (single model only)
- model_id_2 (optional): Second model ID (comparison mode only)
- base_model (optional): Base model results (comparison mode only)
- eval_model (optional): Eval model results (comparison mode only)
- comparison (optional): Comparison results between models (comparison mode only)

Raises:
- 404: If the evaluation is not found or does not belong to the authenticated user
- 400: If the evaluation is not yet completed

Example Request:
GET /api/v3/evaluations/eval-xyz789/results
Headers: {"Authorization": "Bearer <api_key>"}
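The request above can be sketched in Python with the standard library. The base URL here is a placeholder, not part of the documented API; substitute your deployment's host.

```python
import json
from urllib import request

# Hypothetical base URL -- replace with your actual API host.
BASE_URL = "https://api.example.com/api/v3"

def build_results_request(evaluation_id: str, api_key: str) -> request.Request:
    """Build the GET request for a completed evaluation's results."""
    url = f"{BASE_URL}/evaluations/{evaluation_id}/results"
    return request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

req = build_results_request("eval-xyz789", "my-api-key")
# results = json.loads(request.urlopen(req).read())  # live network call, not executed here
```

Building the request separately from sending it keeps the URL and auth-header construction easy to verify without a live server.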
Example Response (Single Model):
{
  "evaluation_id": "eval-xyz789",
  "model_id": "nugen-flash-instruct",
  "benchmark_id": "task-abc123",
  "status": "completed",
  "method": "eval",
  "raw_answers_count": 10,
  "completed_at": "2024-01-15T10:45:00Z",
  "metrics": {
    "accuracy": 0.92,
    "relevance": 0.88,
    "average_score": 0.90,
    "total_questions": 10,
    "correct_answers": 9
  }
}
Example Response (Comparison Mode):
{
  "evaluation_id": "eval-xyz789",
  "model_id": "nugen-flash-instruct",
  "model_id_2": "gpt-4",
  "benchmark_id": "task-abc123",
  "status": "completed",
  "method": "eval-compare",
  "raw_answers_count": 20,
  "completed_at": "2024-01-15T10:45:00Z",
  "base_model": {
    "model_id": "nugen-flash-instruct",
    "average_score": 0.92,
    "total_questions": 10
  },
  "eval_model": {
    "model_id": "gpt-4",
    "average_score": 0.85,
    "total_questions": 10
  },
  "comparison": {
    "winner": "nugen-flash-instruct",
    "score_difference": 0.07,
    "statistical_significance": true
  }
}
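Since the two response shapes share a top-level method field, a client can branch on it to handle either form. A minimal sketch using the documented field names (the summarize_results helper itself is illustrative, not part of the API):

```python
def summarize_results(results: dict) -> str:
    """Return a one-line summary for either response shape,
    dispatching on the documented `method` field."""
    if results.get("method") == "eval-compare":
        comp = results["comparison"]
        return f"winner: {comp['winner']} (+{comp['score_difference']:.2f})"
    metrics = results["metrics"]
    return f"average score: {metrics['average_score']:.2f}"

single = {"method": "eval", "metrics": {"average_score": 0.90}}
compare = {"method": "eval-compare",
           "comparison": {"winner": "nugen-flash-instruct",
                          "score_difference": 0.07}}
print(summarize_results(single))   # average score: 0.90
print(summarize_results(compare))  # winner: nugen-flash-instruct (+0.07)
```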
Notes:
- Single model evaluations include the metrics field for results
- Comparison evaluations include the base_model, eval_model, and comparison fields instead
- The method field indicates whether it is a single model (eval) or comparison (eval-compare) evaluation
- Use /evaluations/{evaluation_id}/status first to check whether the evaluation is complete
- A download endpoint is also available at /evaluations/{evaluation_id}/download
- Authentication uses a Bearer header of the form Bearer <token>, where <token> is your auth token
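The recommended flow of checking /evaluations/{evaluation_id}/status before fetching results can be sketched as a polling loop. The fetch callable is injected (any function that GETs an API path and returns the decoded JSON body) so the logic is testable without a live server; the field names follow the documented responses, but the helper itself is an assumption, not part of the API.

```python
import time
from typing import Callable

def wait_for_results(evaluation_id: str,
                     fetch: Callable[[str], dict],
                     poll_seconds: float = 5.0) -> dict:
    """Poll the status endpoint until the evaluation completes,
    then fetch and return its full results."""
    while True:
        status = fetch(f"/evaluations/{evaluation_id}/status")
        if status.get("status") == "completed":
            return fetch(f"/evaluations/{evaluation_id}/results")
        time.sleep(poll_seconds)
```

A 400 from the results endpoint means the evaluation is still running, which is exactly what this loop avoids by waiting for status "completed" first.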