Create a new model evaluation using an existing benchmark.
This endpoint initiates a model evaluation task using a benchmark from your library. It supports both single-model evaluation and a comparison mode in which two models are evaluated side by side.
Request Body:
- model_id (required): Primary model to evaluate
- benchmark_id (required): ID of the benchmark to use for evaluation
- model_id_2 (optional): Second model ID for comparison mode
- custom_metrics (optional): Custom evaluation metrics configuration

Returns:
- evaluation_id: Unique identifier for tracking the evaluation
- model_id: The primary model being evaluated
- status: Initial status (always "pending")
- benchmark_id: The benchmark being used
- created_at: Timestamp when the evaluation was created
- message: Confirmation message

Example Request (Single Model):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"benchmark_id": "task-abc123",
"custom_metrics": ["accuracy", "relevance"]
}
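The request above can be assembled with Python's standard library. The sketch below only builds the request object without sending it; the base URL is a placeholder and not part of this API's documentation:

```python
import json
import urllib.request

def build_create_evaluation_request(api_key: str, payload: dict,
                                    base_url: str = "https://example.com") -> urllib.request.Request:
    """Build (but do not send) the POST /api/v3/evaluations request."""
    return urllib.request.Request(
        url=f"{base_url}/api/v3/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_evaluation_request(
    "<api_key>",
    {"model_id": "claude-3-sonnet-20240229",
     "benchmark_id": "task-abc123",
     "custom_metrics": ["accuracy", "relevance"]},
)
```

Passing the object to `urllib.request.urlopen(req)` would perform the actual call.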
Example Request (Comparison Mode):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"model_id_2": "gpt-4",
"benchmark_id": "task-abc123"
}
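For either form of the request, the required/optional split documented above can be checked client-side before sending. A minimal sketch (the field names come from this page; the validation helper itself is illustrative):

```python
def validate_evaluation_request(body: dict) -> list[str]:
    """Return a list of problems with a create-evaluation request body."""
    errors = []
    # model_id and benchmark_id are required
    for field in ("model_id", "benchmark_id"):
        if not body.get(field):
            errors.append(f"missing required field: {field}")
    # only the documented fields are accepted
    allowed = {"model_id", "benchmark_id", "model_id_2", "custom_metrics"}
    for field in body:
        if field not in allowed:
            errors.append(f"unknown field: {field}")
    return errors
```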
Example Response:
{
"evaluation_id": "eval-xyz789",
"model_id": "claude-3-sonnet-20240229",
"status": "pending",
"benchmark_id": "task-abc123",
"created_at": "2024-01-15T10:30:00Z",
"message": "Evaluation created successfully and queued for execution"
}
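A newly created evaluation always starts in status "pending". Parsing the response shown above and extracting the identifier needed for follow-up requests might look like this (a sketch against the example body):

```python
import json

response_body = '''{
  "evaluation_id": "eval-xyz789",
  "model_id": "claude-3-sonnet-20240229",
  "status": "pending",
  "benchmark_id": "task-abc123",
  "created_at": "2024-01-15T10:30:00Z",
  "message": "Evaluation created successfully and queued for execution"
}'''

data = json.loads(response_body)
# evaluation_id is the key for later status checks
evaluation_id = data["evaluation_id"]
assert data["status"] == "pending"  # initial status is always pending
```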
Notes:
- Include model_id_2 to compare two models side-by-side.
- Use the returned evaluation_id to check evaluation status.
- Poll /evaluations/{evaluation_id}/status to track progress.
- All requests require a Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
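The status-tracking note above can be turned into a simple polling loop. The sketch below takes a fetch_status callable so it stays transport-agnostic; the terminal status names are assumptions, since only "pending" is documented on this page:

```python
import time
from typing import Callable

def wait_for_evaluation(evaluation_id: str,
                        fetch_status: Callable[[str], str],
                        poll_interval: float = 5.0,
                        max_polls: int = 120) -> str:
    """Poll until the evaluation reaches a terminal state."""
    terminal = {"completed", "failed"}  # assumed terminal statuses
    for _ in range(max_polls):
        status = fetch_status(evaluation_id)
        if status in terminal:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"evaluation {evaluation_id} still running after {max_polls} polls")
```

In practice, fetch_status would issue a GET to /evaluations/{evaluation_id}/status with the Bearer header and return the status field from the response.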
Request Schema:
Request body for creating a new model evaluation.

Response Schema:
Returns a unique identifier for the initiated evaluation along with the initial evaluation status. This endpoint starts an asynchronous evaluation process using a specified benchmark and model(s), allowing users to track progress and retrieve results once completed.

- evaluation_id: Unique identifier for the evaluation
- model_id: ID of the model being evaluated
- status: Current status of the evaluation
- benchmark_id: Benchmark ID used
- created_at: ISO timestamp of evaluation creation
- message: Status message
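The response fields above map naturally onto a small typed container. A sketch (field names come from this page; the class itself is illustrative, not part of the API):

```python
from dataclasses import dataclass

@dataclass
class EvaluationCreated:
    """Response schema for evaluation creation."""
    evaluation_id: str   # unique identifier for the evaluation
    model_id: str        # ID of the model being evaluated
    status: str          # current status of the evaluation
    benchmark_id: str    # benchmark ID used
    created_at: str      # ISO timestamp of evaluation creation
    message: str         # status message

resp = EvaluationCreated(
    evaluation_id="eval-xyz789",
    model_id="claude-3-sonnet-20240229",
    status="pending",
    benchmark_id="task-abc123",
    created_at="2024-01-15T10:30:00Z",
    message="Evaluation created successfully and queued for execution",
)
```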