Create a new model evaluation using an existing benchmark.
This endpoint initiates a model evaluation task using a benchmark from your library. It supports both single-model evaluation and a comparison mode in which two models are evaluated side by side.
Request Body:
model_id: Primary model to evaluate (required)
benchmark_id: ID of the benchmark to use for evaluation (required)
model_id_2 (optional): Second model ID for comparison mode
custom_metrics (optional): Custom evaluation metrics configuration
Returns:
evaluation_id: Unique identifier for tracking the evaluation
model_id: The primary model being evaluated
status: Initial status (always "pending")
benchmark_id: The benchmark being used
created_at: Timestamp when the evaluation was created
message: Confirmation message
Example Request (Single Model):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"benchmark_id": "task-abc123",
"custom_metrics": ["accuracy", "relevance"]
}
Example Request (Comparison Mode):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"model_id_2": "gpt-4",
"benchmark_id": "task-abc123"
}
Example Response:
{
"evaluation_id": "eval-xyz789",
"model_id": "claude-3-sonnet-20240229",
"status": "pending",
"benchmark_id": "task-abc123",
"created_at": "2024-01-15T10:30:00Z",
"message": "Evaluation created successfully and queued for execution"
}
Notes:
Provide model_id_2 to compare two models side by side.
Use the returned evaluation_id to check status.
Poll /evaluations/{evaluation_id}/status to track progress.
All requests require a Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
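The request flow above can be sketched in Python using only the standard library. The base URL (API_BASE) is a placeholder, and the helper names are illustrative, not part of this API; only the endpoint path, headers, and body fields come from this document:

```python
import json
import urllib.request

API_BASE = "https://example.com/api/v3"  # hypothetical base URL; substitute your own


def build_evaluation_request(model_id, benchmark_id,
                             model_id_2=None, custom_metrics=None):
    """Assemble the POST /evaluations body; model_id and benchmark_id are required."""
    if not model_id or not benchmark_id:
        raise ValueError("model_id and benchmark_id are required")
    body = {"model_id": model_id, "benchmark_id": benchmark_id}
    if model_id_2 is not None:        # enables comparison mode
        body["model_id_2"] = model_id_2
    if custom_metrics is not None:    # e.g. ["accuracy", "relevance"]
        body["custom_metrics"] = custom_metrics
    return body


def create_evaluation(api_key, body):
    """POST the body to /evaluations and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{API_BASE}/evaluations",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains evaluation_id, status, ...
```

For comparison mode, pass model_id_2 when building the body; the optional keys are simply omitted otherwise, matching the single-model example request.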
Request Body Fields:
model_id: ID of the model to evaluate. Example: "model-xyz789"
benchmark_id: ID of an existing benchmark from the BenchmarkTask table. Example: "benchmark-abc123"
custom_metrics: Custom metrics configuration
model_id_2: ID of the second model for comparison mode (eval-compare). Example: "model-def456"
Returns a unique identifier for the initiated evaluation along with the initial evaluation status. This endpoint starts an asynchronous evaluation process using a specified benchmark and model(s), allowing users to track progress and retrieve results once completed.
Response Fields:
evaluation_id: Unique identifier for the evaluation. Example: "eval-abc123"
model_id: ID of the model being evaluated. Example: "model-xyz789"
status: Current status of the evaluation. Example: "pending"
benchmark_id: Benchmark ID used. Example: "benchmark-abc123"
created_at: Timestamp of evaluation creation. Example: "2024-02-24T10:00:00Z"
message: Status message. Example: "Evaluation created successfully"