Create a new model evaluation using an existing benchmark.
This endpoint initiates a model evaluation task using a benchmark from your library. It supports both single-model evaluation and a comparison mode in which two models are evaluated side by side.
Request Body:
- model_id (required): Primary model to evaluate
- benchmark_id (required): ID of the benchmark to use for evaluation
- model_id_2 (optional): Second model ID for comparison mode
- custom_metrics (optional): Custom evaluation metrics configuration

Returns:
- evaluation_id: Unique identifier for tracking the evaluation
- model_id: The primary model being evaluated
- status: Initial status (always "pending")
- benchmark_id: The benchmark being used
- created_at: Timestamp when the evaluation was created
- message: Confirmation message

Example Request (Single Model):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"benchmark_id": "task-abc123",
"custom_metrics": ["accuracy", "relevance"]
}
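The request above can be assembled with Python's standard library. The sketch below only builds the request object without sending it; the base URL is a placeholder and not part of this API's documentation:

```python
import json
import urllib.request

def build_create_evaluation_request(api_key: str, payload: dict,
                                    base_url: str = "https://example.com") -> urllib.request.Request:
    """Build (but do not send) the POST /api/v3/evaluations request."""
    return urllib.request.Request(
        url=f"{base_url}/api/v3/evaluations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_evaluation_request(
    "<api_key>",
    {"model_id": "claude-3-sonnet-20240229",
     "benchmark_id": "task-abc123",
     "custom_metrics": ["accuracy", "relevance"]},
)
```

Passing the object to `urllib.request.urlopen(req)` would perform the actual call.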
Example Request (Comparison Mode):
POST /api/v3/evaluations
Headers: {"Authorization": "Bearer <api_key>"}
{
"model_id": "claude-3-sonnet-20240229",
"model_id_2": "gpt-4",
"benchmark_id": "task-abc123"
}
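For either form of the request, the required/optional split documented above can be checked client-side before sending. A minimal sketch (the field names come from this page; the validation helper itself is illustrative):

```python
def validate_evaluation_request(body: dict) -> list[str]:
    """Return a list of problems with a create-evaluation request body."""
    errors = []
    # model_id and benchmark_id are required
    for field in ("model_id", "benchmark_id"):
        if not body.get(field):
            errors.append(f"missing required field: {field}")
    # only the documented fields are accepted
    allowed = {"model_id", "benchmark_id", "model_id_2", "custom_metrics"}
    for field in body:
        if field not in allowed:
            errors.append(f"unknown field: {field}")
    return errors
```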
Example Response:
{
"evaluation_id": "eval-xyz789",
"model_id": "claude-3-sonnet-20240229",
"status": "pending",
"benchmark_id": "task-abc123",
"created_at": "2024-01-15T10:30:00Z",
"message": "Evaluation created successfully and queued for execution"
}
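A newly created evaluation always starts in status "pending". Parsing the response shown above and extracting the identifier needed for follow-up requests might look like this (a sketch against the example body):

```python
import json

response_body = '''{
  "evaluation_id": "eval-xyz789",
  "model_id": "claude-3-sonnet-20240229",
  "status": "pending",
  "benchmark_id": "task-abc123",
  "created_at": "2024-01-15T10:30:00Z",
  "message": "Evaluation created successfully and queued for execution"
}'''

data = json.loads(response_body)
# evaluation_id is the key for later status checks
evaluation_id = data["evaluation_id"]
assert data["status"] == "pending"  # initial status is always pending
```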
Notes:
- Include model_id_2 to compare two models side-by-side.
- Use the returned evaluation_id to check evaluation status.
- Poll /evaluations/{evaluation_id}/status to track progress.
- All requests require a Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
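The status-tracking note above can be turned into a simple polling loop. The sketch below takes a fetch_status callable so it stays transport-agnostic; the terminal status names are assumptions, since only "pending" is documented on this page:

```python
import time
from typing import Callable

def wait_for_evaluation(evaluation_id: str,
                        fetch_status: Callable[[str], str],
                        poll_interval: float = 5.0,
                        max_polls: int = 120) -> str:
    """Poll until the evaluation reaches a terminal state."""
    terminal = {"completed", "failed"}  # assumed terminal statuses
    for _ in range(max_polls):
        status = fetch_status(evaluation_id)
        if status in terminal:
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"evaluation {evaluation_id} still running after {max_polls} polls")
```

In practice, fetch_status would issue a GET to /evaluations/{evaluation_id}/status with the Bearer header and return the status field from the response.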
Request Schema:
Request body for creating a new model evaluation.

Response Schema:
Returns a unique identifier for the initiated evaluation along with the initial evaluation status. This endpoint starts an asynchronous evaluation process using a specified benchmark and model(s), allowing users to track progress and retrieve results once completed.

- evaluation_id: Unique identifier for the evaluation
- model_id: ID of the model being evaluated
- status: Current status of the evaluation
- benchmark_id: Benchmark ID used
- created_at: ISO timestamp of evaluation creation
- message: Status message
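The response fields above map naturally onto a small typed container. A sketch (field names come from this page; the class itself is illustrative, not part of the API):

```python
from dataclasses import dataclass

@dataclass
class EvaluationCreated:
    """Response schema for evaluation creation."""
    evaluation_id: str   # unique identifier for the evaluation
    model_id: str        # ID of the model being evaluated
    status: str          # current status of the evaluation
    benchmark_id: str    # benchmark ID used
    created_at: str      # ISO timestamp of evaluation creation
    message: str         # status message

resp = EvaluationCreated(
    evaluation_id="eval-xyz789",
    model_id="claude-3-sonnet-20240229",
    status="pending",
    benchmark_id="task-abc123",
    created_at="2024-01-15T10:30:00Z",
    message="Evaluation created successfully and queued for execution",
)
```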