Create a new model evaluation using an existing benchmark from the database. Supports both single-model and comparison mode (if model_id_2 is provided).
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Request schema for creating a new model evaluation
ID of the model to evaluate
ID of an existing benchmark from the BenchmarkTask table
Judge model provider (anthropic, openai, nugen)
Custom metrics configuration
ID of the second model for comparison mode (eval-compare)
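A minimal request sketch in Python. The endpoint path (`/v1/evaluations`) and the field names other than `model_id_2` are assumptions inferred from the descriptions above, not confirmed by this reference; adjust them to match the actual schema.

```python
import requests

# Hypothetical endpoint URL; substitute the real base URL and path for this API.
URL = "https://api.example.com/v1/evaluations"

# Bearer authentication header, as described above.
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
}

# Request body. Only model_id_2 is named in the description above; the other
# keys are assumed from the field descriptions and may differ in practice.
payload = {
    "model_id": "model-abc123",            # ID of the model to evaluate
    "benchmark_id": 42,                    # ID of an existing BenchmarkTask benchmark
    "judge_provider": "anthropic",         # anthropic, openai, or nugen
    "custom_metrics": {},                  # optional custom metrics configuration
    # "model_id_2": "model-def456",        # uncomment for comparison mode (eval-compare)
}

response = requests.post(URL, json=payload, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())
```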
Successful Response
Response schema for evaluation creation
Unique identifier for the evaluation
ID of the model being evaluated
Current status of the evaluation
ID of the benchmark used
Judge model provider used
ISO 8601 timestamp of evaluation creation
Status message
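A sketch of reading the response fields described above, continuing from the request example. The JSON key names (`evaluation_id`, `status`, `created_at`, `message`) are assumptions inferred from the field descriptions and may differ in the actual payload.

```python
# Parse the creation response and surface the fields listed in the response schema.
# Key names are assumed from the descriptions above.
evaluation = response.json()

evaluation_id = evaluation.get("evaluation_id")  # unique identifier for the evaluation
status = evaluation.get("status")                # current status of the evaluation
created_at = evaluation.get("created_at")        # ISO 8601 creation timestamp

print(f"Evaluation {evaluation_id}: {status} (created {created_at})")
print(evaluation.get("message"))                 # status message
```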