Model Evaluation
Evaluation tasks are structured assessments used to measure the performance and capabilities of Large Language Models (LLMs). These tasks provide quantitative and qualitative insights into how well a model performs across various benchmarks.
By using Eval Tasks, researchers can test model responses against specific datasets and scoring criteria, ensuring that their model’s capabilities align with intended use cases.
Overview
Each evaluation task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning abilities, across various dimensions such as accuracy, reasoning, safety, and domain-specific knowledge.
Task Structure
The generic structure of a Task is defined under Core Resources. An Eval Task specifically consists of three core components:
- A prompt (question)
- A reference (base) model’s response
- A test model’s response
Additional context can be included when necessary for proper evaluation of the responses.
This structure enables direct comparison between different models’ outputs, allowing for systematic evaluation of model performance improvements or regressions.
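A minimal sketch of this structure is shown below. The top-level `messages` key and the `content` field are assumptions based on the Messages section that follows; the authoritative schema is the expanded output at the end of this page.

```json
{
  "messages": [
    { "role": "user", "content": "What causes ocean tides?" },
    { "role": "assistant", "content": "Ocean tides are caused mainly by the Moon's gravitational pull..." },
    { "role": "assistant", "content": "Tides result from the gravitational interaction of the Earth, Moon, and Sun..." }
  ]
}
```

The two `role: assistant` messages are distinguished by their `source_id`, described under Messages below.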
Messages
Each turn consists of sequential messages representing a user prompt, a base model response, and a test model response.
User message
Contains the initial prompt or question (`role: user`).
Base model response
Contains the reference model’s answer (`role: assistant`).
Test model response
Contains the candidate model’s answer (`role: assistant`).
Each message includes a `source_id` that uniquely identifies the source that generated the response. Possible sources are: `user`, `base_model`, and `test_model`.
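For example, the turn sketched above with `source_id` values attached might be represented as follows (content abbreviated; fields other than `role` and `source_id` are illustrative assumptions):

```json
{
  "messages": [
    { "role": "user", "source_id": "user", "content": "What causes ocean tides?" },
    { "role": "assistant", "source_id": "base_model", "content": "Ocean tides are caused mainly by..." },
    { "role": "assistant", "source_id": "test_model", "content": "Tides result from..." }
  ]
}
```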
Message Annotations
Both base model and test model responses are evaluated across multiple dimensions, such as:
- Instruction following
- Truthfulness
- Conciseness
- Format adherence
- Safety
- Overall quality
A message’s `annotations` include the evaluation results for each dimension.
The evaluation dimensions are flexible and can be customized based on project requirements and evaluation objectives.
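As an illustration, a test model response with per-dimension annotations might look like the following. The snake_case dimension keys and the numeric rating scale are assumptions derived from the list above; actual keys and value types depend on the project’s configuration.

```json
{
  "role": "assistant",
  "source_id": "test_model",
  "content": "Tides result from...",
  "annotations": {
    "instruction_following": 4,
    "truthfulness": 5,
    "conciseness": 3,
    "format_adherence": 5,
    "safety": 5,
    "overall_quality": 4
  }
}
```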
Turn-Level Annotations
The `annotations` at the turn level specify preference-related or aggregated information. Some common examples are:
- Selected model identifier: `selected_model_id`
- Likert scale rating: `likert_value`
- Detailed justification for the selection: `justification`
- Any other comparative analysis between model responses
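A hypothetical turn-level `annotations` object combining these fields might look like this (the keys are those listed above; the values are illustrative):

```json
{
  "annotations": {
    "selected_model_id": "test_model",
    "likert_value": 5,
    "justification": "The test model's response follows the requested format more closely and is more concise."
  }
}
```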
Expanded Eval Task Output
This is a sample expanded Eval Task output returned by `/v2/task`.