Evaluation tasks are structured assessments used to measure the performance and capabilities of Large Language Models (LLMs). These tasks provide quantitative and qualitative insights into how well a model performs across various benchmarks.
By using Eval tasks, researchers can test model responses against specific datasets and scoring criteria, ensuring that their model’s capabilities align with intended use cases.
Overview
Each evaluation task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning abilities, across various dimensions such as accuracy, reasoning, safety, and domain-specific knowledge.
Task Structure
The generic structure of a Task is defined under Core Resources. An Eval Task specifically consists of three core components:
A prompt (question)
A reference (base) model’s response
A test model’s response
Additional context can be included when necessary for proper evaluation of the
responses.
This structure enables direct comparison between different models’ outputs, allowing for systematic evaluation of model performance improvements or regressions.
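To make the structure concrete, the sketch below shows one turn as plain Python data. The field names (content, role, source_id, annotations) follow the message format described in the sections that follow; the values are placeholders.
# Minimal sketch of one Eval Task turn: a user prompt followed by the base
# model's response and the test model's response, distinguished by source_id.
eval_turn = {
    "messages": [
        {"content": {"text": "This is an example prompt"},
         "role": "user", "source_id": "user", "annotations": []},
        {"content": {"text": "This is the base model response"},
         "role": "assistant", "source_id": "base_model", "annotations": []},
        {"content": {"text": "This is test model response"},
         "role": "assistant", "source_id": "test_model", "annotations": []},
    ],
    # Turn-level annotations (preferences, Likert ratings) are covered below.
    "annotations": [],
}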
Messages
Each turn consists of sequential messages that represent a user prompt, a base model response, and a test model response.
User message
Contains the initial prompt or question (role: user).
Sample Prompt
get_eval_user_prompts.py
{
  "content": {
    "text": "This is an example prompt"
  },
  "role": "user",
  "source_id": "user",
  "annotations": []
}
Base model response
Contains the reference model’s answer (role: assistant).
Sample Base Model Response
get_eval_base_model_response.py
{
  "content": {
    "text": "This is the base model response"
  },
  "role": "assistant",
  "source_id": "base_model",
  "annotations": [
    { "key": "instruction_following", "value": 3 },
    { "key": "truthfulness", "value": 2 },
    { "key": "conciseness", "value": 3 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 5 }
  ]
}
Test model response
Contains the candidate model’s answer (role: assistant).
Sample Test Model Response
get_eval_test_model_response.py
{
  "content": {
    "text": "This is test model response"
  },
  "role": "assistant",
  "source_id": "test_model",
  "annotations": [
    { "key": "instruction_following", "value": 2 },
    { "key": "truthfulness", "value": 1 },
    {
      "key": "truthfulness_justification",
      "value": "The response incorrectly classifies..."
    },
    { "key": "conciseness", "value": 2 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 3 }
  ]
}
Each message includes a source_id that uniquely identifies the source that generated the response. Possible sources are: user, base_model, test_model.
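When post-processing a turn, the source_id makes it easy to pick out each response. A minimal sketch in Python, assuming the turn has already been parsed into a dictionary shaped like the samples above:
# Key a turn's messages by source_id ("user", "base_model", "test_model").
def messages_by_source(turn):
    return {message["source_id"]: message for message in turn["messages"]}

# Example usage:
#   by_source = messages_by_source(eval_turn)
#   base_text = by_source["base_model"]["content"]["text"]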
Message Annotations
Both base model and test model responses are evaluated across multiple dimensions, such as:
Instruction following
Truthfulness
Conciseness
Format adherence
Safety
Overall quality
A message’s annotations include the evaluation results for each dimension.
The evaluation dimensions are flexible and can be customized based on project
requirements and evaluation objectives.
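Because annotations are a flat list of key/value pairs, a common first step is to collect them into a dictionary so that base and test scores can be compared dimension by dimension. A minimal sketch, assuming numeric scores like those in the samples above (justification strings are skipped):
# Collect a message's annotations into {key: value}, e.g. {"truthfulness": 2, ...}.
def annotation_scores(message):
    return {a["key"]: a["value"] for a in message["annotations"]}

# Per-dimension difference (test minus base), numeric dimensions only.
def score_deltas(base_message, test_message):
    base = annotation_scores(base_message)
    test = annotation_scores(test_message)
    return {
        key: test[key] - base[key]
        for key in base
        if key in test
        and isinstance(base[key], (int, float))
        and isinstance(test[key], (int, float))
    }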
Turn-Level Annotations
The annotations at the turn level specify preference-related or aggregated information. Some common examples are:
Selected model identifier: selected_model_id
Likert scale rating: likert_value
Detailed justification for the selection: justification
Any other comparative analysis between model responses
Sample Turn-Level Annotations
get_turn_annotations.py
[
  {
    "key": "selected_model_id",
    "value": "base_model"
  },
  {
    "key": "likert_value",
    "value": 2
  },
  {
    "key": "justification",
    "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
  }
]
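Turn-level annotations use the same key/value shape as message annotations, so the preference information can be read the same way; a small sketch in Python:
# Read the preference-related annotations from a turn.
def turn_preference(turn):
    values = {a["key"]: a["value"] for a in turn["annotations"]}
    return {
        "selected_model_id": values.get("selected_model_id"),
        "likert_value": values.get("likert_value"),
        "justification": values.get("justification"),
    }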
Expanded Eval Task Output
This is a sample expanded Eval Task output returned by /v2/task.
{
  "task_id": "task_123",
  "project": "project_123",
  "batch": "batch_123",
  "status": "completed",
  "created_at": "2025-01-01T08:31:03.169Z",
  "completed_at": "2025-01-02T04:00:39.923Z",
  "threads": [
    {
      "id": "thread_0",
      "turns": [
        {
          "id": "turn_0",
          "messages": [
            {
              "content": {
                "text": "This is an example prompt"
              },
              "role": "user",
              "source_id": "user",
              "annotations": []
            },
            {
              "content": {
                "text": "This is the base model response"
              },
              "role": "assistant",
              "source_id": "base_model",
              "annotations": [
                { "key": "instruction_following", "value": 3 },
                { "key": "truthfulness", "value": 2 },
                { "key": "conciseness", "value": 3 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 5 }
              ]
            },
            {
              "content": {
                "text": "This is test model response"
              },
              "role": "assistant",
              "source_id": "test_model",
              "annotations": [
                { "key": "instruction_following", "value": 2 },
                { "key": "truthfulness", "value": 1 },
                { "key": "truthfulness_justification", "value": "The response incorrectly classifies..." },
                { "key": "conciseness", "value": 2 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 3 }
              ]
            }
          ],
          "annotations": [
            {
              "key": "selected_model_id",
              "value": "base_model"
            },
            {
              "key": "likert_value",
              "value": 2
            },
            {
              "key": "justification",
              "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
            }
          ]
        }
      ],
      "annotations": []
    }
  ]
}
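As a rough end-to-end sketch, a task in this shape could be fetched and summarized as follows. The base URL, authentication header, and request format are assumptions for illustration only; consult the API reference for the actual /v2/task call. Only the traversal of threads, turns, messages, and annotations follows directly from the sample above.
import requests

BASE_URL = "https://api.example.com"  # assumption: replace with the real API host
API_KEY = "YOUR_API_KEY"              # assumption: replace with real credentials

def get_task(task_id):
    # Assumption: task id as a path segment and Bearer-token authentication.
    response = requests.get(
        f"{BASE_URL}/v2/task/{task_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    response.raise_for_status()
    return response.json()

def summarize(task):
    # Walk threads -> turns -> messages, printing the preferred model and
    # the "overall" score of each response.
    for thread in task["threads"]:
        for turn in thread["turns"]:
            by_source = {m["source_id"]: m for m in turn["messages"]}
            overall = {
                sid: {a["key"]: a["value"] for a in by_source[sid]["annotations"]}.get("overall")
                for sid in ("base_model", "test_model")
                if sid in by_source
            }
            turn_values = {a["key"]: a["value"] for a in turn["annotations"]}
            print(
                f'{turn["id"]}: selected={turn_values.get("selected_model_id")}, '
                f'base overall={overall.get("base_model")}, '
                f'test overall={overall.get("test_model")}'
            )

summarize(get_task("task_123"))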