Evaluation tasks are structured assessments used to measure the performance and capabilities of Large Language Models (LLMs). These tasks provide quantitative and qualitative insights into how well a model performs across various benchmarks.

By using Eval Tasks, researchers can test model responses against specific datasets and scoring criteria, ensuring that a model’s capabilities align with its intended use cases.

Overview

Each evaluation task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning, across dimensions such as accuracy, safety, and domain-specific knowledge.

Task Structure

The generic structure of a Task is defined under Core Resources. An Eval Task specifically consists of three core components:

  • A prompt (question)
  • A reference (base) model’s response
  • A test model’s response

Additional context can be included when necessary for proper evaluation of the responses.

This structure enables direct comparison between different models’ outputs, allowing for systematic evaluation of model performance improvements or regressions.
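
For illustration, this three-part structure maps naturally onto a few small data types. Below is a minimal Python sketch, assuming the field names shown in the JSON examples later in this section; the class names themselves are hypothetical:

from dataclasses import dataclass, field

@dataclass
class Annotation:
    key: str
    value: object  # numeric score or free-text justification

@dataclass
class Message:
    text: str
    role: str       # "user" or "assistant"
    source_id: str  # "user", "base_model", or "test_model"
    annotations: list[Annotation] = field(default_factory=list)

@dataclass
class Turn:
    # A turn holds the prompt, the base model response, and the test model
    # response, plus turn-level annotations such as selected_model_id.
    messages: list[Message] = field(default_factory=list)
    annotations: list[Annotation] = field(default_factory=list)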

Messages

Each turn consists of sequential messages that represent the user prompt, the base model’s response, and the test model’s response.

User message

Contains the initial prompt or question (role: user).

{
  "content": {
    "text": "This is an example prompt"
  },
  "role": "user",
  "source_id": "user",
  "annotations": []
}

Base model response

Contains the reference model’s answer (role: assistant).

{
  "content": {
    "text": "This is the base model response"
  },
  "role": "assistant",
  "source_id": "base_model",
  "annotations": [
    { "key": "instruction_following", "value": 3 },
    { "key": "truthfulness", "value": 2 },
    { "key": "conciseness", "value": 3 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 5 }
  ]
}

Test model response

Contains the candidate model’s answer (role: assistant).

{
  "content": {
    "text": "This is test model response"
  },
  "role": "assistant",
  "source_id": "test_model",
  "annotations": [
    { "key": "instruction_following", "value": 2 },
    { "key": "truthfulness", "value": 1 },
    {
      "key": "truthfulness_justification",
      "value": "The response incorrectly classifies..."
    },
    { "key": "conciseness", "value": 2 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 3 }
  ]
}

Each message includes a source_id that uniquely identifies the source that produced the message. Possible sources are: user, base_model, test_model.
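
Because every message carries a source_id, a specific response can be pulled out of a turn programmatically. A minimal sketch, assuming turns are plain dicts shaped like the examples above (the helper name is hypothetical):

def get_message(turn: dict, source_id: str) -> dict | None:
    """Return the first message in a turn produced by the given source."""
    for message in turn.get("messages", []):
        if message.get("source_id") == source_id:
            return message
    return None

# Usage, given a `turn` dict shaped like the examples above:
# prompt = get_message(turn, "user")
# base   = get_message(turn, "base_model")
# test   = get_message(turn, "test_model")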

Message Annotations

Both base model and test model responses are evaluated across multiple dimensions, such as:

  • Instruction following
  • Truthfulness
  • Conciseness
  • Format adherence
  • Safety
  • Overall quality

A message’s annotations array contains the evaluation results for each dimension.

The evaluation dimensions are flexible and can be customized based on project requirements and evaluation objectives.
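
For example, the numeric annotations on the two assistant messages can be lined up for a per-dimension comparison. A minimal sketch, assuming message dicts shaped like the examples above; free-text entries such as truthfulness_justification are skipped:

def scores(message: dict) -> dict[str, int]:
    """Collect numeric annotation values keyed by dimension."""
    return {
        a["key"]: a["value"]
        for a in message.get("annotations", [])
        if isinstance(a["value"], int)
    }

def compare(base: dict, test: dict) -> dict[str, int]:
    """Per-dimension deltas: positive means the test model scored higher."""
    base_scores, test_scores = scores(base), scores(test)
    return {
        key: test_scores[key] - base_scores[key]
        for key in base_scores.keys() & test_scores.keys()
    }

# With the sample messages above, compare(base, test) would report,
# among others, 'truthfulness': -1 and 'overall': -2.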

Turn-Level Annotations

Annotations at the turn level specify preference-related or aggregated information. Some common examples are:

  • Selected model identifier: selected_model_id
  • Likert scale rating: likert_value
  • Detailed justification for the selection: justification
  • Any other comparative analysis between model responses

[
  {
    "key": "selected_model_id",
    "value": "base_model"
  },
  {
    "key": "likert_value",
    "value": 2
  },
  {
    "key": "justification",
    "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
  }
]
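
Turn-level annotations can be read the same way as message annotations. A minimal sketch, assuming the turn dict carries an annotations list like the one above (the helper name is hypothetical):

def turn_preference(turn: dict) -> tuple[str | None, int | None, str | None]:
    """Extract selected_model_id, likert_value, and justification from a turn."""
    ann = {a["key"]: a["value"] for a in turn.get("annotations", [])}
    return (
        ann.get("selected_model_id"),
        ann.get("likert_value"),
        ann.get("justification"),
    )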

Expanded Eval Task Output

This is a sample of the expanded Eval Task output returned by /v2/task.

Sample Eval Task Output
{
  "task_id": "task_123",
  "project": "project_123",
  "batch": "batch_123",
  "status": "completed",
  "created_at": "2025-01-01T08:31:03.169Z",
  "completed_at": "2025-01-02T04:00:39.923Z",
  "threads": [
    {
      "id": "thread_0",
      "turns": [
        {
          "id": "turn_0",
          "messages": [
            {
              "content": {
                "text": "This is an example prompt"
              },
              "role": "user",
              "source_id": "user",
              "annotations": []
            },
            {
              "content": {
                "text": "This is the base model response"
              },
              "role": "assistant",
              "source_id": "base_model",
              "annotations": [
                { "key": "instruction_following", "value": 3 },
                { "key": "truthfulness", "value": 2 },
                { "key": "conciseness", "value": 3 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 5 }
              ]
            },
            {
              "content": {
                "text": "This is test model response"
              },
              "role": "assistant",
              "source_id": "test_model",
              "annotations": [
                { "key": "instruction_following", "value": 2 },
                { "key": "truthfulness", "value": 1 },
                { "key": "truthfulness_justification", "value": "The response incorrectly classifies..." },
                { "key": "conciseness", "value": 2 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 3 }
              ]
            }
          ],
          "annotations": [
            {
              "key": "selected_model_id",
              "value": "base_model"
            },
            {
              "key": "likert_value",
              "value": 2
            },
            {
              "key": "justification",
              "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
            }
          ]
        }
      ],
      "annotations": []
    }
  ]
}
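
Putting it together, a payload like the one above can be walked thread by thread and turn by turn to summarize which model was preferred. A minimal sketch, assuming the exact shape shown; fetching the task from /v2/task and error handling are out of scope here:

import json

def summarize(task: dict) -> None:
    """Print the preferred model and Likert rating for every turn."""
    for thread in task.get("threads", []):
        for turn in thread.get("turns", []):
            ann = {a["key"]: a["value"] for a in turn.get("annotations", [])}
            print(
                f"{task['task_id']} / {thread['id']} / {turn['id']}: "
                f"preferred={ann.get('selected_model_id', 'n/a')}, "
                f"likert={ann.get('likert_value', 'n/a')}"
            )

# Usage, with the sample output above saved to task.json:
# with open("task.json") as f:
#     summarize(json.load(f))
#
# For the sample payload this prints:
# task_123 / thread_0 / turn_0: preferred=base_model, likert=2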