Evaluation tasks are structured assessments used to measure the performance and capabilities of Large Language Models (LLMs). These tasks provide quantitative and qualitative insights into how well a model performs across various benchmarks.

By using Eval Tasks, researchers can test model responses against specific datasets and scoring criteria, ensuring that a model’s capabilities align with its intended use cases.

Overview

Each evaluation task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning, across dimensions such as accuracy, safety, and domain-specific knowledge.

Task Structure

The generic structure of a Task is defined under Core Resources. An Eval Task specifically consists of three core components:

  • A prompt (question)
  • A reference (base) model’s response
  • A test model’s response

Additional context can be included when necessary for proper evaluation of the responses.

This structure enables direct comparison between different models’ outputs, allowing for systematic evaluation of model performance improvements or regressions.
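
For illustration, this three-part structure maps naturally onto a few small data types. Below is a minimal Python sketch, assuming the field names shown in the JSON examples later in this section; the class names themselves are hypothetical:

from dataclasses import dataclass, field

@dataclass
class Annotation:
    key: str
    value: object  # numeric score or free-text justification

@dataclass
class Message:
    text: str
    role: str       # "user" or "assistant"
    source_id: str  # "user", "base_model", or "test_model"
    annotations: list[Annotation] = field(default_factory=list)

@dataclass
class Turn:
    # A turn holds the prompt, the base model response, and the test model
    # response, plus turn-level annotations such as selected_model_id.
    messages: list[Message] = field(default_factory=list)
    annotations: list[Annotation] = field(default_factory=list)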

Messages

Each turn consists of sequential messages that represent the user prompt, the base model’s response, and the test model’s response.

User message

Contains the initial prompt or question (role: user).

{
  "content": {
    "text": "This is an example prompt"
  },
  "role": "user",
  "source_id": "user",
  "annotations": []
}

Base model response

Contains the reference model’s answer (role: assistant).

{
  "content": {
    "text": "This is the base model response"
  },
  "role": "assistant",
  "source_id": "base_model",
  "annotations": [
    { "key": "instruction_following", "value": 3 },
    { "key": "truthfulness", "value": 2 },
    { "key": "conciseness", "value": 3 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 5 }
  ]
}

Test model response

Contains the candidate model’s answer (role: assistant).

{
  "content": {
    "text": "This is test model response"
  },
  "role": "assistant",
  "source_id": "test_model",
  "annotations": [
    { "key": "instruction_following", "value": 2 },
    { "key": "truthfulness", "value": 1 },
    {
      "key": "truthfulness_justification",
      "value": "The response incorrectly classifies..."
    },
    { "key": "conciseness", "value": 2 },
    { "key": "format", "value": 3 },
    { "key": "safety", "value": 3 },
    { "key": "overall", "value": 3 }
  ]
}

Each message includes a source_id that uniquely identifies the source that produced the message. Possible sources are: user, base_model, test_model.
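
Because every message carries a source_id, a specific response can be pulled out of a turn programmatically. A minimal sketch, assuming turns are plain dicts shaped like the examples above (the helper name is hypothetical):

def get_message(turn: dict, source_id: str) -> dict | None:
    """Return the first message in a turn produced by the given source."""
    for message in turn.get("messages", []):
        if message.get("source_id") == source_id:
            return message
    return None

# Usage, given a `turn` dict shaped like the examples above:
# prompt = get_message(turn, "user")
# base   = get_message(turn, "base_model")
# test   = get_message(turn, "test_model")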

Message Annotations

Both base model and test model responses are evaluated across multiple dimensions, such as:

  • Instruction following
  • Truthfulness
  • Conciseness
  • Format adherence
  • Safety
  • Overall quality

A message’s annotations array contains the evaluation results for each dimension.

The evaluation dimensions are flexible and can be customized based on project requirements and evaluation objectives.
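
For example, the numeric annotations on the two assistant messages can be lined up for a per-dimension comparison. A minimal sketch, assuming message dicts shaped like the examples above; free-text entries such as truthfulness_justification are skipped:

def scores(message: dict) -> dict[str, int]:
    """Collect numeric annotation values keyed by dimension."""
    return {
        a["key"]: a["value"]
        for a in message.get("annotations", [])
        if isinstance(a["value"], int)
    }

def compare(base: dict, test: dict) -> dict[str, int]:
    """Per-dimension deltas: positive means the test model scored higher."""
    base_scores, test_scores = scores(base), scores(test)
    return {
        key: test_scores[key] - base_scores[key]
        for key in base_scores.keys() & test_scores.keys()
    }

# With the sample messages above, compare(base, test) would report,
# among others, 'truthfulness': -1 and 'overall': -2.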

Turn-Level Annotations

Annotations at the turn level specify preference-related or aggregated information. Some common examples are:

  • Selected model identifier: selected_model_id
  • Likert scale rating: likert_value
  • Detailed justification for the selection: justification
  • Any other comparative analysis between model responses

[
  {
    "key": "selected_model_id",
    "value": "base_model"
  },
  {
    "key": "likert_value",
    "value": 2
  },
  {
    "key": "justification",
    "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
  }
]
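
Turn-level annotations can be read the same way as message annotations. A minimal sketch, assuming the turn dict carries an annotations list like the one above (the helper name is hypothetical):

def turn_preference(turn: dict) -> tuple[str | None, int | None, str | None]:
    """Extract selected_model_id, likert_value, and justification from a turn."""
    ann = {a["key"]: a["value"] for a in turn.get("annotations", [])}
    return (
        ann.get("selected_model_id"),
        ann.get("likert_value"),
        ann.get("justification"),
    )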

Expanded Eval Task Output

This is a sample of the expanded Eval Task output returned by /v2/task.

Sample Eval Task Output
{
  "task_id": "task_123",
  "project": "project_123",
  "batch": "batch_123",
  "status": "completed",
  "created_at": "2025-01-01T08:31:03.169Z",
  "completed_at": "2025-01-02T04:00:39.923Z",
  "threads": [
    {
      "id": "thread_0",
      "turns": [
        {
          "id": "turn_0",
          "messages": [
            {
              "content": {
                "text": "This is an example prompt"
              },
              "role": "user",
              "source_id": "user",
              "annotations": []
            },
            {
              "content": {
                "text": "This is the base model response"
              },
              "role": "assistant",
              "source_id": "base_model",
              "annotations": [
                { "key": "instruction_following", "value": 3 },
                { "key": "truthfulness", "value": 2 },
                { "key": "conciseness", "value": 3 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 5 }
              ]
            },
            {
              "content": {
                "text": "This is test model response"
              },
              "role": "assistant",
              "source_id": "test_model",
              "annotations": [
                { "key": "instruction_following", "value": 2 },
                { "key": "truthfulness", "value": 1 },
                { "key": "truthfulness_justification", "value": "The response incorrectly classifies..." },
                { "key": "conciseness", "value": 2 },
                { "key": "format", "value": 3 },
                { "key": "safety", "value": 3 },
                { "key": "overall", "value": 3 }
              ]
            }
          ],
          "annotations": [
            {
              "key": "selected_model_id",
              "value": "base_model"
            },
            {
              "key": "likert_value",
              "value": 2
            },
            {
              "key": "justification",
              "value": "@Response 1 is better than @Response 2. @Response 2 has an issue in Truthfulness ..."
            }
          ]
        }
      ],
      "annotations": []
    }
  ]
}
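
Putting it together, a payload like the one above can be walked thread by thread and turn by turn to summarize which model was preferred. A minimal sketch, assuming the exact shape shown; fetching the task from /v2/task and error handling are out of scope here:

import json

def summarize(task: dict) -> None:
    """Print the preferred model and Likert rating for every turn."""
    for thread in task.get("threads", []):
        for turn in thread.get("turns", []):
            ann = {a["key"]: a["value"] for a in turn.get("annotations", [])}
            print(
                f"{task['task_id']} / {thread['id']} / {turn['id']}: "
                f"preferred={ann.get('selected_model_id', 'n/a')}, "
                f"likert={ann.get('likert_value', 'n/a')}"
            )

# Usage, with the sample output above saved to task.json:
# with open("task.json") as f:
#     summarize(json.load(f))
#
# For the sample payload this prints:
# task_123 / thread_0 / turn_0: preferred=base_model, likert=2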