Overview
Each Rubrics task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning abilities, across natural, human-defined criteria.
Task Structure
A Rubrics task consists of:
- A prompt (question or instruction)
- A list of model evaluation criteria (contributor-defined based on the prompt)
- Multiple model-generated responses
- Human evaluation of the model responses against the criteria
Additional annotations can be included when necessary for standard evaluation
of the responses.
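At a high level, a single-turn task can be pictured as a list of messages (the prompt plus the model responses) together with turn-level annotations. The skeleton below is only a conceptual sketch; the exact layout is shown in the sample output at the end of this page.

```json
{
  "messages": [
    { "role": "user" },
    { "role": "assistant" },
    { "role": "assistant" }
  ],
  "annotations": {}
}
```

The sections that follow describe each part in more detail.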
Messages
Each turn consists of sequential messages that represent a user prompt, multiple model responses, and corresponding human evaluation of the model responses against the criteria.
User Message
Contains the initial prompt or instruction (`role: user`).
Model Responses
Contains responses from different models (`role: assistant`).
Each model response includes a `source_id` that uniquely identifies the model that generated the response.
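As an illustration, a minimal `messages` array might look like the sketch below. This is not the exact `/v2/task` schema; the `text` field, the prompt, and the model identifiers are placeholder assumptions.

```json
{
  "messages": [
    { "role": "user", "text": "List the capital cities of the Nordic countries." },
    { "role": "assistant", "source_id": "model_a", "text": "..." },
    { "role": "assistant", "source_id": "model_b", "text": "..." }
  ]
}
```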
Message Annotations
Each model response is evaluated against the contributor-defined criteria. For example:
- “The response must display information in a table.”
- “The response must sort the names in alphabetical order.”
- “The response must use metric units.”
The message-level `annotations` include the evaluation results for each dimension.
The rubric criteria are unique to the prompt and differ per task.
Additional evaluation dimensions can be included based on project requirements
and evaluation objectives.
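For example, the annotations on a single model response might record whether each criterion was met. The per-criterion boolean form below is an assumption for illustration; the exact encoding depends on the project configuration.

```json
{
  "role": "assistant",
  "source_id": "model_a",
  "annotations": {
    "The response must display information in a table.": true,
    "The response must sort the names in alphabetical order.": true,
    "The response must use metric units.": false
  }
}
```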
Turn-Level Annotations
The `annotations` at the turn level specify contributor-defined rubrics, preference-related information, or aggregated information. Some common examples are:
- Selected model identifier: `selected_model_id`
- Likert scale rating: `likert_value`
- Detailed justification for the selection: `justification`
- Any other comparative analysis between model responses
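A minimal sketch of turn-level annotations, assuming the fields listed above (the values and the Likert scale range are illustrative only):

```json
{
  "annotations": {
    "selected_model_id": "model_a",
    "likert_value": 5,
    "justification": "model_a met more of the rubric criteria and presented the information in a clearer table."
  }
}
```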
Expanded Rubrics Task Output
This is a sample expanded Rubrics Task output returned by `/v2/task`.
Sample Rubrics Task Output