Rubrics
Contributor defined rating criteria
Rubrics tasks are a specific type of Evaluation task where models are assessed based on a dynamic rubric, which is defined by the contributor during the task. These tasks provide quantitative and qualitative insight into how well a model can follow arbitrary criteria.
Overview
Each rubrics task is designed to test specific aspects of model performance, from basic comprehension to complex reasoning abilities, against natural, human-defined criteria.
Task Structure
A Rubrics task consists of:
- A prompt (question or instruction)
- A list of model evaluation criteria (contributor-defined based on the prompt)
- Multiple model-generated responses
- Human evaluation of the model responses against the criteria
Additional annotations can be included when necessary for standard evaluation of the responses.
Messages
Each turn consists of sequential messages that represent a user prompt, multiple model responses, and corresponding human evaluation of the model responses against the criteria.
User message
Contains the initial prompt or instruction (role: `user`).
Model Responses
Contains responses from different models (role: `assistant`). Each model response includes a `source_id` that uniquely identifies the model that generated the response.
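As a rough sketch, the messages in a single turn can be modeled as one user prompt followed by multiple model responses. Only `role` and `source_id` are named in this document; the `text` field and the example values are illustrative assumptions.

```python
# Hypothetical sketch of one turn's messages: a user prompt followed by
# responses from two different models. Field names other than `role` and
# `source_id` are assumptions, not the official schema.
messages = [
    {"role": "user", "text": "List the planets in alphabetical order."},
    {"role": "assistant", "source_id": "model_a", "text": "Earth, Jupiter, ..."},
    {"role": "assistant", "source_id": "model_b", "text": "Mercury, Venus, ..."},
]

# Each assistant message carries a source_id identifying the model
# that generated it.
response_models = [m["source_id"] for m in messages if m["role"] == "assistant"]
print(response_models)
```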
Message Annotations
Each model response is evaluated across the contributor-defined criteria. For example:
- “The response must display information in a table.”
- “The response must sort the names in alphabetical order.”
- “The response must use metric units.”
Each message’s annotations include the evaluation results for each dimension.
The rubric criteria are unique to the prompt and differ per task. Additional evaluation dimensions can be included based on project requirements and evaluation objectives.
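A minimal sketch of what per-response annotations might look like, pairing each contributor-defined criterion with its evaluation result. The criterion strings come from the examples above; the annotation shape (`rubric_results`, `criterion`, `passed`) is an assumption for illustration.

```python
# Hypothetical per-response annotations: each contributor-defined criterion
# is paired with a pass/fail evaluation result. The key names here are
# illustrative assumptions, not the official schema.
annotations = {
    "rubric_results": [
        {"criterion": "The response must display information in a table.", "passed": True},
        {"criterion": "The response must sort the names in alphabetical order.", "passed": False},
        {"criterion": "The response must use metric units.", "passed": True},
    ]
}

# Aggregate how many criteria this response satisfied.
passed = sum(r["passed"] for r in annotations["rubric_results"])
print(f"{passed}/{len(annotations['rubric_results'])} criteria met")
```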
Turn-Level Annotations
The annotations at the turn level specify contributor-defined rubrics, preference-related data, or aggregated information. Some common examples are:
- Selected model identifier: `selected_model_id`
- Likert scale rating: `likert_value`
- Detailed justification for the selection: `justification`
- Any other comparative analysis between model responses
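The turn-level fields above can be sketched as a single annotations object. The key names `selected_model_id`, `likert_value`, and `justification` appear in this document; the values and the Likert scale range are made up for illustration.

```python
# Hypothetical turn-level annotations aggregating the comparison between
# model responses. Values (and the 1-7 Likert range) are assumptions.
turn_annotations = {
    "selected_model_id": "model_b",
    "likert_value": 6,  # e.g. on an assumed 1-7 scale
    "justification": "Model B satisfied all formatting criteria.",
}

print(turn_annotations["selected_model_id"])
```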
Expanded Rubrics Task Output
This is a sample expanded Rubrics task output returned by `/v2/task`.
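As a non-authoritative illustration of how the pieces described above might fit together, the following sketches an expanded task: a user message, model responses tagged with `source_id`, per-response rubric annotations, and turn-level annotations. Every field name beyond `role`, `source_id`, `selected_model_id`, `likert_value`, and `justification` is an assumption, not the official response schema.

```python
# Illustrative (not official) shape of an expanded Rubrics task, combining
# the documented pieces: messages, per-response rubric annotations, and
# turn-level annotations. Structure and values are assumptions.
task = {
    "messages": [
        {"role": "user", "text": "Convert 5 miles to kilometers."},
        {
            "role": "assistant",
            "source_id": "model_a",
            "text": "5 miles is about 8.05 kilometers.",
            "annotations": {
                "rubric_results": [
                    {"criterion": "The response must use metric units.", "passed": True},
                ]
            },
        },
        {
            "role": "assistant",
            "source_id": "model_b",
            "text": "5 miles is 5 miles.",
            "annotations": {
                "rubric_results": [
                    {"criterion": "The response must use metric units.", "passed": False},
                ]
            },
        },
    ],
    "annotations": {
        "selected_model_id": "model_a",
        "likert_value": 7,
        "justification": "Model A answered in metric units as required.",
    },
}

print(task["annotations"]["selected_model_id"])
```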