Reinforcement Learning from Human Feedback
RLHF tasks are structured datasets used to train and fine-tune Large Language Models (LLMs) using human feedback. These tasks help align model behavior with human preferences by providing paired responses with clear preference indicators.
RLHF tasks consist of prompts and multiple model-generated responses, where human annotators provide explicit preferences between responses. This feedback helps models learn which outputs are more desirable according to human judgment across various dimensions such as Helpfulness, Accuracy, Safety, Writing quality, and Task completion.
An RLHF task contains one or more conversation turns. RLHF tasks can be followed by an SFT stage in which the preferred model’s response is rewritten.
Each turn consists of sequential messages that represent a user prompt, multiple model responses, and the corresponding human preference annotations.
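As a rough sketch of that layout, the snippet below shows one possible shape of a single turn as a Python dictionary; the field names and nesting (messages, annotations, text) are assumptions based on the description here, not the platform’s exact schema.

```python
# Illustrative only: one possible shape of a single RLHF turn.
# Field names and nesting are assumptions, not the exact /v2/task schema.
turn = {
    "messages": [
        {"role": "user", "text": "Summarize the attached article in two sentences."},
        {"role": "assistant", "source_id": "model_a", "text": "Response from model A ..."},
        {"role": "assistant", "source_id": "model_b", "text": "Response from model B ..."},
    ],
    # Turn-level preference annotations (described later in this section).
    "annotations": {
        "selected_model_id": "model_a",
        "likert_value": 2,
        "justification": "Model A is more accurate and complete.",
    },
}
```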
The user message contains the initial prompt or instruction (role: user).
The assistant messages contain the responses from different models (role: assistant). They can optionally include rewrite annotations that store human-rewritten responses.
Each model response includes a source_id that uniquely identifies the model that generated the response.
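As an example of how source_id might be used downstream, the sketch below groups a turn’s candidate responses by the model that produced them. It assumes the hypothetical turn layout from the earlier sketch.

```python
def responses_by_model(turn: dict) -> dict[str, str]:
    """Map each source_id to the response text that model generated.

    Assumes the hypothetical layout sketched above, where model responses
    are the messages with role == "assistant".
    """
    return {
        message["source_id"]: message["text"]
        for message in turn["messages"]
        if message["role"] == "assistant"
    }

# e.g. {"model_a": "Response from model A ...", "model_b": "Response from model B ..."}
```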
Each model response is evaluated across multiple dimensions, which may include Helpfulness, Accuracy, Safety, Writing quality, and Task completion.
The message’s annotations include the ratings for each dimension; if the model response is rewritten, a rewrite annotation is added to the same list.
The rating dimensions are flexible and can be customized based on project requirements and objectives.
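As a minimal sketch of how such message-level annotations might look, the snippet below attaches per-dimension ratings and an optional rewrite to one model response; the annotation names and the rating scale are assumptions, not the exact delivery format.

```python
# Illustrative message-level annotations for one model response.
# Annotation names and rating scales are assumptions; real projects
# define their own dimensions and scales.
assistant_message = {
    "role": "assistant",
    "source_id": "model_a",
    "text": "Response from model A ...",
    "annotations": [
        {"name": "helpfulness", "rating": 4},
        {"name": "accuracy", "rating": 5},
        {"name": "safety", "rating": 5},
        # Present only when a human rewrote this response in the SFT stage.
        {"name": "rewrite", "text": "A human-improved version of the response ..."},
    ],
}
```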
The annotations at the turn level specify preference-related or aggregated information. Some common examples, used in the sketch that follows this list, are:
selected_model_id
likert_value
justification
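For illustration, the sketch below shows one way these turn-level fields could be consumed to build (chosen, rejected) preference pairs for preference-based fine-tuning (e.g., reward modeling). It assumes the same hypothetical turn layout as the earlier sketches rather than the platform’s exact schema.

```python
def to_preference_pairs(turn: dict) -> list[dict]:
    """Build (chosen, rejected) pairs from a turn's preference annotations.

    A rough sketch over the hypothetical layout used above; real pipelines
    depend on the project's actual delivery schema.
    """
    prompt = next(m["text"] for m in turn["messages"] if m["role"] == "user")
    responses = {
        m["source_id"]: m["text"]
        for m in turn["messages"]
        if m["role"] == "assistant"
    }
    preference = turn["annotations"]
    chosen_id = preference["selected_model_id"]
    return [
        {
            "prompt": prompt,
            "chosen": responses[chosen_id],
            "rejected": rejected_text,
            "strength": preference.get("likert_value"),
            "rationale": preference.get("justification"),
        }
        for source_id, rejected_text in responses.items()
        if source_id != chosen_id
    ]
```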
This is a sample expanded RLHF task output returned by /v2/task.