Preference model training
Preference modeling is the second stage in the RLHF process, serving as the bridge between pretraining the language model and fine-tuning it. In this phase, we essentially create a scoring function that evaluates the quality of text generated by the model according to human preferences.
To gather relevant data for this process, we rely on a set of prompt-generation pairs. These pairs are created by choosing prompts from a specific dataset and then generating responses using the language model. Human feedback providers then step in to rank these generated responses, creating a rich pool of data that reflects human preferences across various types of prompts and responses.
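To make the shape of this data concrete, a comparison record can be as simple as a prompt, its candidate completions, and a human-supplied ranking. The sketch below is purely illustrative; `prompt_dataset` and `model.generate` are assumed interfaces, not Haptic's actual collection pipeline.

```python
# A minimal sketch (not Haptic's actual pipeline) of assembling prompt-generation
# pairs for human ranking. `prompt_dataset` and `model.generate` are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Comparison:
    prompt: str
    completions: list[str]                             # candidate responses from the language model
    ranking: list[int] = field(default_factory=list)   # filled in later by human labellers,
                                                        # e.g. [1, 0] means the second completion won

def build_comparisons(prompt_dataset, model, n_candidates=2):
    """Sample prompts and generate candidate completions for human feedback providers to rank."""
    return [
        Comparison(prompt=p, completions=[model.generate(p) for _ in range(n_candidates)])
        for p in prompt_dataset
    ]
```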
However, as human preferences can vary greatly from one individual to another, we use a ranking system instead of absolute scores. This allows us to capture the relative preference between different responses, which helps to reduce the variance and subjectivity in the feedback data.
The final preference model, which is a product of this meticulous process, is then used as a reward function in the fine-tuning phase. This reward function acts as a guide to the model, steering it towards generating responses that align more closely with human preferences. In other words, the preference model serves as the model's "conscience", guiding it to produce outputs that are more likely to resonate with human users.
Our contributions to RLHF occur largely in this step of the process: building a preference model that aligns with human preferences and avoids the over-fitting that can adversely affect the LLM. The aim was simple: develop a function that evaluates a text sequence and produces a scalar reward representing human preference. We've tested system designs ranging from end-to-end language models to modular systems that rank outputs and then convert the ranking into a numerical reward. A scalar reward is essential for existing algorithms and methodologies to blend smoothly into the RLHF process. We plan to disclose and open-source our scoring methodology over time.
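As a rough illustration, the sketch below shows one common way such a scalar reward function can be implemented: a transformer backbone with a linear value head that scores a sequence from the hidden state of its final non-padding token. The backbone interface (exposing `last_hidden_state`) and all names here are assumptions for the example, not our released scoring methodology.

```python
# Illustrative sketch of a scalar reward model: a linear value head on top of a
# transformer backbone that is assumed to return `last_hidden_state`.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                     # any LM encoder returning hidden states
        self.value_head = nn.Linear(hidden_size, 1)  # maps a hidden state to a scalar score

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score each sequence from the hidden state of its last non-padding token.
        last_idx = attention_mask.long().sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # shape: (batch,), one reward per sequence
```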
There are several ways to model this reward, but for Haptic, two setups are particularly relevant:
A model built from scratch based on preference data
A fine-tuned language model
Anthropic, for example, initializes its preference models from the pretrained language model and applies an additional preference model pretraining (PMP) stage, which it found more sample-efficient than fine-tuning alone. However, the research community hasn't declared a clear winner among reward-model designs.
The training dataset of prompt-generation pairs is created by selecting prompts from a predefined dataset. Anthropic sources its prompts largely through a chat tool on Amazon Mechanical Turk, while OpenAI uses prompts submitted by users to the GPT API. Human feedback providers then rank the text outputs generated by the LLMs. Because human feedback is subjective and scalar scoring differs across providers, rankings are used to compare multiple model outputs and build a more robust dataset.
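A common way to turn these rankings into a training signal, and a hedged sketch of the kind of objective involved, is a pairwise loss that pushes the reward of the human-preferred completion above the rejected one. `reward_model` below is assumed to return one scalar per sequence, as in the earlier sketch.

```python
# Pairwise (Bradley-Terry style) preference loss commonly used to train reward
# models from ranked comparisons; whether it matches Haptic's exact objective
# is an assumption of this sketch.
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Push the reward of the human-preferred completion above the rejected one."""
    r_chosen = reward_model(chosen_ids, chosen_mask)        # shape: (batch,)
    r_rejected = reward_model(rejected_ids, rejected_mask)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimised when preferred responses score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```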
There are several ways to rank the text. One method has users compare the text generated by two language models from the same prompt. By pitting model outputs against each other head-to-head, we can use an Elo system (similar to the one used to rate chess players) to rank the models and their outputs. Haptic will employ different ranking methods for different users in a randomised way, and the aggregated results will be converted into a scalar reward signal for training.
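For reference, a standard Elo update for one head-to-head comparison looks like the sketch below; the K-factor and starting ratings are conventional chess defaults, not values Haptic has published.

```python
# Standard Elo rating update for a single head-to-head comparison between two outputs.
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if output A was preferred, 0.0 if B was, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two outputs start at 1000 and A is preferred.
# elo_update(1000, 1000, 1.0) -> (1016.0, 984.0)
```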