Fine-tuning
Until a few years ago, many believed it was impossible to train a language model with reinforcement learning. Over the past six years, however, teams have made it a reality by adjusting the parameters of the initial LLM with Proximal Policy Optimization (PPO), a policy-gradient RL algorithm.
Some LLM parameters are frozen, because updating a full model with billions of parameters is costly and impractical given the compute it demands. How many parameters to freeze is left to the LLM developers, since we operate as an external module.
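As an illustration, freezing parameters might look like the short sketch below. It assumes a GPT-2-style model loaded with Hugging Face `transformers`; the model name and the number of trainable blocks are arbitrary choices for the example, not a recommendation.

```python
# Minimal sketch: freeze most of a causal LM before RL fine-tuning,
# leaving only the last two transformer blocks trainable.
# The model name and block count are illustrative assumptions.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two transformer blocks.
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")
```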
The language model accepts a prompt and produces a sequence of text. In RL terms, the model's action space is the set of all tokens in the language model's vocabulary, and the observation space is the distribution of possible input token sequences, which is vast given the length of typical prompts (on the order of the vocabulary size raised to the number of input tokens).
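To make that formulation concrete, the sketch below treats a single decoding step as one RL step: the observation is the token sequence so far, and the action is a token sampled from the policy's distribution over the vocabulary. The GPT-2 model and the prompt are stand-ins chosen purely for illustration.

```python
# Minimal sketch of the RL view of decoding: the observation is the
# token sequence so far, and the action space is the model's vocabulary.
# Assumes a GPT-2-style policy from Hugging Face `transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

observation = tokenizer("The reward model prefers", return_tensors="pt").input_ids

with torch.no_grad():
    logits = policy(observation).logits[:, -1, :]     # scores over the vocabulary
probs = torch.softmax(logits, dim=-1)                 # action distribution
action = torch.multinomial(probs, num_samples=1)      # sample one token (the action)

print("Action space size (vocabulary):", probs.shape[-1])
print("Sampled action:", tokenizer.decode(action[0]))
```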
The reward function is where the system ties all of these models together into a single RLHF process. Given a prompt A, the current iteration of the fine-tuned policy generates a text B. That text, combined with the original prompt, is sent to the preference model, which returns a scalar score of how preferable the output is. In addition, the RL policy's per-token probability distributions are compared with those of the initial model to compute a penalty on the difference between them.
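A rough sketch of this reward computation follows, assuming we already have the preference model's scalar score and per-token log-probabilities from both the current policy and the frozen initial model; the helper name and the penalty coefficient `beta` are illustrative, not part of any fixed API.

```python
import torch

def rlhf_reward(preference_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                initial_logprobs: torch.Tensor,
                beta: float = 0.05) -> torch.Tensor:
    """Combine the preference-model score with a divergence penalty.

    preference_score: scalar score from the preference model for (prompt, response).
    policy_logprobs:  log-probabilities of the generated tokens under the RL policy.
    initial_logprobs: log-probabilities of the same tokens under the frozen initial model.
    beta:             penalty coefficient (illustrative value; tuned in practice).
    """
    # Per-token log-ratio between the policy and the initial model; summing it
    # gives a simple estimate of how far the policy has drifted.
    divergence_penalty = (policy_logprobs - initial_logprobs).sum()
    # The reward fed to the RL update: preference score minus the drift penalty.
    return preference_score - beta * divergence_penalty
```

Implementations differ on whether the penalty is applied per token or summed over the whole response, but the overall shape of the computation is the same.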
Most academic papers compute this penalty as the Kullback–Leibler (KL) divergence between the sequences of per-token distributions. The penalty keeps the policy from veering too far from the initial pre-trained model in each batch, which helps ensure the model keeps generating coherent text snippets related to the original prompt. However, we use a newer method developed by the OpenAI team, built around a novel objective function, defined below:
$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\varepsilon,\ 1+\varepsilon\right)\hat{A}_t\right)\right]
$$

Where,
$\theta$ - the policy parameter
$\hat{\mathbb{E}}_t$ - the empirical expectation over timesteps
$r_t(\theta)$ - the ratio of the probability under the new and old policies
$\hat{A}_t$ - the estimated advantage at time $t$
$\varepsilon$ - a hyperparameter, usually set to 0.1 or 0.2
This objective function lets us perform a Trust Region update in a way that is compatible with Stochastic Gradient Descent. It also simplifies the algorithm by removing the KL penalty term and the need for adaptive updates, which makes the implementation more straightforward.
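A minimal implementation of this clipped objective, written here in PyTorch purely for illustration, might look as follows; the tensor names are assumptions about how log-probabilities and advantages are stored.

```python
import torch

def ppo_clipped_objective(new_logprobs: torch.Tensor,
                          old_logprobs: torch.Tensor,
                          advantages: torch.Tensor,
                          epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective, averaged over timesteps.

    new_logprobs: log-probabilities of the sampled tokens under the current policy.
    old_logprobs: log-probabilities of the same tokens under the policy that generated them.
    advantages:   estimated advantages for the same timesteps.
    epsilon:      clipping hyperparameter, typically 0.1 or 0.2.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The empirical expectation over timesteps becomes a mean over the batch.
    return torch.min(unclipped, clipped).mean()
```

During training, one maximizes this objective (equivalently, minimizes its negative) with an ordinary SGD-style optimizer, which is what makes the trust-region-like update compatible with standard gradient-based training.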
Some of the AI projects that have been fine-tuned using Proximal Policy Optimization (PPO) include OpenAI's GPT-2 and GPT-3, DeepMind's Gopher, and Anthropic's models. After PPO fine-tuning, these models showed significant improvements in their ability to generate coherent, contextually appropriate responses. Our goal is to make those improvements accessible to all LLM developers and teams.