Pretraining
Pretraining is the first step in Reinforcement Learning from Human Feedback (RLHF) and lays the foundation for the rest of the process. It involves training a language model on a large corpus of text, usually scraped from the internet, using standard techniques such as next-token prediction. This stage equips the model with a broad understanding of language structure, syntax, and basic world knowledge.
During pretraining, the model learns to generate coherent and contextually appropriate text by predicting the next token based on the preceding ones. This allows it to produce text that is syntactically correct and semantically relevant, and the scale and diversity of the pretraining data expose it to a wide range of topics, styles, and contexts, making it more versatile and adaptable.
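To make the next-token-prediction objective concrete, here is a minimal sketch of one training step, assuming PyTorch. The tiny recurrent model, vocabulary size, and random token batch below are illustrative stand-ins for a full transformer and a real tokenised corpus, not a production setup.

```python
# Minimal sketch of next-token prediction (causal LM) training in PyTorch.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 4

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer stack
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the vocabulary at each position

model = TinyCausalLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # placeholder token ids

optimizer.zero_grad()
logits = model(tokens[:, :-1])   # predictions for positions 1..seq_len-1
targets = tokens[:, 1:]          # each target is simply the "next token"
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
optimizer.step()
```

Repeated over a very large corpus, this single objective is what gives the pretrained model its broad grasp of language.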
However, pretraining alone is not sufficient to ensure that the model's outputs align with human values or specific task requirements. This is where RLHF comes in. The pretrained model serves as the base on which fine-tuning with human feedback is performed, allowing the model to refine its understanding and generation of text to better meet the needs and preferences of users. In this sense, pretraining provides the strong starting point from which RLHF can build.
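One common way the human feedback is used, for example in InstructGPT-style pipelines, is to train a reward model on pairs of responses ranked by human reviewers. The sketch below, assuming PyTorch, shows the pairwise (Bradley-Terry style) loss; the reward values are hypothetical placeholders for scores a scalar-head reward model would assign to the preferred and rejected responses.

```python
# Hedged sketch: pairwise preference loss for a reward model (Bradley-Terry style).
# `chosen_reward` / `rejected_reward` stand in for scalar scores a reward model
# would give the human-preferred and human-rejected responses to the same prompt.
import torch
import torch.nn.functional as F

chosen_reward = torch.tensor([1.3, 0.2, 0.9])     # placeholder scores, shape (batch,)
rejected_reward = torch.tensor([0.4, 0.5, -0.1])

# Push the preferred response to outscore the rejected one.
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The resulting reward model is then used to fine-tune the pretrained policy with a reinforcement-learning algorithm, which is what turns the generic pretrained model into one aligned with reviewer preferences.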
This pretraining-then-fine-tuning setup underlies virtually every LLM we interact with today.
DeepMind has documented using models up to its 280-billion-parameter Gopher, with RLHF further utilised to improve their capabilities.
Anthropic used transformer models ranging from 10 million to 52 billion parameters for this task. It generated its initial language model for reinforcement learning by distilling an original LM on context clues for its “helpful, honest, and harmless” criteria.
OpenAI used a smaller version of GPT-3 for its first popular RLHF model, InstructGPT, fine-tuning it on human-generated text that human reviewers judged “preferable”.
The performance of these models also depends heavily on the type of training data they are built on. A model trained to disseminate subject-specific information performs well when utilised on the limited repository of questions associated with that subject, and such limited models are also computationally much cheaper to train and retrain.
All of the companies mentioned above likely use much larger models in their most recent RLHF-powered products. Evidently, the core requirement for starting the RLHF process is a model that responds well to diverse instructions. In general, there is no clear answer as to which model is the best starting point for RLHF. Thus, in our implementation at Haptic, we remain model-agnostic, supporting models such as Claude, ChatGPT, Grok, Bard, Monai, and other popular LLMs that receive interest from the community.
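As a rough illustration of what staying model-agnostic can mean in practice, the sketch below defines a thin provider-neutral interface that downstream code depends on. The class and method names (ChatModel, complete, and the adapter classes) are hypothetical and do not describe Haptic's actual API or any provider SDK.

```python
# Hypothetical sketch of a provider-agnostic LLM interface; all names are illustrative.
from abc import ABC, abstractmethod


class ChatModel(ABC):
    """Minimal contract every backing model adapter must satisfy."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        ...


class ClaudeAdapter(ChatModel):
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        # A real adapter would call the provider's SDK here.
        raise NotImplementedError


class ChatGPTAdapter(ChatModel):
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        raise NotImplementedError


def answer(model: ChatModel, question: str) -> str:
    # Downstream code depends only on the ChatModel interface,
    # so the backing provider can be swapped without changing this function.
    return model.complete(question)
```

Because everything above the interface is provider-neutral, swapping the underlying model is a configuration change rather than a code change.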