Ongoing RLHF-adjacent research at Haptic

While RLHF has made strides, it's important to acknowledge its limitations. Operating in a human-centric problem domain means that the model will always be a work in progress, without a definitive finish line.

One of the main challenges is that human preference data is costly and non-deterministic to collect, since it relies on human workers outside of the training loop. This is where crypto-economics comes into play, helping us speed up the iteration loops: we can offer larger reward multipliers to the same set of users so that they return to train the next iteration of an LLM.
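As a rough illustration of that incentive, the sketch below shows one way a reward multiplier could scale with how many consecutive iteration loops an annotator has returned for. The function name, step size and cap are hypothetical, not the production formula.

```python
def reward_multiplier(consecutive_loops: int,
                      base: float = 1.0,
                      step: float = 0.25,
                      cap: float = 3.0) -> float:
    """Hypothetical payout multiplier for returning annotators.

    Each consecutive iteration loop an annotator comes back for raises
    their multiplier by `step`, up to `cap`.
    """
    return min(base + step * consecutive_loops, cap)


# Example: an annotator returning for their 4th consecutive loop
# would earn 2.0x the base reward under these illustrative numbers.
print(reward_multiplier(4))  # 2.0
```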

Secondly, the performance of RLHF is directly tied to the quality of its human annotations, a problem we're continuously working on with our feedback scoring matrices. Human annotators often have differing opinions, which introduces significant variance into the training data. The scoring matrix helps us decide, retroactively, how much a user has contributed to the system.
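One simple way to quantify that retroactive contribution, sketched below under our own assumptions (the exact scoring matrix is not published here), is to weight each annotator by how often their preference labels agree with the consensus of the other annotators.

```python
import numpy as np

# Hypothetical preference labels: rows = annotators, columns = comparison items,
# entries are 0/1 choices between two candidate responses.
labels = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
])

def agreement_scores(labels: np.ndarray) -> np.ndarray:
    """Score each annotator by agreement with the majority vote of the others."""
    n_annotators = labels.shape[0]
    scores = np.zeros(n_annotators)
    for i in range(n_annotators):
        others = np.delete(labels, i, axis=0)
        majority = (others.mean(axis=0) >= 0.5).astype(int)
        scores[i] = (labels[i] == majority).mean()
    return scores

print(agreement_scores(labels))  # [0.8 0.6 0.6]
```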

Also, the datasets currently available for RLHF on general large language models are limited to one major dataset from Anthropic and a few smaller-scale, task-specific datasets, such as the summarization data from OpenAI. We're confident that within its first year of operation, Haptic can significantly increase this number and make the fine-tuning process affordable for academic researchers, decentralised LLM projects and hobbyists.

Despite these limitations, RLHF has vast potential for improvement. There are many unexplored design options that could help RLHF make substantial progress. Many of these involve improving the RL optimizer. While PPO is a relatively old algorithm, there's no reason why other algorithms couldn't provide benefits and variations to the existing RLHF workflow.

A large cost of RL fine-tuning is the repeated, expensive forward passes through a large model. Offline RL, used as the policy optimizer, could sidestep much of this by learning from logged data. Recently, new algorithms such as implicit language Q-learning (ILQL) have emerged that align well with this type of optimization. Other key trade-offs in the RL process, like the exploration-exploitation balance, are yet to be documented. Exploring these directions could lead to a better understanding of how RLHF works and potentially improve performance.
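To give a flavour of the offline setup, the sketch below fits a value function on a logged batch using the expectile regression loss found in IQL-style methods such as ILQL. Everything here (the toy tensors, the tau value) is illustrative; it is not the full ILQL algorithm.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric L2 loss used in IQL-style offline RL.

    `diff` is (Q target - predicted V); positive errors (the value is an
    under-estimate) are weighted by tau, negative errors by (1 - tau).
    """
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

# Toy logged batch: Q targets from a frozen critic and current V predictions.
q_target = torch.tensor([1.2, 0.4, -0.3, 0.9])
v_pred = torch.tensor([0.8, 0.5, 0.1, 0.7], requires_grad=True)

loss = expectile_loss(q_target - v_pred)
loss.backward()  # gradients flow through v_pred only; no environment rollouts needed
print(loss.item())
```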

We’re considering several strategies to optimize the RLHF system:

  1. Pre-training gradients: We propose collaborating with LLM projects to explore incorporating additional pre-training gradients into the PPO update rule for fine-tuning parameters (a minimal sketch of this combined objective follows this list).

  2. Meta-Learning: We could apply learning-to-learn techniques so that the RLHF system adapts more quickly to new tasks and feedback distributions, accelerating its learning process.

  3. Hierarchical RL: The learning problem could be broken down into a hierarchy of simpler problems, making the learning process more manageable and enabling more complex behaviours to be learned. This is particularly useful for complex tasks such as summarizing entire books.

  4. Transfer Learning: The RLHF system could be trained on one task and then have the learned knowledge transferred to another related task. This could speed up the learning process and improve performance on the new task.

  5. Inverse Reinforcement Learning: This approach involves learning the reward function directly from observed behaviour, which could be especially useful when direct feedback is limited or unavailable.
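As referenced in item 1, the sketch below shows one way pre-training gradients could be folded into the PPO update: mixing the usual PPO loss with a language-modelling loss on a batch of pre-training text, in the spirit of the PPO-ptx objective described in the InstructGPT paper. The coefficient and the stand-in scalars are placeholders, not our final training code.

```python
import torch

def combined_update(ppo_loss: torch.Tensor,
                    pretrain_lm_loss: torch.Tensor,
                    ptx_coef: float = 0.3) -> torch.Tensor:
    """Mix the PPO objective with a pre-training language-modelling loss.

    ppo_loss:          clipped policy-gradient loss on RLHF rollouts.
    pretrain_lm_loss:  cross-entropy on a batch of original pre-training text.
    ptx_coef:          hypothetical weighting; tuned per model in practice.
    """
    return ppo_loss + ptx_coef * pretrain_lm_loss

# Illustrative scalars standing in for real mini-batch losses.
loss = combined_update(torch.tensor(0.42), torch.tensor(2.1))
print(loss.item())  # 1.05
```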

In the future, RLHF could continue to evolve by iteratively updating the reward model and the policy. As the RL policy updates, users can continue to rank its outputs against the model's earlier versions. Most research papers have yet to discuss implementing this operation, as the mode of deployment needed to collect this type of data only works for dialogue agents with access to an engaged user base. Anthropic discusses this option as Iterated Online RLHF (see the original paper), where iterations of the policy are included in the Elo ranking system across models. This introduces the complex dynamics of the policy and reward model evolving together, which presents an intriguing, open research question.
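To illustrate how policy iterations could be folded into such a ranking, here is a minimal, standard Elo update between two model versions based on a single human preference outcome. The K-factor and starting ratings are generic defaults, not values from the Anthropic paper.

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update between two models.

    score_a is 1.0 if model A's output was preferred, 0.0 if model B's was,
    and 0.5 for a tie. Returns the updated ratings for A and B.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A newer policy iteration (A) beats an earlier checkpoint (B) in one comparison.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(round(new_a), round(new_b))  # 1516 1484
```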
