Training language models to follow instructions with human feedback (4 Mar 2022)

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

  • (DeepL) Making a language model larger does not necessarily mean that the user’s intentions will be better met. For example, large language models may produce output that is untruthful, toxic, or simply not useful to the user. In other words, these models are not aligned with the user.
    • These models are not aligned with their users
  • In this paper, we show a path to align language models with user intent across a wide range of tasks by fine-tuning them with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 with supervised learning. We then collect a dataset of rankings of the model’s outputs and use reinforcement learning from human feedback (RLHF) to further fine-tune this supervised model. The resulting model is called InstructGPT. (A rough sketch of these training objectives follows this list.)
  • In human evaluations on our prompt distribution, the output of the 1.3-billion-parameter InstructGPT model was preferred over the output of the 175B GPT-3, despite having 100 times fewer parameters. Furthermore, the InstructGPT models show improved truthfulness and reduced toxic output generation, with minimal performance regressions on public NLP datasets. Although InstructGPT still makes simple mistakes, these results suggest that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
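
A minimal, illustrative sketch of the two learned objectives this pipeline relies on: a pairwise ranking loss for the reward model trained on the labelers’ output rankings, and a KL-penalized reward objective for the RL fine-tuning stage. The helper names (rm, policy.sample, policy.logprob) and the coefficient beta are assumptions made for illustration, not the paper’s actual code; the paper uses GPT-3-based models and maximizes this kind of objective with PPO.

```python
# Illustrative sketch only: helper names and beta are assumptions,
# not InstructGPT's actual implementation.
import torch.nn.functional as F

def reward_model_loss(rm, prompt, preferred, rejected):
    """Pairwise ranking loss: the reward model should assign a higher
    scalar score to the labeler-preferred completion than to the rejected one."""
    r_pref = rm(prompt, preferred)   # scalar reward for the preferred output
    r_rej = rm(prompt, rejected)     # scalar reward for the rejected output
    return -F.logsigmoid(r_pref - r_rej).mean()

def rl_objective(policy, sft_policy, rm, prompt, beta=0.02):
    """Simplified RLHF objective: maximize the learned reward while staying
    close (in KL) to the supervised fine-tuned (SFT) model."""
    completion = policy.sample(prompt)
    reward = rm(prompt, completion)
    # Sample-based estimate of the KL penalty between the RL policy and the SFT model
    kl = policy.logprob(prompt, completion) - sft_policy.logprob(prompt, completion)
    return reward - beta * kl
```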

InstructGPT implementation description - Qiita


This page is auto-translated from /nishio/InstructGPT using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thoughts to non-Japanese readers.