How a 1.3B-parameter model trained with RLHF beats the 175B GPT-3, simply by asking humans "which answer is better?" This is the paper that created InstructGPT and paved the way for ChatGPT.
Making language models bigger doesn't make them better at following user intent. GPT-3 predicts the next token — but users want helpful, honest, and harmless answers.
The language modeling objective is misaligned. Training on "predict the next token on the internet" is fundamentally different from "follow the user's instructions helpfully and safely." The paper proposes using Reinforcement Learning from Human Feedback (RLHF) to bridge this gap — aligning GPT-3 to human preferences using just ~77K data points.
The paper's core contribution: a 3-step pipeline that transforms a base GPT-3 into InstructGPT. Each step builds on the previous one.
Steps 2 and 3 can be iterated continuously: collect more comparison data on the current best policy → train a new RM → train a new PPO policy. In practice, most comparison data comes from SFT policies, with some from PPO policies.
Labelers write demonstrations of the ideal model behavior for each prompt. GPT-3 is then fine-tuned on these human-written outputs.
Think of SFT like a student-teacher relationship. The labeler shows the model the correct answer — "Here's a prompt, and here's exactly what the perfect response looks like." The model learns by imitating. But imitating is limited — we need the model to develop judgment about what makes output good. That's what Steps 2 and 3 do.
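Concretely, SFT just maximizes the log-likelihood of the labeler-written response given the prompt, via teacher forcing. A minimal sketch of the per-token loss in plain Python with toy logits (the helper names are illustrative, not from the paper's code):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sft_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the demonstration tokens:
    the model is penalized whenever it assigns low probability to
    the token the human labeler actually wrote at that position."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= math.log(softmax(logits)[target])
    return total / len(target_ids)
```

The loss shrinks as the model puts more probability mass on the demonstrated tokens, which is exactly the "imitation" limit described above: SFT can only reproduce what labelers wrote, never judge between alternatives.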
| Split | Source | SFT prompts | RM prompts | PPO prompts |
|---|---|---|---|---|
| Train | Labeler-written | 11,295 | 6,623 | — |
| Train | Customer (API) | 1,430 | 26,584 | 31,144 |
| Validation | Labeler-written | 1,550 | 3,488 | — |
| Validation | Customer (API) | 103 | 14,399 | 16,185 |
| Total Train | — | 12,725 | 33,207 | 31,144 |
Instead of telling the model the right answer, we teach it to judge quality — by training a separate model to predict which output humans prefer.
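The reward model is trained with a pairwise ranking loss: labelers rank K responses per prompt, and each of the K-choose-2 pairs contributes -log σ(r_w − r_l), where r_w is the RM score of the preferred response. A self-contained sketch (function names are mine, not the paper's code):

```python
import math
from itertools import combinations

def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def rm_loss(scores, ranking):
    """Pairwise ranking loss over all K-choose-2 pairs.
    `scores` are scalar RM outputs for K responses to one prompt;
    `ranking` lists their indices from best to worst, so earlier
    entries should receive higher scores."""
    pairs = list(combinations(ranking, 2))
    total = 0.0
    for winner, loser in pairs:
        total += -log_sigmoid(scores[winner] - scores[loser])
    return total / len(pairs)
```

When the scores agree with the human ranking the loss is small; a tied pair costs log 2 and an inverted pair costs more, so gradient descent pushes preferred responses toward higher scores.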
The final step: use the reward model as a scoring function and fine-tune the SFT model to maximize that score using Proximal Policy Optimization.
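The reward PPO maximizes is not the raw RM score: the paper subtracts a per-token KL penalty against the frozen SFT policy so the model cannot drift into reward-hacked gibberish (PPO-ptx additionally mixes in pretraining gradients). A sketch of the penalized reward, with an illustrative `beta` rather than the paper's tuned value:

```python
def ppo_reward(rm_score, logp_rl, logp_sft, beta=0.02):
    """Reward fed to PPO (sketch): RM score minus a KL penalty
    that keeps the policy near the SFT model. logp_rl / logp_sft
    are the log-probabilities the RL policy and the frozen SFT
    model assign to the sampled response; their difference is a
    single-sample estimate of the KL divergence. beta=0.02 is
    illustrative, not the value from the paper."""
    kl_estimate = logp_rl - logp_sft
    return rm_score - beta * kl_estimate
```

If the policy assigns the same probability as the SFT model, the penalty vanishes; the more the policy diverges toward outputs the SFT model finds unlikely, the more of the RM score it forfeits.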
The quality of InstructGPT depends entirely on the humans who labeled the data. This section covers who they were, how they were selected, and how the evaluation was structured.
InstructGPT at 1.3B parameters is preferred by humans over base GPT-3 at 175B — a 100× smaller model wins by being aligned.
The paper includes cherry-picked prompts (but not cherry-picked outputs) that illustrate InstructGPT's capabilities and failure modes (Figures 8-9).
Sections 4.2, 5.2, and 5.5 of the paper examine truthfulness, toxicity, bias, and what "alignment" really means.
The paper is remarkably candid about what InstructGPT still gets wrong — and what research questions remain (Sections 5.3-5.5).
Making models better at following instructions also makes them easier to misuse for misinformation and harassment. Alignment techniques are not a panacea — they're one tool in a broader safety ecosystem. The paper warns against deployment in high-stakes domains: medical diagnosis, classifying people on protected characteristics, credit/employment/housing eligibility, political ads, and law enforcement. If models are open-sourced, limiting harmful applications becomes challenging without regulation.
Check your understanding of the InstructGPT paper with these questions.
✅ Alignment is cheaper than scale: RLHF costs <2% of pretraining compute yet outperforms a 100× increase in model size.
✅ 1.3B beats 175B: A small aligned model is preferred over a 100× larger unaligned model.
✅ 3 steps: SFT (demonstrate) → RM (rank) → PPO (optimize). Simple pipeline, powerful results.
✅ Generalization: InstructGPT follows instructions in unseen domains (code, non-English) despite minimal supervision.
⚠️ Not solved: Bias, false premises, harmful instruction following, and representativeness remain open challenges.
🔮 Legacy: This paper directly led to ChatGPT, arguably the most impactful AI product launch to date.