Ouyang, Wu, Jiang et al. · OpenAI · 2022

Training Language Models to Follow
Instructions with Human Feedback

How a 1.3B parameter model trained with RLHF beats a 175B GPT-3 — by simply asking humans "which answer is better?" This is the paper that created InstructGPT and paved the way for ChatGPT.

1.3B
Beats 175B GPT-3
40
Human Labelers
3
Step Pipeline
85%
Preferred over GPT-3
Chapter 01

The Alignment Problem

Making language models bigger doesn't make them better at following user intent. GPT-3 predicts the next token — but users want helpful, honest, and harmless answers.

🚫
Misaligned GPT-3
  • Makes up facts (hallucination rate ~41%)
  • Generates toxic or biased text
  • Doesn't follow instructions — just continues text
  • Objective: "predict next word" ≠ "be helpful"
  • Answering "why eat socks?" with fake philosophy
Aligned InstructGPT
  • Follows instructions reliably
  • Hallucinates ~50% less (21% vs 41%)
  • 25% less toxic output (when prompted respectfully)
  • Objective: "maximize human preference"
  • Politely says "that question has a false premise"
The Core Insight

The language modeling objective is misaligned. Training on "predict the next token on the internet" is fundamentally different from "follow the user's instructions helpfully and safely." The paper proposes using Reinforcement Learning from Human Feedback (RLHF) to bridge this gap — aligning GPT-3 to human preferences using just ~77K data points.

What does "Aligned" mean? (3H Framework from Askell et al.)
🤝
Helpful
Should help the user solve their task — follow instructions and infer intent.
🔍
Honest
Shouldn't fabricate info or mislead. Measured via TruthfulQA and hallucination rates.
🛡️
Harmless
Should not cause physical, psychological, or social harm. Measured via RealToxicityPrompts.
RLHF Research Lineage (Section 2 — Related Work)
Christiano et al. 2017
Deep RL from Human Preferences. First proposed using human feedback as a reward signal for RL agents.
Ziegler et al. 2019
Applied RLHF to stylistic text continuation. Showed human preferences can fine-tune language models.
Stiennon et al. 2020
Learning to Summarize from Human Feedback. Direct predecessor — same pipeline applied to summarization.
InstructGPT 2022
This paper. Scales RLHF to a broad distribution of real-world tasks, not just summarization. 3-step pipeline.
Also related: Askell et al. 2021 (language assistants for alignment), Gabriel 2020 (values & alignment philosophy), Wu et al. 2021 (book summarization with RLHF). FLAN and T0 use NLP task instructions but NOT human preferences.
Chapter 02

The 3-Step RLHF Pipeline

The paper's core contribution: a 3-step pipeline that transforms a base GPT-3 into InstructGPT. Each step builds on the previous one.

Supervised Fine-Tuning (SFT)
Humans write ideal answers → fine-tune GPT-3 on these demonstrations. ~13K prompts, 16 epochs.
Reward Model Training (RM)
Humans rank multiple model outputs → train a 6B model to predict which output is preferred. ~33K prompts.
PPO Reinforcement Learning
Use the reward model as a scoring function → fine-tune the SFT model using PPO to maximize reward. ~31K prompts.
API Prompt Distribution (Table 1 — Full Breakdown)
45.6%
Generation
12.4%
Open QA
11.2%
Brainstorming
8.4%
Chat
6.6%
Rewrite
4.2%
Summarization
3.5%
Classification
3.5%
Other
2.6%
Closed QA
1.9%
Extract
~57% of prompts are open-ended generation/brainstorming — tasks public NLP datasets don't cover well. Classification + QA = only ~18%.
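The headline shares in the Table 1 breakdown above can be verified with a few lines of arithmetic; the figures below are copied directly from the cards:

```python
# Use-case shares from the Table 1 breakdown above (percent of API prompts).
table_1 = {
    "Generation": 45.6, "Open QA": 12.4, "Brainstorming": 11.2,
    "Chat": 8.4, "Rewrite": 6.6, "Summarization": 4.2,
    "Classification": 3.5, "Other": 3.5, "Closed QA": 2.6, "Extract": 1.9,
}

# Open-ended generation + brainstorming: the "~57%" figure.
open_ended = table_1["Generation"] + table_1["Brainstorming"]

# Classification + open/closed QA: the "~18%" figure.
classic_nlp = table_1["Classification"] + table_1["Open QA"] + table_1["Closed QA"]

# The shares sum to ~100% (99.9 due to rounding in the table).
total = sum(table_1.values())
```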
Key Design Insight

Steps 2 and 3 can be iterated continuously: collect more comparison data on the current best policy → train a new RM → train a new PPO policy. In practice, most comparison data comes from SFT policies, with some from PPO policies.

Chapter 03

Step 1: Supervised Fine-Tuning

Labelers write demonstrations of the ideal model behavior for each prompt. GPT-3 is then fine-tuned on these human-written outputs.

Training Details
  • Dataset: ~13,000 prompts (API + labeler-written)
  • Epochs: 16 (with cosine LR decay)
  • Dropout: Residual dropout of 0.2
  • Selection: Best checkpoint chosen by RM score on validation
  • Overfitting: Validation loss overfits after 1 epoch, but more training helps human preference
Prompt Types
  • Plain: Labelers write arbitrary tasks with diversity
  • Few-shot: Instruction + multiple query/response pairs
  • User-based: Prompts inspired by real API use cases from waitlist
  • Language: 96%+ English
  • Bootstrap: Labeler-written prompts seeded the process, since the early GPT-3 API rarely received instruction-style prompts
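The cosine LR decay mentioned in the training details has a simple closed form; the sketch below shows only the schedule's shape — the function name is illustrative, and the paper sets the actual peak LR and step counts per model size:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine learning-rate decay from lr_max down to lr_min.

    Illustrative sketch of the schedule shape only; not the paper's
    tuned hyperparameters.
    """
    t = min(step, total_steps) / total_steps          # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At step 0 this returns the peak rate; halfway through training it has fallen to half the peak, and it decays smoothly to `lr_min` by the final step.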
Analogy: Teaching by Example

Think of SFT like a student-teacher relationship. The labeler shows the model the correct answer — "Here's a prompt, and here's exactly what the perfect response looks like." The model learns by imitating. But imitating is limited — we need the model to develop judgment about what makes output good. That's what Steps 2 and 3 do.

Dataset Sizes (Table 6 from Paper)
Split        Source            SFT      RM       PPO
Train        Labeler-written   11,295   6,623    —
Train        Customer (API)    1,430    26,584   31,144
Validation   Labeler-written   1,550    3,488    —
Validation   Customer (API)    103      14,399   16,185
Total train  (all sources)     12,725   33,207   31,144
SFT has more labeler prompts (bootstrapped with synthetic few-shot examples). RM training uses all C(K,2) pairs per prompt, so actual ranked pairs are an order of magnitude larger. Train/val/test split by user ID to prevent overlap.
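The "order of magnitude larger" claim follows directly from the combinatorics; a quick check, assuming K ranges over 4–9 as described in the reward-model chapter (the K=6 average used for the rough total is an assumption for illustration):

```python
from math import comb

# Pairwise comparisons produced by one prompt with K ranked outputs.
pairs = {K: comb(K, 2) for K in range(4, 10)}   # K=4 -> 6 ... K=9 -> 36

# Rough scale: ~33K RM prompts with, say, K=6 outputs on average
# yield about 33,000 * C(6,2) = 495,000 pairwise comparisons.
rough_total = 33_000 * pairs[6]
```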
Chapter 04

Step 2: Reward Model Training

Instead of telling the model the right answer, we teach it to judge quality — by training a separate model to predict which output humans prefer.

How It Works
  • Architecture: 6B parameter model (SFT with final layer removed)
  • Input: Prompt + response → outputs a scalar reward score
  • Training data: ~33K prompts with K=4 to K=9 ranked outputs each
  • Efficiency: Each set of K outputs → C(K,2) comparisons in one batch
  • Why 6B? 175B RM training was unstable; 6B saves compute and works well
Labeler Agreement
  • Training labelers: 72.6 ± 1.5% agreement rate
  • Held-out labelers: 77.3 ± 1.3% agreement rate
  • Baseline (Stiennon et al.): 73 ± 4% researcher agreement
  • RM accuracy: 72.4% on training labelers, 69.6% on held-out
  • Labelers: ~40 contractors via Upwork/Scale AI, selected by screening test
Reward Model Loss Function (Equation 1)
loss(θ) = −(1/C(K,2)) · E[x, yw, yl] [ log(σ(rθ(x, yw) − rθ(x, yl))) ]
rθ(x, y)
Reward
Scalar score the RM assigns to prompt x and response y
yw, yl
Winner/Loser
The preferred (winner) and non-preferred (loser) responses
σ
Sigmoid
Converts reward difference into a probability of preference
C(K,2)
Combinations
All pairwise comparisons from K ranked outputs per prompt
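Equation 1 above can be sketched as a plain-Python function. The name `rm_pairwise_loss` and its inputs are illustrative; a real implementation would score all C(K,2) pairs from one prompt in a single batch of reward-model forward passes:

```python
import math
from itertools import combinations

def rm_pairwise_loss(rewards_ranked):
    """Pairwise reward-model loss (shape of Equation 1) for one prompt.

    rewards_ranked: scalar rewards r_theta(x, y) for K outputs, ordered
    best-to-worst by the labeler's ranking.
    Returns the mean negative log-sigmoid over all C(K,2) pairs.
    """
    K = len(rewards_ranked)
    pairs = list(combinations(range(K), 2))   # (winner_idx, loser_idx)
    total = 0.0
    for w, l in pairs:
        diff = rewards_ranked[w] - rewards_ranked[l]   # r(x,yw) - r(x,yl)
        total += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid
    return total / len(pairs)
```

The loss is minimized when the reward of every winner exceeds that of every loser; identical rewards give the chance-level loss of log 2 per pair.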
Chapter 05

Step 3: PPO Reinforcement Learning

The final step: use the reward model as a scoring function and fine-tune the SFT model to maximize that score using Proximal Policy Optimization.

PPO-ptx Objective Function (Equation 2)
objective(ϕ) = E(x,y)~π_ϕ^RL [ rθ(x,y) − β·log( π_ϕ^RL(y|x) / π^SFT(y|x) ) ] + γ·Ex~D_pretrain [ log π_ϕ^RL(x) ]
rθ(x,y)
Reward
Score from the trained reward model
β·log(π/πSFT)
KL Penalty
Prevents the RL model from drifting too far from the SFT model
γ·E[log π(x)]
Pretraining Mix
PPO-ptx: mix in pretraining loss to prevent forgetting NLP tasks
PPO vs PPO-ptx
PPO: Pure RL fine-tuning (γ=0). Can regress on NLP benchmarks. PPO-ptx: Mixes pretraining gradients (γ>0) to avoid "alignment tax." This is the default InstructGPT model.
KL Penalty Explained
Without the KL penalty, the RL model would "hack" the reward model by finding degenerate outputs that trick it. The per-token KL divergence keeps the model close to the SFT baseline.
Alignment Tax
RLHF causes regression on SQuAD, DROP, HellaSwag, and WMT15 FR→EN. PPO-ptx mitigates this by preserving language modeling capabilities while aligning.
Training Compute
SFT 175B: 4.9 petaflop/s-days. PPO-ptx 175B: 60 petaflop/s-days. GPT-3 pretraining: 3,640. Alignment costs <2% of pretraining.
PPO Training Environment (Section 3.5)
Bandit Environment
The RL environment is a bandit: it presents a random customer prompt, expects a response, produces a reward from the RM, and ends the episode. No multi-turn interaction.
Value Function Init
The PPO value function is initialized from the reward model. This gives a warm start — the value function already understands output quality from RM training.
PPO-ptx vs KL Fix
Increasing the KL coefficient alone never fully recovers DROP/SQuAD performance and significantly decreases reward. Pretraining mix is strictly better than the KL-only fix (Figures 33-34).
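Equation 2 can be sketched per sample in a few lines. The function name and the default coefficients below are illustrative, not the paper's tuned settings, and a real implementation computes the KL term per token:

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain,
                      beta=0.02, gamma=1.0):
    """Per-sample PPO-ptx objective (shape of Equation 2). Illustrative.

    reward        : r_theta(x, y) from the reward model
    logp_rl       : log pi_RL(y | x), summed over response tokens
    logp_sft      : log pi_SFT(y | x), summed over the same tokens
    logp_pretrain : log pi_RL(x) on a sampled pretraining sequence
    beta, gamma   : KL and pretraining-mix coefficients (placeholders)
    """
    kl_penalty = beta * (logp_rl - logp_sft)   # grows as policy drifts from SFT
    return (reward - kl_penalty) + gamma * logp_pretrain
```

Setting `gamma=0` recovers plain PPO; the KL term shrinks the objective whenever the policy assigns its outputs much higher probability than the SFT baseline does, which discourages reward hacking.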
Chapter 06

Data & Human Labelers

The quality of InstructGPT depends entirely on the humans who labeled the data. This section covers who they were, how they were selected, and how the evaluation was structured.

Labeler Selection Process (Appendix B.1)
  • Team size: ~40 contractors via Upwork and Scale AI
  • Criterion 1: Agreement on sensitive speech flagging (75%+ threshold)
  • Criterion 2: Agreement on output rankings vs researcher labels
  • Criterion 3: Sensitive demonstration writing quality (6/7+ Likert score)
  • Criterion 4: Self-assessed ability to identify sensitive speech across cultures
  • Communication: Shared chat room for edge-case questions, detailed instructions
Labeler Demographics (Table 12)
  • Gender: 50% male, 44.4% female, 5.6% nonbinary/other
  • Ethnicity: 52.6% Southeast Asian, 31.6% White, 15.8% Latinx, 10.5% Black
  • Nationality: 22% Filipino, 22% Bangladeshi, 17% American, and others
  • Age: 75% under 35 years old
  • Language: ~96% of dataset classified as English (likely 99%+ in reality)
  • Satisfaction: Labelers enjoyed the task and found pay fair (19 respondents)
Evaluation Metadata Categories (Table 3)
Overall Quality
Likert 1-7
Fails Instruction
Binary
Hallucination
Binary
Satisfies Constraints
Binary
Inappropriate
Binary
Sexual Content
Binary
Violent Content
Binary
Harmful Advice
Binary
Labeling Priority Shift
During Training
Labelers prioritized helpfulness as the most important criterion (above truthfulness and harmlessness). This maximizes task completion quality.
During Final Evaluation
Labelers prioritized truthfulness and harmlessness over helpfulness. This reflects what the researchers "really care about" for deployment safety.
Data deduplication: prompts sharing a long common prefix are dropped, with a cap of 200 prompts per user ID. PII is filtered from the training split. Train/val/test are split by user ID.
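The deduplication note above can be sketched as a simple heuristic; `dedup_prompts`, the prefix length, and the exact rule are assumptions for illustration, not the paper's actual preprocessing code:

```python
def dedup_prompts(prompts, prefix_len=40, max_per_user=200):
    """Heuristic dedup (illustrative): drop prompts sharing a long common
    prefix with an already-kept prompt, and cap prompts kept per user ID.

    prompts: iterable of (user_id, text) pairs.
    """
    seen_prefixes = set()
    per_user = {}
    kept = []
    for user_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue                      # near-duplicate by shared prefix
        if per_user.get(user_id, 0) >= max_per_user:
            continue                      # user quota exhausted
        seen_prefixes.add(prefix)
        per_user[user_id] = per_user.get(user_id, 0) + 1
        kept.append((user_id, text))
    return kept
```

The per-user cap limits any single customer's influence on the training distribution, and the user-ID split keeps a user's prompts out of both train and validation.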
Chapter 07

Results & Evaluations

InstructGPT at 1.3B parameters is preferred by humans over base GPT-3 at 175B — a 100× smaller model wins by being aligned.

Preferred over GPT-3 175B
85%
175B InstructGPT vs 175B GPT-3
Hallucination Reduction
21%
vs 41% for GPT-3 (closed-domain)
TruthfulQA Improvement
~2×
Truthful+Informative answers vs GPT-3
Toxicity Reduction
25%
Less toxic when prompted respectfully
Win Rate Against SFT 175B (Figure 1)
Even at 1.3B, PPO-ptx outperforms 175B SFT. 175B PPO-ptx significantly preferred over all baselines.
Metadata Evaluations (Figure 4)
Follows Instructions
PPO models attempt the correct instruction ~95% of the time vs ~85% for GPT-3.
Follows Constraints
PPO models follow explicit constraints (e.g. "write in 2 paragraphs") ~80% of the time vs ~50% for GPT-3.
Appropriate Language
PPO outputs rated as appropriate for a customer assistant ~90% of the time.
FLAN & T0 Comparison
InstructGPT preferred over FLAN 78% of time, over T0 79%. NLP datasets alone can't align models.
NLP Benchmark Regressions (Section 4.2 — Alignment Tax Details)
PPO Performance Drops
Pure PPO significantly regresses on SQuAD v2, DROP, HellaSwag, and WMT 2015 FR→EN translation compared to the base GPT-3 model.
PPO-ptx Recovery
PPO-ptx mitigates regressions on all datasets, even surpasses GPT-3 on HellaSwag. Still lags behind on DROP, SQuADv2, and translation.
Why Not Just Increase KL?
Increasing the KL coefficient leads to significant decreases in validation reward and never fully recovers DROP and SQuAD. Pretraining mix is strictly better (Figures 33-34).
GPT-3 vs Prompt Distribution
Results hold even when evaluated on prompts submitted to GPT-3 models (not InstructGPT), though PPO-ptx performs slightly worse at larger model sizes on that distribution.
Chapter 08

Qualitative Examples

The paper includes cherry-picked prompts (but not cherry-picked outputs) that illustrate InstructGPT's capabilities and failure modes (Figures 8-9).

✅ Following Non-English Instructions (Figure 8)
Prompt: "Écrivez une courte histoire sur une grenouille qui voyage dans le temps jusqu'à la Grèce antique en français." ("Write a short story about a frog who travels back in time to ancient Greece, in French.")
GPT-3 175B
Just listed more writing prompts in French — didn't follow the instruction to write the story.
InstructGPT 175B
Actually wrote a coherent short story about a frog time-traveling to ancient Greece, entirely in French.
✅ Answering Questions About Code (Figure 8)
Prompt: "What is the purpose of the list C in the code below?" [binomial_coefficient function]
GPT-3 175B
Generated multiple-choice options (A/B/C/D) instead of explaining the code.
InstructGPT 175B
Explained that list C stores binomial coefficient values as the function iterates. (Not fully correct, but attempts the right task.)
🚫 False Premise Failure (Figure 9)
Prompt: "Why is it important to eat socks after meditating?"
GPT-3 175B
Generated absurd Q&A about "tasting enlightenment" and "smelly feet" — which at least reads as obvious nonsense.
InstructGPT 175B
Wrote a long paragraph treating the premise as plausible, citing fake theories. Goes along with the false premise.
⚠️ Over-Hedging (Figure 9)
Prompt: "What happens if you fire a cannonball directly at a pumpkin at high speeds?"
GPT-3 175B
Made up false physics: "A pumpkin is a strong magnet that can manipulate metal objects."
InstructGPT 175B
Hedged excessively: "no way to accurately predict" — instead of saying the pumpkin would likely explode on impact.
Why These Behaviors Emerge
Hedging: Labelers were instructed to reward epistemic humility → RM learned to prefer cautious responses → PPO optimizes for hedging. False premises: Very few prompts with false premises in training data → model doesn't generalize to rejecting them. Both behaviors could be reduced with adversarial data collection (Dinan et al., 2019).
Chapter 09

Safety, Bias & Truthfulness

Sections 4.2, 5.2, and 5.5 of the paper examine truthfulness, toxicity, bias, and what "alignment" really means.

Truthfulness (TruthfulQA)
PPO models generate truthful+informative answers ~2× as often as GPT-3. When given an "Instruction+QA" prompt instructing them to reply "I have no comment" when unsure, PPO models err on the side of honesty rather than confident falsehood.
Toxicity (RealToxicityPrompts)
With "respectful" instructions, InstructGPT generates 25% less toxic output. Without the prompt, no improvement. When explicitly prompted to be toxic, InstructGPT is MORE toxic than GPT-3.
Bias (Winogender/CrowS-Pairs)
InstructGPT does NOT improve over GPT-3 on bias benchmarks. When instructed to act respectfully, it actually shows higher bias — it becomes more certain, not less biased.
Who Are We Aligning To? (Section 5.2)
Labeler Preferences
40 contractors (mostly English-speaking, US/SE Asia). ~73% inter-annotator agreement. Not representative of all users.
Researcher Influence
OpenAI researchers wrote the labeling instructions, answered edge-case questions, and designed the evaluation framework.
API Customer Bias
Training prompts come from OpenAI API users — selected from a waitlist, biased toward OpenAI's network. Not representative of all AI users.
Generalization
Held-out labelers (not in training) prefer InstructGPT at similar rates — suggesting the model generalizes beyond its specific training labelers.
Chapter 10

Limitations & Open Questions

The paper is remarkably candid about what InstructGPT still gets wrong — and what research questions remain (Sections 5.3-5.5).

False Premises
InstructGPT can be confused by instructions with false premises and simply goes along with them (e.g., "why is it important to eat socks after meditating?").
Hedging
Sometimes gives long, hedging answers to simple questions. Likely because labelers rewarded epistemic humility, and the RM learned to prefer cautious responses.
Follows Harmful Instructions
In most cases, InstructGPT follows user instructions even if harmful. When prompted to be maximally biased, it's MORE toxic than GPT-3. The greatest limitation per the authors.
Not Representative
40 labelers ≠ all of humanity. Most comparisons labeled by only 1 contractor. When labelers disagree, aligning to the average may not be desirable.
Alignment Tax
RLHF causes performance drops on SQuAD, DROP, HellaSwag, and translation. PPO-ptx mitigates but doesn't fully solve this. Pretraining mix may also reintroduce undesirable behaviors.
Generalization
InstructGPT can follow code and non-English instructions despite minimal training data. But it sometimes responds in English even when prompted in another language.
Methodology Limits
Labeler value judgments are impacted by identity, cultural backgrounds, and personal history. Labelers are primarily English-speaking. Data is almost entirely English instructions.
Model Safety
Models still generate toxic/biased outputs, make up facts, and generate sexual/violent content without explicit prompting. Not fully aligned, not fully safe.
Open Questions (Section 5.4)
Adversarial Data Collection
Have labelers find worst-case model behaviors, then add those to training data. Could dramatically reduce false premise and hedging failures.
Combining with WebGPT
Could combine RLHF with methods that improve truthfulness (Nakano et al., 2021), or filter pretraining data for toxic content.
Better Feedback Methods
Labelers could edit model responses to make them better, or generate natural language critiques. Comparisons aren't necessarily the most efficient signal.
Configurable Safety
Different applications have different risk levels. What a model refuses should be configurable at inference time, not hardcoded during training.
Interface Design (HCI)
Vast design space for labeler feedback interfaces. This is an interesting human-computer interaction problem that could improve alignment signals.
Principle-Based Alignment
Gabriel (2020) advocates for fair principles that receive reflective endorsement despite variation in moral beliefs. Aligning to inferred intent is simpler but limited.
Broader Impacts (Section 5.5)

Making models better at following instructions also makes them easier to misuse for misinformation and harassment. Alignment techniques are not a panacea — they're one tool in a broader safety ecosystem. The paper warns against deployment in high-stakes domains: medical diagnosis, classifying people on protected characteristics, credit/employment/housing eligibility, political ads, and law enforcement. If models are open-sourced, limiting harmful applications becomes challenging without regulation.


Paper Summary

Key Takeaways

Alignment is cheaper than scale: RLHF costs <2% of pretraining compute but is more effective than 100× scale increase.

1.3B beats 175B: A small aligned model is preferred over a 100× larger unaligned model.

3 steps: SFT (demonstrate) → RM (rank) → PPO (optimize). Simple pipeline, powerful results.

Generalization: InstructGPT follows instructions in unseen domains (code, non-English) despite minimal supervision.

⚠️ Not solved: Bias, false premises, harmful instruction following, and representativeness remain open challenges.

🔮 Legacy: This paper directly led to ChatGPT — one of the most impactful AI product launches to date.