Ouyang, Wu, Jiang et al. · OpenAI · 2022

Training Language Models to Follow
Instructions with Human Feedback

How a 1.3B parameter model trained with RLHF beats a 175B GPT-3 — by simply asking humans "which answer is better?" This is the paper that created InstructGPT and paved the way for ChatGPT.

1.3B
Beats 175B GPT-3
40
Human Labelers
3
Step Pipeline
85%
Preferred over GPT-3
Chapter 01

The Alignment Problem

Making language models bigger doesn't make them better at following user intent. GPT-3 predicts the next token — but users want helpful, honest, and harmless answers.

🚫
Misaligned GPT-3
  • Makes up facts (hallucination rate ~41%)
  • Generates toxic or biased text
  • Doesn't follow instructions — just continues text
  • Objective: "predict next word" ≠ "be helpful"
  • Answering "why eat socks?" with fake philosophy
Aligned InstructGPT
  • Follows instructions reliably
  • Hallucinates ~50% less (21% vs 41%)
  • 25% less toxic output (when prompted respectfully)
  • Objective: "maximize human preference"
  • Politely says "that question has a false premise"
The Core Insight

The language modeling objective is misaligned. Training on "predict the next token on the internet" is fundamentally different from "follow the user's instructions helpfully and safely." The paper proposes using Reinforcement Learning from Human Feedback (RLHF) to bridge this gap — aligning GPT-3 to human preferences using just ~77K data points.

What does "Aligned" mean? (3H Framework from Askell et al.)
🤝
Helpful
Should help the user solve their task — follow instructions and infer intent.
🔍
Honest
Shouldn't fabricate info or mislead. Measured via TruthfulQA and hallucination rates.
🛡️
Harmless
Should not cause physical, psychological, or social harm. Measured via RealToxicityPrompts.
RLHF Research Lineage (Section 2 — Related Work)
Christiano et al. 2017
Deep RL from Human Preferences. First proposed using human feedback as a reward signal for RL agents.
Ziegler et al. 2019
Applied RLHF to stylistic text continuation. Showed human preferences can fine-tune language models.
Stiennon et al. 2020
Learning to Summarize from Human Feedback. Direct predecessor — same pipeline applied to summarization.
InstructGPT 2022
This paper. Scales RLHF to a broad distribution of real-world tasks, not just summarization. 3-step pipeline.
Also related: Askell et al. 2021 (language assistants for alignment), Gabriel 2020 (values & alignment philosophy), Wu et al. 2021 (book summarization with RLHF). FLAN and T0 use NLP task instructions but NOT human preferences.
Chapter 02

The 3-Step RLHF Pipeline

The paper's core contribution: a 3-step pipeline that transforms a base GPT-3 into InstructGPT. Each step builds on the previous one.

Supervised Fine-Tuning (SFT)
Humans write ideal answers → fine-tune GPT-3 on these demonstrations. ~13K prompts, 16 epochs.
Reward Model Training (RM)
Humans rank multiple model outputs → train a 6B model to predict which output is preferred. ~33K prompts.
PPO Reinforcement Learning
Use the reward model as a scoring function → fine-tune the SFT model using PPO to maximize reward. ~31K prompts.
API Prompt Distribution (Table 1 — Full Breakdown)
45.6%
Generation
12.4%
Open QA
11.2%
Brainstorming
8.4%
Chat
6.6%
Rewrite
4.2%
Summarization
3.5%
Classification
3.5%
Other
2.6%
Closed QA
1.9%
Extract
~57% of prompts are open-ended generation/brainstorming — tasks public NLP datasets don't cover well. Classification + QA = only ~18%.
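The headline shares in the Table 1 breakdown above can be verified with a few lines of arithmetic; the figures below are copied directly from the cards:

```python
# Use-case shares from the Table 1 breakdown above (percent of API prompts).
table_1 = {
    "Generation": 45.6, "Open QA": 12.4, "Brainstorming": 11.2,
    "Chat": 8.4, "Rewrite": 6.6, "Summarization": 4.2,
    "Classification": 3.5, "Other": 3.5, "Closed QA": 2.6, "Extract": 1.9,
}

# Open-ended generation + brainstorming: the "~57%" figure.
open_ended = table_1["Generation"] + table_1["Brainstorming"]

# Classification + open/closed QA: the "~18%" figure.
classic_nlp = table_1["Classification"] + table_1["Open QA"] + table_1["Closed QA"]

# The shares sum to ~100% (99.9 due to rounding in the table).
total = sum(table_1.values())
```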
Key Design Insight

Steps 2 and 3 can be iterated continuously: collect more comparison data on the current best policy → train a new RM → train a new PPO policy. In practice, most comparison data comes from SFT policies, with some from PPO policies.

Chapter 03

Step 1: Supervised Fine-Tuning

Labelers write demonstrations of the ideal model behavior for each prompt. GPT-3 is then fine-tuned on these human-written outputs.

Training Details
  • Dataset: ~13,000 prompts (API + labeler-written)
  • Epochs: 16 (with cosine LR decay)
  • Dropout: Residual dropout of 0.2
  • Selection: Best checkpoint chosen by RM score on validation
  • Overfitting: Validation loss overfits after 1 epoch, but more training helps human preference
Prompt Types
  • Plain: Labelers write arbitrary tasks with diversity
  • Few-shot: Instruction + multiple query/response pairs
  • User-based: Prompts inspired by real API use cases from waitlist
  • Language: 96%+ English
  • Bootstrap: Labeler-written prompts seeded the process, since the early GPT-3 API rarely received instruction-style prompts
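The cosine LR decay mentioned in the training details has a simple closed form; the sketch below shows only the schedule's shape — the function name is illustrative, and the paper sets the actual peak LR and step counts per model size:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine learning-rate decay from lr_max down to lr_min.

    Illustrative sketch of the schedule shape only; not the paper's
    tuned hyperparameters.
    """
    t = min(step, total_steps) / total_steps          # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At step 0 this returns the peak rate; halfway through training it has fallen to half the peak, and it decays smoothly to `lr_min` by the final step.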
Analogy: Teaching by Example

Think of SFT like a student-teacher relationship. The labeler shows the model the correct answer — "Here's a prompt, and here's exactly what the perfect response looks like." The model learns by imitating. But imitating is limited — we need the model to develop judgment about what makes output good. That's what Steps 2 and 3 do.

Dataset Sizes (Table 6 from Paper)
Split        Source            SFT      RM       PPO
Train        Labeler-written   11,295   6,623    —
Train        Customer (API)    1,430    26,584   31,144
Validation   Labeler-written   1,550    3,488    —
Validation   Customer (API)    103      14,399   16,185
Total train  (all sources)     12,725   33,207   31,144
SFT has more labeler prompts (bootstrapped with synthetic few-shot examples). RM training uses all C(K,2) pairs per prompt, so actual ranked pairs are an order of magnitude larger. Train/val/test split by user ID to prevent overlap.
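The "order of magnitude larger" claim follows directly from the combinatorics; a quick check, assuming K ranges over 4–9 as described in the reward-model chapter (the K=6 average used for the rough total is an assumption for illustration):

```python
from math import comb

# Pairwise comparisons produced by one prompt with K ranked outputs.
pairs = {K: comb(K, 2) for K in range(4, 10)}   # K=4 -> 6 ... K=9 -> 36

# Rough scale: ~33K RM prompts with, say, K=6 outputs on average
# yield about 33,000 * C(6,2) = 495,000 pairwise comparisons.
rough_total = 33_000 * pairs[6]
```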
Chapter 04

Step 2: Reward Model Training

Instead of telling the model the right answer, we teach it to judge quality — by training a separate model to predict which output humans prefer.

How It Works
  • Architecture: 6B parameter model (SFT with final layer removed)
  • Input: Prompt + response → outputs a scalar reward score
  • Training data: ~33K prompts with K=4 to K=9 ranked outputs each
  • Efficiency: Each set of K outputs → C(K,2) comparisons in one batch
  • Why 6B? 175B RM training was unstable; 6B saves compute and works well
Labeler Agreement
  • Training labelers: 72.6 ± 1.5% agreement rate
  • Held-out labelers: 77.3 ± 1.3% agreement rate
  • Baseline (Stiennon et al.): 73 ± 4% researcher agreement
  • RM accuracy: 72.4% on training labelers, 69.6% on held-out
  • Labelers: ~40 contractors via Upwork/Scale AI, selected by screening test
Reward Model Loss Function (Equation 1)
loss(θ) = −(1/C(K,2)) · E[x, yw, yl] [ log(σ(rθ(x, yw) − rθ(x, yl))) ]
rθ(x, y)
Reward
Scalar score the RM assigns to prompt x and response y
yw, yl
Winner/Loser
The preferred (winner) and non-preferred (loser) responses
σ
Sigmoid
Converts reward difference into a probability of preference
C(K,2)
Combinations
All pairwise comparisons from K ranked outputs per prompt
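Equation 1 above can be sketched as a plain-Python function. The name `rm_pairwise_loss` and its inputs are illustrative; a real implementation would score all C(K,2) pairs from one prompt in a single batch of reward-model forward passes:

```python
import math
from itertools import combinations

def rm_pairwise_loss(rewards_ranked):
    """Pairwise reward-model loss (shape of Equation 1) for one prompt.

    rewards_ranked: scalar rewards r_theta(x, y) for K outputs, ordered
    best-to-worst by the labeler's ranking.
    Returns the mean negative log-sigmoid over all C(K,2) pairs.
    """
    K = len(rewards_ranked)
    pairs = list(combinations(range(K), 2))   # (winner_idx, loser_idx)
    total = 0.0
    for w, l in pairs:
        diff = rewards_ranked[w] - rewards_ranked[l]   # r(x,yw) - r(x,yl)
        total += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid
    return total / len(pairs)
```

The loss is minimized when the reward of every winner exceeds that of every loser; identical rewards give the chance-level loss of log 2 per pair.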
Chapter 05

Step 3: PPO Reinforcement Learning

The final step: use the reward model as a scoring function and fine-tune the SFT model to maximize that score using Proximal Policy Optimization.

PPO-ptx Objective Function (Equation 2)
objective(ϕ) = E(x,y)~π_ϕ^RL [ rθ(x,y) − β·log( π_ϕ^RL(y|x) / π^SFT(y|x) ) ] + γ·Ex~D_pretrain [ log π_ϕ^RL(x) ]
rθ(x,y)
Reward
Score from the trained reward model
β·log(π/πSFT)
KL Penalty
Prevents the RL model from drifting too far from the SFT model
γ·E[log π(x)]
Pretraining Mix
PPO-ptx: mix in pretraining loss to prevent forgetting NLP tasks
PPO vs PPO-ptx
PPO: Pure RL fine-tuning (γ=0). Can regress on NLP benchmarks. PPO-ptx: Mixes pretraining gradients (γ>0) to avoid "alignment tax." This is the default InstructGPT model.
KL Penalty Explained
Without the KL penalty, the RL model would "hack" the reward model by finding degenerate outputs that trick it. The per-token KL divergence keeps the model close to the SFT baseline.
Alignment Tax
RLHF causes regression on SQuAD, DROP, HellaSwag, and WMT15 FR→EN. PPO-ptx mitigates this by preserving language modeling capabilities while aligning.
Training Compute
SFT 175B: 4.9 petaflop/s-days. PPO-ptx 175B: 60 petaflop/s-days. GPT-3 pretraining: 3,640. Alignment costs <2% of pretraining.
PPO Training Environment (Section 3.5)
Bandit Environment
The RL environment is a bandit: it presents a random customer prompt, expects a response, produces a reward from the RM, and ends the episode. No multi-turn interaction.
Value Function Init
The PPO value function is initialized from the reward model. This gives a warm start — the value function already understands output quality from RM training.
PPO-ptx vs KL Fix
Increasing the KL coefficient alone never fully recovers DROP/SQuAD performance and significantly decreases reward. Pretraining mix is strictly better than the KL-only fix (Figures 33-34).
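Equation 2 can be sketched per sample in a few lines. The function name and the default coefficients below are illustrative, not the paper's tuned settings, and a real implementation computes the KL term per token:

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain,
                      beta=0.02, gamma=1.0):
    """Per-sample PPO-ptx objective (shape of Equation 2). Illustrative.

    reward        : r_theta(x, y) from the reward model
    logp_rl       : log pi_RL(y | x), summed over response tokens
    logp_sft      : log pi_SFT(y | x), summed over the same tokens
    logp_pretrain : log pi_RL(x) on a sampled pretraining sequence
    beta, gamma   : KL and pretraining-mix coefficients (placeholders)
    """
    kl_penalty = beta * (logp_rl - logp_sft)   # grows as policy drifts from SFT
    return (reward - kl_penalty) + gamma * logp_pretrain
```

Setting `gamma=0` recovers plain PPO; the KL term shrinks the objective whenever the policy assigns its outputs much higher probability than the SFT baseline does, which discourages reward hacking.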
Chapter 06

Data & Human Labelers

The quality of InstructGPT depends entirely on the humans who labeled the data. This section covers who they were, how they were selected, and how the evaluation was structured.

Labeler Selection Process (Appendix B.1)
  • Team size: ~40 contractors via Upwork and Scale AI
  • Criterion 1: Agreement on sensitive speech flagging (75%+ threshold)
  • Criterion 2: Agreement on output rankings vs researcher labels
  • Criterion 3: Sensitive demonstration writing quality (6/7+ Likert score)
  • Criterion 4: Self-assessed ability to identify sensitive speech across cultures
  • Communication: Shared chat room for edge-case questions, detailed instructions
Labeler Demographics (Table 12)
  • Gender: 50% male, 44.4% female, 5.6% nonbinary/other
  • Ethnicity: 52.6% Southeast Asian, 31.6% White, 15.8% Latinx, 10.5% Black
  • Nationality: 22% Filipino, 22% Bangladeshi, 17% American, and others
  • Age: 75% under 35 years old
  • Language: ~96% of dataset classified as English (likely 99%+ in reality)
  • Satisfaction: Labelers enjoyed the task and found pay fair (19 respondents)
Evaluation Metadata Categories (Table 3)
Overall Quality
Likert 1-7
Fails Instruction
Binary
Hallucination
Binary
Satisfies Constraints
Binary
Inappropriate
Binary
Sexual Content
Binary
Violent Content
Binary
Harmful Advice
Binary
Labeling Priority Shift
During Training
Labelers prioritized helpfulness as the most important criterion (above truthfulness and harmlessness). This maximizes task completion quality.
During Final Evaluation
Labelers prioritized truthfulness and harmlessness over helpfulness. This reflects what the researchers "really care about" for deployment safety.
Data deduplication: prompts sharing a long common prefix are dropped, with a cap of 200 prompts per user ID. PII is filtered from the training split. Train/val/test are split by user ID.
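The deduplication note above can be sketched as a simple heuristic; `dedup_prompts`, the prefix length, and the exact rule are assumptions for illustration, not the paper's actual preprocessing code:

```python
def dedup_prompts(prompts, prefix_len=40, max_per_user=200):
    """Heuristic dedup (illustrative): drop prompts sharing a long common
    prefix with an already-kept prompt, and cap prompts kept per user ID.

    prompts: iterable of (user_id, text) pairs.
    """
    seen_prefixes = set()
    per_user = {}
    kept = []
    for user_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue                      # near-duplicate by shared prefix
        if per_user.get(user_id, 0) >= max_per_user:
            continue                      # user quota exhausted
        seen_prefixes.add(prefix)
        per_user[user_id] = per_user.get(user_id, 0) + 1
        kept.append((user_id, text))
    return kept
```

The per-user cap limits any single customer's influence on the training distribution, and the user-ID split keeps a user's prompts out of both train and validation.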
Chapter 07

Results & Evaluations

InstructGPT at 1.3B parameters is preferred by humans over base GPT-3 at 175B — a 100× smaller model wins by being aligned.

Preferred over GPT-3 175B
85%
175B InstructGPT vs 175B GPT-3
Hallucination Reduction
21%
vs 41% for GPT-3 (closed-domain)
TruthfulQA Improvement
~2×
Truthful+Informative answers vs GPT-3
Toxicity Reduction
25%
Less toxic when prompted respectfully
Win Rate Against SFT 175B (Figure 1)
Even at 1.3B, PPO-ptx outperforms 175B SFT. 175B PPO-ptx significantly preferred over all baselines.
Metadata Evaluations (Figure 4)
Follows Instructions
PPO models attempt the correct instruction ~95% of the time vs ~85% for GPT-3.
Follows Constraints
PPO models follow explicit constraints (e.g. "write in 2 paragraphs") ~80% of the time vs ~50% for GPT-3.
Appropriate Language
PPO outputs rated as appropriate for a customer assistant ~90% of the time.
FLAN & T0 Comparison
InstructGPT preferred over FLAN 78% of time, over T0 79%. NLP datasets alone can't align models.
NLP Benchmark Regressions (Section 4.2 — Alignment Tax Details)
PPO Performance Drops
Pure PPO significantly regresses on SQuAD v2, DROP, HellaSwag, and WMT 2015 FR→EN translation compared to the base GPT-3 model.
PPO-ptx Recovery
PPO-ptx mitigates regressions on all datasets, even surpasses GPT-3 on HellaSwag. Still lags behind on DROP, SQuADv2, and translation.
Why Not Just Increase KL?
Increasing the KL coefficient leads to significant decreases in validation reward and never fully recovers DROP and SQuAD. Pretraining mix is strictly better (Figures 33-34).
GPT-3 vs Prompt Distribution
Results hold even when evaluated on prompts submitted to GPT-3 models (not InstructGPT), though PPO-ptx performs slightly worse at larger model sizes on that distribution.
Chapter 08

Qualitative Examples

The paper includes cherry-picked prompts (but not cherry-picked outputs) that illustrate InstructGPT's capabilities and failure modes (Figures 8-9).

✅ Following Non-English Instructions (Figure 8)
Prompt: "Écrivez une courte histoire sur une grenouille qui voyage dans le temps jusqu'à la Grèce antique en français." ("Write a short story about a frog who travels back in time to ancient Greece, in French.")
GPT-3 175B
Just listed more writing prompts in French — didn't follow the instruction to write the story.
InstructGPT 175B
Actually wrote a coherent short story about a frog time-traveling to ancient Greece, entirely in French.
✅ Answering Questions About Code (Figure 8)
Prompt: "What is the purpose of the list C in the code below?" [binomial_coefficient function]
GPT-3 175B
Generated multiple-choice options (A/B/C/D) instead of explaining the code.
InstructGPT 175B
Explained that list C stores binomial coefficient values as the function iterates. (Not fully correct, but attempts the right task.)
🚫 False Premise Failure (Figure 9)
Prompt: "Why is it important to eat socks after meditating?"
GPT-3 175B
Generated absurd Q&A about "tasting enlightenment" and "smelly feet" — which at least reads as obvious nonsense.
InstructGPT 175B
Wrote a long paragraph treating the premise as plausible, citing fake theories. Goes along with the false premise.
⚠️ Over-Hedging (Figure 9)
Prompt: "What happens if you fire a cannonball directly at a pumpkin at high speeds?"
GPT-3 175B
Made up false physics: "A pumpkin is a strong magnet that can manipulate metal objects."
InstructGPT 175B
Hedged excessively: "no way to accurately predict" — instead of saying the pumpkin would likely explode on impact.
Why These Behaviors Emerge
Hedging: Labelers were instructed to reward epistemic humility → RM learned to prefer cautious responses → PPO optimizes for hedging. False premises: Very few prompts with false premises in training data → model doesn't generalize to rejecting them. Both behaviors could be reduced with adversarial data collection (Dinan et al., 2019).
Chapter 09

Safety, Bias & Truthfulness

Sections 4.2, 5.2, and 5.5 of the paper examine truthfulness, toxicity, bias, and what "alignment" really means.

Truthfulness (TruthfulQA)
PPO models generate truthful+informative answers ~2× as often as GPT-3. When given an "Instruction+QA" prompt instructing them to reply "I have no comment" when unsure, PPO models err on the side of honesty rather than confident falsehood.
Toxicity (RealToxicityPrompts)
With "respectful" instructions, InstructGPT generates 25% less toxic output. Without the prompt, no improvement. When explicitly prompted to be toxic, InstructGPT is MORE toxic than GPT-3.
Bias (Winogender/CrowS-Pairs)
InstructGPT does NOT improve over GPT-3 on bias benchmarks. When instructed to act respectfully, it actually shows higher bias — it becomes more certain, not less biased.
Who Are We Aligning To? (Section 5.2)
Labeler Preferences
40 contractors (mostly English-speaking, US/SE Asia). ~73% inter-annotator agreement. Not representative of all users.
Researcher Influence
OpenAI researchers wrote the labeling instructions, answered edge-case questions, and designed the evaluation framework.
API Customer Bias
Training prompts come from OpenAI API users — selected from a waitlist, biased toward OpenAI's network. Not representative of all AI users.
Generalization
Held-out labelers (not in training) prefer InstructGPT at similar rates — suggesting the model generalizes beyond its specific training labelers.
Chapter 10

Limitations & Open Questions

The paper is remarkably candid about what InstructGPT still gets wrong — and what research questions remain (Sections 5.3-5.5).

False Premises
InstructGPT can be confused by instructions with false premises and simply goes along with them (e.g., "why is it important to eat socks after meditating?").
Hedging
Sometimes gives long, hedging answers to simple questions. Likely because labelers rewarded epistemic humility, and the RM learned to prefer cautious responses.
Follows Harmful Instructions
In most cases, InstructGPT follows user instructions even if harmful. When prompted to be maximally biased, it's MORE toxic than GPT-3. The greatest limitation per the authors.
Not Representative
40 labelers ≠ all of humanity. Most comparisons labeled by only 1 contractor. When labelers disagree, aligning to the average may not be desirable.
Alignment Tax
RLHF causes performance drops on SQuAD, DROP, HellaSwag, and translation. PPO-ptx mitigates but doesn't fully solve this. Pretraining mix may also reintroduce undesirable behaviors.
Generalization
InstructGPT can follow code and non-English instructions despite minimal training data. But it sometimes responds in English even when prompted in another language.
Methodology Limits
Labeler value judgments are impacted by identity, cultural backgrounds, and personal history. Labelers are primarily English-speaking. Data is almost entirely English instructions.
Model Safety
Models still generate toxic/biased outputs, make up facts, and generate sexual/violent content without explicit prompting. Not fully aligned, not fully safe.
Open Questions (Section 5.4)
Adversarial Data Collection
Have labelers find worst-case model behaviors, then add those to training data. Could dramatically reduce false premise and hedging failures.
Combining with WebGPT
Could combine RLHF with methods that improve truthfulness (Nakano et al., 2021), or filter pretraining data for toxic content.
Better Feedback Methods
Labelers could edit model responses to make them better, or generate natural language critiques. Comparisons aren't necessarily the most efficient signal.
Configurable Safety
Different applications have different risk levels. What a model refuses should be configurable at inference time, not hardcoded during training.
Interface Design (HCI)
Vast design space for labeler feedback interfaces. This is an interesting human-computer interaction problem that could improve alignment signals.
Principle-Based Alignment
Gabriel (2020) advocates for fair principles that receive reflective endorsement despite variation in moral beliefs. Aligning to inferred intent is simpler but limited.
Broader Impacts (Section 5.5)

Making models better at following instructions also makes them easier to misuse for misinformation and harassment. Alignment techniques are not a panacea — they're one tool in a broader safety ecosystem. The paper warns against deployment in high-stakes domains: medical diagnosis, classifying people on protected characteristics, credit/employment/housing eligibility, political ads, and law enforcement. If models are open-sourced, limiting harmful applications becomes challenging without regulation.


Paper Summary

Key Takeaways

Alignment is cheaper than scale: RLHF costs <2% of pretraining compute but is more effective than 100× scale increase.

1.3B beats 175B: A small aligned model is preferred over a 100× larger unaligned model.

3 steps: SFT (demonstrate) → RM (rank) → PPO (optimize). Simple pipeline, powerful results.

Generalization: InstructGPT follows instructions in unseen domains (code, non-English) despite minimal supervision.

⚠️ Not solved: Bias, false premises, harmful instruction following, and representativeness remain open challenges.

🔮 Legacy: This paper directly led to ChatGPT — one of the most impactful AI product launches to date.