Interactive Paper Explainer

GPT-3
Language Models are Few-Shot Learners

A visual, step-by-step guide to the landmark paper that showed scaling language models to 175 billion parameters unlocks remarkable in-context learning — no fine-tuning needed.

175B
Parameters
300B
Training Tokens
42+
Benchmarks Tested
2020
Year Published
History

From GPT-1 to GPT-3

GPT-3 didn't appear from nowhere. Here's the evolution of generative pre-trained models.

2017
Transformer (Vaswani et al.)
Attention Is All You Need introduced the Transformer architecture — foundation for everything that follows.
2018
GPT-1 (117M params)
First GPT. Showed unsupervised pre-training + supervised fine-tuning works. Decoder-only Transformer.
2019
GPT-2 (1.5B params)
10× larger. Zero-shot task transfer. "Too dangerous to release." Showed scale matters.
2020
🚀 GPT-3 (175B params)
100× GPT-2. Few-shot learning without fine-tuning. Changed how we think about AI capabilities.
2022–now
ChatGPT, GPT-4, Claude...
GPT-3's in-context learning paradigm became the foundation for all modern AI assistants.
Key Insight

The fundamental breakthrough: scaling up language models massively improves their ability to learn tasks from just a few examples in the prompt — no gradient updates needed.

PARAMETER GROWTH
Chapter 01

The Problem with Fine-Tuning

Before GPT-3, the dominant approach was: pre-train a big model, then fine-tune on each task. But this had serious limitations.

😰
Fine-Tuning Problems
  • Need thousands of labeled examples per task
  • Separate model copy for every single task
  • Can overfit on narrow distributions
  • Doesn't match how humans learn (from few examples)
  • Expensive to retrain for each new application
🚀
GPT-3's Solution
  • Learn tasks from just a few examples in the prompt
  • One single model handles all tasks
  • No gradient updates or weight changes needed
  • Mimics human-like rapid task adaptation
  • Just describe the task in natural language
Analogy — Why Fine-Tuning is Like Hiring a Specialist
🔧 Fine-tuning = Hiring specialists
Need a translator? Hire a translator. Need a summarizer? Hire a summarizer. Each specialist needs training (data), salary (compute), and space (storage). 50 tasks = 50 specialists.
🧠 Few-shot = One genius polymath
Show the polymath 2-3 examples of any task, and they figure it out. One person, all tasks. That's GPT-3: show it examples in the prompt, and it adapts instantly.
Chapter 02

The Core Idea — In-Context Learning

GPT-3 learns tasks at inference time by conditioning on examples in the prompt. No training loop. No backpropagation. Just text.

Zero-Shot
No examples — just a task description
The model receives only a natural language instruction. "Translate English to French: cheese →"
Prompt: "Translate to French: cheese →"
Output: "fromage"
One-Shot
One example + task description
The model sees one (input, output) demonstration, then must apply the same pattern to a new input.
sea → mer
cheese → fromage
Few-Shot
K examples (typically 10–100) + task description
The model sees K demonstrations. Performance improves with more examples. This is GPT-3's sweet spot.
sea → mer
hello → bonjour
cat → chat
cheese → fromage ✓
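The three settings above differ only in how many demonstrations the prompt contains. A minimal sketch of building such prompts (the helper name and the `=>` separator are illustrative choices, not from the paper):

```python
def build_prompt(task_description, examples, query):
    """Assemble an in-context learning prompt.

    examples = []            -> zero-shot (task description only)
    one (src, tgt) pair      -> one-shot
    K pairs                  -> few-shot
    """
    lines = [task_description]
    for source, target in examples:      # the K demonstrations
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # the model completes after "=>"
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("sea", "mer"), ("hello", "bonjour"), ("cat", "chat")],
    "cheese",
)
```

No weights change between tasks; switching from translation to summarization is just a different prompt string.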
Chapter 03

GPT-3 Architecture

GPT-3 uses the same decoder-only Transformer as GPT-2 — but scaled to an unprecedented 175 billion parameters across 96 layers.

Architecture Overview
🧱
Decoder-Only
No encoder. Autoregressive. Predicts next token.
📚
96 Layers
Deep stack of Transformer decoder blocks.
🧠
12,288 dims
Model dimension d_model for the 175B model.
👁️
96 Heads
Multi-head attention with 96 parallel heads.
📏
2048 Context
Context window of 2048 tokens (n_ctx).
Sparse Attn
Alternating dense and locally banded sparse patterns.
All 8 GPT-3 Model Sizes — from Small to Full
Model | Parameters | Layers | d_model | Heads | d_head | Batch Size
GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M
GPT-3 Medium | 350M | 24 | 1024 | 16 | 64 | 0.5M
GPT-3 Large | 760M | 24 | 1536 | 16 | 96 | 0.5M
GPT-3 XL | 1.3B | 24 | 2048 | 24 | 128 | 1M
GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M
GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M
GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M
GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M
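The table's numbers line up with the standard parameter estimate for a decoder-only Transformer: roughly 12·n_layers·d_model² weights in the blocks, plus the embedding tables. A quick sanity check (the 50,257-entry vocabulary is the GPT-2 BPE vocabulary, which GPT-3 reuses; the exact breakdown is my assumption, not from the paper):

```python
def approx_params(n_layers, d_model, vocab=50257, n_ctx=2048):
    # Per block: ~4*d^2 for attention (Q, K, V, output projections)
    # plus ~8*d^2 for the MLP (two d x 4d matrices) = 12*d^2 total.
    blocks = 12 * n_layers * d_model ** 2
    # Token-embedding table plus learned position embeddings.
    embeddings = (vocab + n_ctx) * d_model
    return blocks + embeddings

full = approx_params(96, 12288)   # ~1.75e11 -> "175B"
small = approx_params(12, 768)    # ~1.25e8  -> "125M" (GPT-3 Small)
```

Both sanity checks land within about 1% of the names in the table.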
GPT-3 Autoregressive Language Model Objective
L(θ) = −Σᵢ log P(token_i | token_1, ..., token_{i−1}; θ)
L(θ)
Loss Function
Negative log-likelihood. Minimize this = maximize probability of correct next token.
P(tᵢ | t₁...tᵢ₋₁)
Conditional Probability
Probability of each token given all previous tokens. This is autoregressive — left-to-right only.
θ
Model Parameters
175 billion learnable weights. These encode all the model's knowledge during pre-training.
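Concretely, training pushes up the summed log-probability of each observed token given its left context. A toy illustration (the per-token probabilities below are made up, not model outputs):

```python
import math

def sequence_log_likelihood(token_probs):
    # token_probs[i] is the model's probability for the correct token i,
    # conditioned on tokens 0..i-1 (left-to-right only).
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a 3-token sequence
ll = sequence_log_likelihood([0.5, 0.25, 0.1])
loss = -ll   # negative log-likelihood: the quantity gradient descent minimizes
```

Minimizing the negative sum is exactly maximizing the probability of the training text, token by token.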
Chapter 04

Training Data & Details

GPT-3 was trained on a massive, carefully curated mixture of internet text totaling ~300 billion tokens.

Training Data Composition
Dataset Details
CommonCrawl (filtered)
410B tokens, 60% weight. Heavily filtered for quality from 45TB raw data.
WebText2
19B tokens, 22% weight. Reddit links with 3+ karma — higher quality.
Books1 & Books2
12B + 55B tokens, 8%+8% weight. Long-form text for coherent reasoning.
Wikipedia
3B tokens, 3% weight. High-quality factual knowledge.
Optimizer
Adam (β₁=0.9, β₂=0.95, ε=10⁻⁸). Gradient clipped at global norm 1.0. Weight decay = 0.1.
Learning Rate
Cosine decay to 10% over 260B tokens. Linear warmup for first 375M tokens. Batch size ramped linearly over first 4-12B tokens.
Hardware & Data
V100 GPU cluster (Microsoft). 93% English, 7% other languages. Data sampled without replacement. Trained for 300B tokens total.
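The warmup-then-cosine schedule above can be sketched as a function of tokens seen. The 0.6×10⁻⁴ peak rate is the paper's value for the 175B model; measuring the cosine phase from the end of warmup is my assumption about the exact shape:

```python
import math

def gpt3_lr(tokens_seen, peak_lr=0.6e-4, warmup=375e6, decay_end=260e9):
    if tokens_seen < warmup:
        # Linear warmup over the first 375M tokens
        return peak_lr * tokens_seen / warmup
    # Cosine decay from peak_lr down to 10% of peak by 260B tokens,
    # then held constant for the rest of training.
    progress = min((tokens_seen - warmup) / (decay_end - warmup), 1.0)
    return peak_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))
```

At 375M tokens the rate peaks; by 260B tokens it has decayed to 10% of peak and stays there through the final 40B tokens.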
CommonCrawl Filtering Pipeline (Appendix A)
1. Quality Filter
Trained a logistic regression classifier using WebText as positive examples and raw CommonCrawl as negative. Documents kept probabilistically (α=9 Pareto distribution).
2. Fuzzy Dedup
MinHashLSH with 10 hashes to remove documents with high overlap. WebText also removed from CommonCrawl. Reduced dataset ~10%.
3. Benchmark Removal
Attempted to remove overlaps with benchmark test/dev sets. A bug caused partial removal only — analyzed in Section 4 (Data Contamination).
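Step 1's probabilistic keep-rule can be sketched as follows. The classifier score is assumed to lie in [0, 1] with 1 = most WebText-like; since `np.random.pareto(α)` draws a Lomax(α) variate, the same draw is done here in pure Python by inverse-transform sampling:

```python
import random

def keep_document(quality_score, alpha=9):
    # Keep a document if a Pareto(alpha=9) draw exceeds 1 - score:
    # high-scoring docs are almost always kept, while low-scoring ones
    # still survive occasionally, preserving some corpus diversity.
    # Inverse-transform sample of Lomax(alpha): X = (1 - U)^(-1/alpha) - 1
    x = (1 - random.random()) ** (-1 / alpha) - 1
    return x > 1 - quality_score

random.seed(0)
kept_high = sum(keep_document(0.95) for _ in range(10_000))  # most survive
kept_low = sum(keep_document(0.05) for _ in range(10_000))   # few survive
```

The heavy Pareto tail is the point: filtering is soft, not a hard threshold.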
Chapter 05

Scaling Laws — Bigger is Better

One of the paper's most important findings: performance scales smoothly and predictably as model size increases across most tasks.

Model Size vs Performance
Larger models show dramatically better few-shot performance. This is the paper's central empirical result.
⚡ LAW 1: Prediction Before Explanation

Prediction: "If I make a model 10× bigger, will few-shot accuracy go up?" Yes. The paper shows near-linear improvement on log-scale for most benchmarks. This was surprising — many expected diminishing returns.

🔴 LAW 2: Failure Modes Over Features

Where does scaling fail? Some tasks showed minimal improvement even at 175B: natural language inference (e.g. ANLI) and certain reading comprehension benchmarks (e.g. DROP). Scale alone doesn't solve everything.

🟢 LAW 3: Compression — The Key Takeaway in One Sentence

"Scale the model big enough, and it can learn new tasks just from a few examples in the prompt — no fine-tuning, no gradient updates, no new parameters."
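Smooth scaling means error falls roughly as a power law in model size, which is a straight line in log-log space. A sketch of extracting the exponent by least-squares on logs; the (size, error) points below are hypothetical, for illustration only, and are not figures from the paper:

```python
import math

# Hypothetical (model size, few-shot error) points -- illustration only
sizes = [1.25e8, 1.3e9, 1.3e10, 1.75e11]
errors = [0.60, 0.45, 0.34, 0.25]

# err = a * N**slope  <=>  log err = log a + slope * log N
xs = [math.log(n) for n in sizes]
ys = [math.log(e) for e in errors]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = sum((x - mx) ** 2 for x in xs)
slope = num / den   # slope < 0: error falls smoothly as size grows
```

A negative, stable slope across orders of magnitude is what makes "train a 10× bigger model" a prediction rather than a gamble.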

Chapter 06

Benchmark Results

GPT-3 was tested on 42+ benchmarks across language modeling, QA, translation, common sense, reading comprehension, and more.

LAMBADA (Few-shot)
86.4%
Accuracy (prev SOTA: 68%)
CoQA (Few-shot)
85.0
F1 Score (human: 90.7)
TriviaQA (Few-shot)
71.2%
Accuracy (SOTA in closed-book)
PTB Perplexity (Zero-shot)
20.5
New SOTA (prev: 35.8)
Comprehensive Results (Table from Paper)
Task Category | Benchmark | Zero-shot | One-shot | Few-shot | Fine-tuned SOTA
LM | LAMBADA (acc) | 76.2% | 72.5% | 86.4% | 68.0%
LM | HellaSwag (acc) | 78.9% | 78.1% | 79.3% | 85.6%
LM | StoryCloze (acc) | 83.2% | 84.7% | 87.7% | 91.8%
QA | NaturalQS (acc) | 14.6% | 23.0% | 29.9% | 36.6%
QA | WebQS (acc) | 14.4% | 25.3% | 41.5% | 45.5%
QA | TriviaQA (acc) | 64.3% | 68.0% | 71.2% | 68.0%
Translation | EN→FR (BLEU) | 25.2 | 28.3 | 32.6 | 45.6
Translation | FR→EN (BLEU) | 21.2 | 33.7 | 39.2 | 35.0
Translation | DE→EN (BLEU) | 27.2 | 30.4 | 40.6 | 40.2
Winograd | Winograd (acc) | 88.3% | 89.7% | 88.6% | 90.1%
Winograd | Winogrande (acc) | 70.2% | 73.2% | 77.7% | 84.6%
Reasoning | PIQA (acc) | 80.5% | 80.5% | 82.8% | 79.4%
Reasoning | ARC-Challenge (acc) | 51.4% | 53.2% | 51.5% | 78.5%
Reasoning | OpenBookQA (acc) | 57.6% | 58.8% | 65.4% | 87.2%
Reading | CoQA (F1) | 81.5 | 84.0 | 85.0 | 90.7
Reading | DROP (F1) | 23.6 | 34.3 | 36.5 | 89.1
Reading | SQuAD 2.0 (F1) | 59.5 | 65.4 | 69.8 | 93.0
SuperGLUE | SuperGLUE (avg) | n/a | n/a | 71.8 | 89.0
NLI | ANLI R3 (acc) | 33.5% | 34.3% | 40.2% | n/a
Green = best in category (GPT-3 beats fine-tuned SOTA). Red = weak spots. References: Tables 3.1–3.9 in paper.
Synthetic & Qualitative Tasks (Section 3.9)
Arithmetic (Few-shot)
2-digit addition: 100%, 2-digit subtraction: 98.9%, 3-digit addition: 80.2%, 3-digit subtraction: 94.2%, 4-digit: 25-26%, 5-digit: 9-10%, 2-digit multiplication: 29.2%
SAT Analogies
Few-shot: 65.2%, One-shot: 59.1%, Zero-shot: 53.7%. Average college applicant score: 57%. Few-shot GPT-3 outperforms the average college applicant.
Word Scrambling (Few-shot)
Random insertion: 67.2%, Cycle letters: 37.9%, Anagrams (A2): 39.7%. Cannot reverse words (0.4%). BPE encoding makes character-level tasks harder.
Novel Word Usage & Grammar
GPT-3 correctly uses made-up words in sentences after seeing one definition. Also corrects English grammar from few examples with high accuracy.
Performance by Task Category (Few-shot)
Chapter 07

Text Generation Quality

GPT-3 can generate news articles that human evaluators struggle to distinguish from real human-written articles. The paper quantifies this with a rigorous human study.

Human Detection Accuracy by Model Size (Table 3.11)

80 US-based participants were shown ~200 word articles and asked to distinguish human-written vs model-generated. 50% = random chance.

Model | Mean Human Accuracy | 95% CI | "I don't know" Rate
Control (bad model) | 86% | 83%–90% | 3.6%
GPT-3 Small (125M) | 76% | 72%–80% | 4.9%
GPT-3 Medium (350M) | 61% | 58%–65% | 6.0%
GPT-3 XL (1.3B) | 62% | 59%–65% | 7.5%
GPT-3 6.7B | 60% | 56%–63% | 6.2%
GPT-3 13B | 55% | 52%–58% | 7.1%
GPT-3 175B | 52% | 49%–54% | 7.8%
At 175B, humans are at near-random chance (~52%) at detecting AI-generated text. p-value vs control: 1×10⁻³⁴
Methodology
25 article titles/subtitles from newser.com (~215 words). Completions generated from 4 model sizes. Articles formatted programmatically (no human cherry-picking). Same context window and prompts across all models.
Longer Articles (~500 words)
12 Reuters articles (~569 words) tested separately. Accuracy was still ~52% for GPT-3 175B, barely above chance. Common indicators: factual inaccuracies, repetition, and non-sequiturs.
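The significance claim can be illustrated with a pooled two-proportion z-test. The per-condition rating counts below are hypothetical (the exact counts aren't given here), and the paper used its own statistical test, so this is a sketch of the idea rather than a reproduction:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    # Pooled two-proportion z-statistic under the normal approximation
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: 1,000 ratings per condition at the observed accuracies
z = two_proportion_z(0.86, 1000, 0.52, 1000)
```

Even with modest sample sizes, the control-vs-175B gap (86% vs 52%) yields a z-statistic far beyond any conventional significance threshold.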
Novel Word Usage — Learning Words from Definitions
A "whatpu" is a small, furry animal native to Tanzania. An example: We were traveling in Africa and we saw these very cute whatpus. To "screeg" something is to swing a sword at it. An example: We screeghed at each other for several minutes and then went outside and ate ice cream.
Boldface = GPT-3 completions. The model invents plausible conjugations ("screeghed") from a single definition.
Chapter 08

Data Contamination Analysis

Section 4 of the paper addresses a critical concern: did GPT-3 memorize benchmark test sets from its training data?

Methodology (13-gram overlap)
Conservative Filtering
Any example with a 13-gram overlap with training data was flagged as "potentially contaminated." A "clean" subset was created for each benchmark by removing all flagged examples.
Key Finding
Although a quarter of benchmarks had >50% potential contamination, in most cases performance on clean vs. full datasets changed negligibly. A bug prevented full removal during training.
✅ No Effect
Most benchmarks: Reading comp source text found but not Q/A pairs. Translation: monolingual matches only, no paired sentences. Performance unchanged.
⚠️ Flagged
PIQA: 29% flagged, 3% drop on clean subset (*marked in paper). Winograd: some schemas found in training data, small effect on results.
❌ Removed
4 Wikipedia LM benchmarks and Children's Book Test were entirely contained in training data — results not reported.
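The 13-gram flagging rule amounts to a set-intersection check. A simplified sketch (this version lowercases and splits on whitespace; the paper's normalization also strips punctuation):

```python
def ngrams(text, n=13):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_example, train_ngrams, n=13):
    # Flag the example if any of its 13-grams also occurs in training data
    return bool(ngrams(eval_example, n) & train_ngrams)

train_ngrams = ngrams(
    "the quick brown fox jumps over the lazy dog while "
    "seven other foxes watch from a nearby grassy hill"
)
```

An eval example sharing any 13 consecutive words with training text gets flagged; the "clean" subset is everything that doesn't.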
Chapter 09

Limitations & Societal Impact

Sections 5–6 of the paper are remarkably honest about GPT-3's limitations and potential societal harms.

Technical Limitations (Section 5)
Repetition & Coherence
GPT-3 sometimes loses coherence in long texts, repeats itself, contradicts earlier statements, and occasionally includes non-sequitur paragraphs. No persistent memory.
Bidirectional Tasks
As a left-to-right autoregressive model, GPT-3 struggles on tasks requiring bidirectional context (fill-in-the-blank, comparison tasks like WiC and NLI).
Sample Efficiency
175B parameters trained on 300B tokens is still far less sample-efficient than human learning: GPT-3 sees vastly more text during pre-training than any human reads in a lifetime, yet humans learn language from far less data.
Interpretability
It's unclear what GPT-3 "knows" vs. what it's pattern-matching. The paper doesn't claim understanding — only performance on benchmarks.
Learning vs. Recognizing
An open question: does few-shot learning truly learn "from scratch" at inference time, or does it simply recognize tasks already seen during training? The paper acknowledges this spectrum.
No Grounding
GPT-3 lacks grounding in physical experience, video, or real-world interaction. Future directions: learning objectives from humans, RL fine-tuning, multimodal inputs.
Broader Impacts (Section 6)
6.1 — Misuse of Language Models
Potential Misuse
Misinformation, spam, phishing, fraudulent essays, social engineering. Quality of text synthesis directly increases misuse potential.
Threat Actors
Low/mid-skill actors showed interest post GPT-2 but no successful deployments observed. APTs have not yet found LMs significantly better than current methods.
Countermeasures
Automatic discriminators (GROVER, GLTR) may outperform humans at detection. Promising area for future research. Watermarking and detection tools needed.
6.2 — Fairness, Bias, and Representation
Gender Bias
83% of 388 occupations tested were more likely followed by a male gender identifier. High-education occupations (legislator, professor) skewed heavily male. Female identifiers associated more with appearance words ("beautiful", "gorgeous").
Race Sentiment
Sentiment analysis of generated text varied by race prompt. The paper found that models reflect socio-historical associations from training data. Asian consistently had the highest positive sentiment across model sizes.
Religion
Islam disproportionately co-occurred with words like "terrorism" and "violent." Buddhism associated with "peace" and "enlightenment." Models reflect internet-scale stereotypes from training data.
6.3 — Energy Usage
Training Cost
GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, vs. tens of petaflop/s-days for GPT-2 1.5B. Trained on V100 GPU cluster provided by Microsoft.
Amortized Efficiency
Once trained, generating 100 pages of content costs ~0.4 kW-hr (a few cents). Model distillation can further reduce cost. The paradigm: train one large model, then create efficient versions.
Paper Conclusion

Conclusion — Section 8

The paper's conclusion summarizes the key contributions and looks forward.

We presented GPT-3, a 175 billion parameter language model that demonstrates strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings — in some cases nearly matching the performance of state-of-the-art fine-tuned systems.

The paper documented roughly predictable trends of scaling in performance without using fine-tuning. It also discussed the social impacts of this class of model.

Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

Test Yourself

Quick Quiz

Check your understanding of the key concepts from the GPT-3 paper.

Reference

Key Takeaways

Everything you need to remember about this paper.

✅ Scaling model size improves few-shot performance smoothly across most tasks.
✅ In-context learning requires no gradient updates — tasks specified via text prompt.
✅ 175B parameters — 10× larger than any previous non-sparse language model.
✅ Few-shot GPT-3 is competitive with fine-tuned SOTA on many benchmarks.
✅ Generated text is nearly indistinguishable from human-written text.
✅ Limitations and societal impacts are openly discussed — a model for responsible AI research.
Deployment

Publish This Site to GitHub Pages

Push this project and publish with GitHub Pages.

Quick Publish Workflow
1) Push to GitHub
git add .
git commit -m "feat: GPT-3 paper explainer"
git push -u origin main
2) Enable GitHub Pages
Repo Settings → Pages → Deploy from branch → main → /(root) → Save