A visual, step-by-step guide to the landmark paper that showed scaling language models
to 175 billion parameters unlocks remarkable in-context learning — no fine-tuning needed.
100× GPT-2. Few-shot learning without fine-tuning. Changed how we think about AI capabilities.
2022–now
ChatGPT, GPT-4, Claude...
GPT-3's in-context learning paradigm became the foundation for all modern AI assistants.
Key Insight
The fundamental breakthrough: scaling up language models massively improves their ability to learn tasks from just a few examples in the prompt — no gradient updates needed.
PARAMETER GROWTH
Chapter 01
The Problem with Fine-Tuning
Before GPT-3, the dominant approach was: pre-train a big model, then fine-tune on each task. But this had serious limitations.
😰
Fine-Tuning Problems
Need thousands of labeled examples per task
Separate model copy for every single task
Can overfit on narrow distributions
Doesn't match how humans learn (from few examples)
Expensive to retrain for each new application
🚀
GPT-3's Solution
Learn tasks from just a few examples in the prompt
One single model handles all tasks
No gradient updates or weight changes needed
Mimics human-like rapid task adaptation
Just describe the task in natural language
Analogy — Why Fine-Tuning is Like Hiring a Specialist
🔧 Fine-tuning = Hiring specialists
Need a translator? Hire a translator. Need a summarizer? Hire a summarizer. Each specialist needs training (data), salary (compute), and space (storage). 50 tasks = 50 specialists.
🧠 Few-shot = One genius polymath
Show the polymath 2-3 examples of any task, and they figure it out. One person, all tasks. That's GPT-3: show it examples in the prompt, and it adapts instantly.
Chapter 02
The Core Idea — In-Context Learning
GPT-3 learns tasks at inference time by conditioning on examples in the prompt. No training loop. No backpropagation. Just text.
Zero-Shot
No examples — just a task description
The model receives only a natural language instruction. "Translate English to French: cheese →"
Prompt: "Translate to French: cheese →" Output: "fromage"
One-Shot
One example + task description
The model sees one (input, output) demonstration, then must apply the same pattern to a new input.
sea → mer
cheese → fromage
Few-Shot
K examples (10–100) + task description
The model sees K demonstrations. Performance improves with more examples. This is GPT-3's sweet spot.
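The three settings differ only in how the prompt is assembled — zero, one, or K demonstrations, then the query. A minimal sketch (the `=>` separator and the sea-otter pair follow the paper's Figure 2.1; the helper itself is illustrative, not OpenAI code):

```python
def build_prompt(task_description, examples, query):
    """Assemble a GPT-3-style prompt: zero-shot (no examples),
    one-shot (one example), or few-shot (K examples).
    No weights change; 'learning' happens purely via conditioning."""
    lines = [task_description]
    for src, tgt in examples:          # K = 0, 1, or 10-100 demonstrations
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")        # the model completes after the arrow
    return "\n".join(lines)

# One-shot translation prompt, as in the paper's Figure 2.1
prompt = build_prompt("Translate English to French:",
                      [("sea otter", "loutre de mer")],
                      "cheese")
```

Passing an empty example list yields the zero-shot prompt; passing 10–100 pairs yields the few-shot prompt — the model and decoding are identical in all three cases.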
Books1 & Books2: 12B + 55B tokens, 8% + 8% mixture weight. Long-form text for coherent reasoning.
Wikipedia: 3B tokens, 3% mixture weight. High-quality factual knowledge.
Optimizer
Adam (β₁=0.9, β₂=0.95, ε=10⁻⁸). Gradient clipped at global norm 1.0. Weight decay = 0.1.
Learning Rate
Cosine decay to 10% over 260B tokens. Linear warmup for first 375M tokens. Batch size ramped linearly over first 4-12B tokens.
Hardware & Data
V100 GPU cluster (Microsoft). 93% English, 7% other languages. Data sampled without replacement. Trained for 300B tokens total.
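The learning-rate schedule described above can be sketched as a single function. The peak LR of 0.6×10⁻⁴ is the 175B model's value from the paper's Table 2.1; treat this as an illustration of the schedule, not the released training code:

```python
import math

WARMUP_TOKENS = 375e6    # linear warmup over the first 375M tokens
DECAY_TOKENS = 260e9     # cosine decay horizon: 260B tokens
MAX_LR = 0.6e-4          # peak LR for the 175B model (paper Table 2.1)
MIN_LR = 0.1 * MAX_LR    # schedule decays to 10% of the peak

def lr_at(tokens_seen: float) -> float:
    """GPT-3-style schedule: linear warmup, then cosine decay to 10% of peak,
    then constant at the floor for any remaining tokens."""
    if tokens_seen < WARMUP_TOKENS:
        return MAX_LR * tokens_seen / WARMUP_TOKENS
    progress = min((tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Since training ran for 300B tokens total, the last ~40B tokens are spent at the 10% floor.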
CommonCrawl Filtering Pipeline (Appendix A)
1. Quality Filter
Trained a logistic regression classifier using WebText as positive examples and raw CommonCrawl as negative. Documents kept probabilistically (α=9 Pareto distribution).
2. Fuzzy Dedup
MinHashLSH with 10 hashes to remove documents with high overlap. WebText also removed from CommonCrawl. Reduced dataset ~10%.
3. Benchmark Removal
Attempted to remove overlaps with benchmark test/dev sets. A bug caused partial removal only — analyzed in Section 4 (Data Contamination).
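The probabilistic keep rule in step 1 comes straight from Appendix A: a document is kept if `np.random.pareto(9) > 1 - document_score`, so high-scoring documents are nearly always kept while low-scoring ones occasionally survive, preserving diversity. A stdlib-only sketch (the inverse-transform draw below reproduces NumPy's Lomax-style `pareto`; the scoring classifier itself is not shown):

```python
import random

rng = random.Random(0)

def pareto_draw(alpha: float) -> float:
    """Lomax sample equivalent to np.random.pareto(alpha),
    via inverse transform: U^(-1/alpha) - 1 for U ~ Uniform(0, 1)."""
    return rng.random() ** (-1.0 / alpha) - 1.0

def keep_document(quality_score: float, alpha: float = 9) -> bool:
    """Appendix A keep rule: quality_score is the logistic-regression
    probability that the document resembles WebText (the positive class)."""
    return pareto_draw(alpha) > 1.0 - quality_score
```

With a score of 1.0 the document is always kept; with a score of 0.0 it is kept only when the Pareto draw exceeds 1, which for α = 9 happens about 0.2% of the time (2⁻⁹).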
Chapter 05
Scaling Laws — Bigger is Better
One of the paper's most important findings: performance scales smoothly and predictably as model size increases across most tasks.
Model Size vs Performance
Larger models show dramatically better few-shot performance. This is the paper's central empirical result.
⚡ LAW 1: Prediction Before Explanation
Prediction: "If I make a model 10× bigger, will few-shot accuracy go up?" Yes. The paper shows near-linear improvement on log-scale for most benchmarks. This was surprising — many expected diminishing returns.
🔴 LAW 2: Failure Modes Over Features
Where does scaling fail? Some tasks (like natural language inference, reading comprehension on specific datasets like ANLI) showed minimal improvement even at 175B. Scale alone doesn't solve everything.
🟢 LAW 3: Compression — The Key Takeaway in One Sentence
"Scale the model big enough, and it can learn new tasks just from a few examples in the prompt — no fine-tuning, no gradient updates, no new parameters."
Chapter 06
Benchmark Results
GPT-3 was tested on 42+ benchmarks across language modeling, QA, translation, common sense, reading comprehension, and more.
LAMBADA (Few-shot)
86.4%
Accuracy (prev SOTA: 68%)
CoQA (Few-shot)
85.0
F1 Score (human: 90.7)
TriviaQA (Few-shot)
71.2%
Accuracy (SOTA in closed-book)
PTB Perplexity (Zero-shot)
20.5
New SOTA (prev: 35.8)
Comprehensive Results (Table from Paper)
| Task Category | Benchmark | Zero-shot | One-shot | Few-shot | Fine-tuned SOTA |
|---|---|---|---|---|---|
| LM | LAMBADA (acc) | 76.2% | 72.5% | 86.4% | 68.0% |
| LM | HellaSwag (acc) | 78.9% | 78.1% | 79.3% | 85.6% |
| LM | StoryCloze (acc) | 83.2% | 84.7% | 87.7% | 91.8% |
| QA | NaturalQS (acc) | 14.6% | 23.0% | 29.9% | 36.6% |
| QA | WebQS (acc) | 14.4% | 25.3% | 41.5% | 45.5% |
| QA | TriviaQA (acc) | 64.3% | 68.0% | 71.2% | 68.0% |
| Translation | EN→FR (BLEU) | 25.2 | 28.3 | 32.6 | 45.6 |
| Translation | FR→EN (BLEU) | 21.2 | 33.7 | 39.2 | 35.0 |
| Translation | DE→EN (BLEU) | 27.2 | 30.4 | 40.6 | 40.2 |
| Winograd | Winograd (acc) | 88.3% | 89.7% | 88.6% | 90.1% |
| Winograd | Winogrande (acc) | 70.2% | 73.2% | 77.7% | 84.6% |
| Reasoning | PIQA (acc) | 80.5% | 80.5% | 82.8% | 79.4% |
| Reasoning | ARC-Challenge (acc) | 51.4% | 53.2% | 51.5% | 78.5% |
| Reasoning | OpenBookQA (acc) | 57.6% | 58.8% | 65.4% | 87.2% |
| Reading | CoQA (F1) | 81.5 | 84.0 | 85.0 | 90.7 |
| Reading | DROP (F1) | 23.6 | 34.3 | 36.5 | 89.1 |
| Reading | SQuAD 2.0 (F1) | 59.5 | 65.4 | 69.8 | 93.0 |
| SuperGLUE | SuperGLUE (avg) | — | — | 71.8 | 89.0 |
| NLI | ANLI R3 (acc) | 33.5% | 34.3% | 40.2% | — |
Green = best in category (GPT-3 beats fine-tuned SOTA). Red = weak spots. References: Tables 3.1–3.9 in paper.
SAT Analogies
Few-shot: 65.2%, One-shot: 59.1%, Zero-shot: 53.7%. Average score among college applicants: 57%. Few-shot GPT-3 beats the average human test-taker.
Word Scrambling (Few-shot)
Random insertion: 67.2%, Cycle letters: 37.9%, Anagrams (A2): 39.7%. Cannot reverse words (0.4%). BPE encoding makes character-level tasks harder.
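The scrambling tasks are simple string transforms applied to real words; the model must invert them from a few demonstrations. A sketch of the task generators (my reconstruction of the transforms described in Section 3.9.2; exact insertion characters and shuffle details may differ from the paper's):

```python
import random

rng = random.Random(0)

def cycle_letters(word: str) -> str:      # CL: rotate the letters of the word
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_a2(word: str) -> str:         # A2: shuffle all but first and last letter
    mid = list(word[1:-1])
    rng.shuffle(mid)
    return word[0] + "".join(mid) + word[-1]

def random_insertion(word: str) -> str:   # RI: random punctuation/space between letters
    return "".join(c + rng.choice([" ", ".", ""]) for c in word)

def reversed_word(word: str) -> str:      # RW: the task GPT-3 almost never solves (0.4%)
    return word[::-1]
```

Because BPE tokenization groups several characters into one token, the model rarely "sees" individual letters, which is why RW is nearly impossible while RI (where letters stay in order) is the easiest.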
Novel Word Usage & Grammar
GPT-3 correctly uses made-up words in sentences after seeing one definition. Also corrects English grammar from few examples with high accuracy.
Performance by Task Category (Few-shot)
Chapter 07
Text Generation Quality
GPT-3 can generate news articles that human evaluators struggle to distinguish from real human-written articles. The paper quantifies this with a rigorous human study.
Human Detection Accuracy by Model Size (Table 3.11)
80 US-based participants were shown ~200 word articles and asked to distinguish human-written vs model-generated. 50% = random chance.
| Model | Mean Human Accuracy | 95% CI | "I don't know" Rate |
|---|---|---|---|
| Control (bad model) | 86% | 83%–90% | 3.6% |
| GPT-3 Small (125M) | 76% | 72%–80% | 4.9% |
| GPT-3 Medium (350M) | 61% | 58%–65% | 6.0% |
| GPT-3 XL (1.3B) | 62% | 59%–65% | 7.5% |
| GPT-3 6.7B | 60% | 56%–63% | 6.2% |
| GPT-3 13B | 55% | 52%–58% | 7.1% |
| GPT-3 175B | 52% | 49%–54% | 7.8% |
At 175B, humans are at near-random chance (~52%) at detecting AI-generated text. p-value vs control: 1×10⁻³⁴
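A quick way to see why 52% counts as "near chance": a normal-approximation confidence interval around the observed accuracy contains 50%. (The per-model judgment count is not stated in this summary; n = 1,500 below is a hypothetical round number chosen to reproduce an interval of roughly the reported width.)

```python
import math

def normal_ci(p_hat: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a binomial proportion."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

lo, hi = normal_ci(0.52, 1500)   # n = 1500 is a hypothetical judgment count
# 0.50 (random guessing) lies inside the interval
```

Contrast with the control model: an 86% accuracy with a similar n would sit dozens of standard errors away from chance, which is what the tiny p-value against control reflects.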
Methodology
25 article titles/subtitles from newser.com (~215 words). Completions generated from 4 model sizes. Articles formatted programmatically (no human cherry-picking). Same context window and prompts across all models.
Longer Articles (~500 words)
12 Reuters articles (~569 words) tested separately. Accuracy was still ~52% for GPT-3 175B, barely above chance. Common indicators: factual inaccuracies, repetition, and non-sequiturs.
Novel Word Usage — Learning Words from Definitions
A "whatpu" is a small, furry animal native to Tanzania.
An example: We were traveling in Africa and we saw these very cute whatpus.
To "screeg" something is to swing a sword at it.
An example: We screeghed at each other for several minutes and then went outside and ate ice cream.
Boldface = GPT-3 completions. The model invents plausible conjugations ("screeghed") from a single definition.
Chapter 08
Data Contamination Analysis
Section 4 of the paper addresses a critical concern: did GPT-3 memorize benchmark test sets from its training data?
Methodology (13-gram overlap)
Conservative Filtering
Any example with a 13-gram overlap with training data was flagged as "potentially contaminated." A "clean" subset was created for each benchmark by removing all flagged examples.
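The flagging rule can be sketched as a set intersection over 13-grams (tokenization here is simplified to lowercased whitespace words; the paper's exact normalization differs):

```python
def ngrams_13(text: str, n: int = 13):
    """All consecutive 13-word sequences in a text, as hashable tuples."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_flagged(example: str, training_ngrams: set) -> bool:
    """An example is 'potentially contaminated' if it shares
    any single 13-gram with the training corpus."""
    return not ngrams_13(example).isdisjoint(training_ngrams)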
Key Finding
Although a quarter of benchmarks had >50% potential contamination, in most cases performance on clean vs. full datasets changed negligibly. A bug prevented full removal during training.
✅ No Effect
Most benchmarks: Reading comp source text found but not Q/A pairs. Translation: monolingual matches only, no paired sentences. Performance unchanged.
⚠️ Flagged
PIQA: 29% flagged, 3% drop on clean subset (*marked in paper). Winograd: some schemas found in training data, small effect on results.
❌ Removed
4 Wikipedia LM benchmarks and Children's Book Test were entirely contained in training data — results not reported.
Chapter 09
Limitations & Societal Impact
Sections 5–6 of the paper are remarkably honest about GPT-3's limitations and potential societal harms.
Technical Limitations (Section 5)
Repetition & Coherence
GPT-3 sometimes loses coherence in long texts, repeats itself, contradicts earlier statements, and occasionally includes non-sequitur paragraphs. No persistent memory.
Bidirectional Tasks
As a left-to-right autoregressive model, GPT-3 struggles on tasks requiring bidirectional context (fill-in-the-blank, comparison tasks like WiC and NLI).
Sample Efficiency
175B parameters trained on 300B tokens is still far less sample-efficient than human learning. Humans see far less text in a lifetime yet learn language far more efficiently.
Interpretability
It's unclear what GPT-3 "knows" vs. what it's pattern-matching. The paper doesn't claim understanding — only performance on benchmarks.
Learning vs. Recognizing
An open question: does few-shot learning truly learn "from scratch" at inference time, or does it simply recognize tasks already seen during training? The paper acknowledges this spectrum.
No Grounding
GPT-3 lacks grounding in physical experience, video, or real-world interaction. Future directions: learning objectives from humans, RL fine-tuning, multimodal inputs.
Broader Impacts (Section 6)
6.1 — Misuse of Language Models
Potential Misuse
Misinformation, spam, phishing, fraudulent essays, social engineering. Quality of text synthesis directly increases misuse potential.
Threat Actors
Low- and mid-skill actors showed interest after GPT-2's release, but no successful deployments were observed. Advanced persistent threats (APTs) have not yet found language models significantly better than their existing methods.
Countermeasures
Automatic discriminators (GROVER, GLTR) may outperform humans at detection. Promising area for future research. Watermarking and detection tools needed.
6.2 — Fairness, Bias, and Representation
Gender Bias
83% of 388 occupations tested were more likely followed by a male gender identifier. High-education occupations (legislator, professor) skewed heavily male. Female identifiers associated more with appearance words ("beautiful", "gorgeous").
Race Sentiment
Sentiment analysis of generated text varied by race prompt. The paper found that models reflect socio-historical associations from training data: "Asian" had consistently high positive sentiment across model sizes, while "Black" had consistently low sentiment.
Religion
Islam disproportionately co-occurred with words like "terrorism" and "violent." Buddhism associated with "peace" and "enlightenment." Models reflect internet-scale stereotypes from training data.
6.3 — Energy Usage
Training Cost
GPT-3 175B consumed about 3,640 petaflop/s-days of compute during pre-training, vs. tens of petaflop/s-days for GPT-2 1.5B. Trained on a V100 GPU cluster provided by Microsoft.
Amortized Efficiency
Once trained, generating 100 pages of content costs roughly 0.4 kWh of energy (a few cents). Model distillation can further reduce cost. The paradigm: train one large model once, then derive efficient versions from it.
Paper Conclusion
Conclusion — Section 8
The paper's conclusion summarizes the key contributions and looks forward.
We presented GPT-3, a 175 billion parameter language model that demonstrates strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings — in some cases nearly matching the performance of state-of-the-art fine-tuned systems.
The paper documented roughly predictable trends of scaling in performance without using fine-tuning. It also discussed the social impacts of this class of model.
Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.
Test Yourself
Quick Quiz
Check your understanding of the key concepts from the GPT-3 paper.
Reference
Key Takeaways
Everything you need to remember about this paper.
✅ Scaling model size improves few-shot performance smoothly across most tasks.
✅ In-context learning requires no gradient updates — tasks specified via text prompt.
✅ 175B parameters — 10× larger than any previous non-sparse language model.
✅ Few-shot GPT-3 is competitive with fine-tuned SOTA on many benchmarks.
✅ Generated text is nearly indistinguishable from human-written text.
✅ Limitations and societal impacts are openly discussed — a model for responsible AI research.
Deployment
Publish This Site to GitHub Pages
Push this project and publish with GitHub Pages.
Quick Publish Workflow
1) Push to GitHub
git add .
git commit -m "feat: GPT-3 paper explainer"
git push -u origin main
2) Enable GitHub Pages
Repo Settings → Pages → Deploy from branch → main → /(root) → Save