Interactive Paper Explainer

Attention Is
All You Need

A visual, step-by-step guide to the most cited AI paper of all time. Understand Transformers from the ground up with practical intuition, key formulas, architecture details, and training insights.

Start Learning Read the Paper ↗
200K+
Citations
28.4
BLEU Score (EN→DE)
12hrs
Training on 8 GPUs
2017
Year Published
History

From RNNs to Transformers

The Transformer didn't appear out of nowhere. Here's the evolution of sequence modeling.

1990s
Recurrent Neural Networks (RNN)
Process sequences word by word. Forget long-range dependencies. No parallelization.
1997
Long Short-Term Memory (LSTM)
Gates control memory. Better long-range memory. Still sequential. Still slow to train.
2014
Bahdanau Attention + RNN
Attention mechanism added on top of RNNs. First time attention showed its power for translation.
2017
🚀 Transformer — No RNN at all!
Vaswani et al. remove RNNs entirely. Pure attention. Parallel. Faster. Better.
2018–now
BERT, GPT, ChatGPT, Claude...
All modern LLMs are built on the Transformer. This paper started it all.
Key Insight

The fundamental breakthrough was realizing that attention alone — without any recurrence or convolution — is sufficient for sequence-to-sequence tasks.

This enabled massive parallelization during training, which meant you could train on much larger datasets in much less time.

TRAINING TIME COMPARISON
ConvS2S (best prior model): ~9.6×10¹⁸ FLOPs
Transformer (base): 3.3×10¹⁸ FLOPs
The Transformer uses roughly 3× less training compute than the best prior model.
Chapter 01

The Problem with RNNs

Before Transformers, the best sequence models were RNNs and LSTMs. But they had a fundamental flaw.

😰
RNN / LSTM Problems
  • Sequential processing — can't parallelize
  • Long sentences cause "context bottleneck"
  • Early words are compressed into a single vector
  • Gradient vanishes over long sequences
  • Slow to train on large datasets
🚀
Transformer Solutions
  • All tokens processed in parallel
  • Every token can directly "see" every other token
  • No information bottleneck — direct connections
  • Constant path length between any two positions
  • Train faster, scale to larger datasets

// RNN Context Bottleneck — the sentence "I grew up in France, so I speak fluent..." is squeezed, hidden state by hidden state (h₁, h₂, h₃, ...), into a single vector; by the time the model must predict the next word, "France" is almost forgotten.

The word "France" (early in the sentence) has almost no influence on predicting "French" at the end. Transformers fix this.

Chapter 02

Step 1 — Tokenization

Before any neural network can process text, it must be converted into numbers. Tokenization is how we break text into pieces called tokens.

// Interactive Tokenizer — type anything below
VOCAB SIZE
50,257
METHOD
BPE
MODEL
GPT-2
How BPE Works — Step by Step
1
Start with characters. Every word is split into individual characters + end-of-word marker.
o l d </w>  +  f i n e s t </w>
2
Count all adjacent pairs. Find the most frequent character pair across the whole corpus.
es → "es" appears 13x (finest×9 + lowest×4)
3
Merge the most frequent pair. Replace all occurrences with a new token.
finest</w>
4
Repeat until vocabulary size reached. GPT-2 performs 50,000 merges to build its 50,257-token vocabulary.
finest</w>
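The merge loop described in steps 1–4 can be sketched in a few lines of Python. This is a toy illustration, not GPT-2's actual byte-level implementation; the function name `bpe_merges` and its word-count input format are made up for this example.

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Toy BPE. word_counts maps word -> frequency; each word becomes
    a tuple of characters plus an end-of-word marker."""
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Step 3: merge the most frequent pair everywhere
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges, vocab
```

Running this on a corpus with "finest" ×9 and "lowest" ×4 makes ("e", "s") the first merge, matching the 13-count example above; GPT-2 simply repeats this loop 50,000 times.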
// Tokenizer Lab (Inspired by tiktokenizer)
Model-style controls + token ids + quick copy workflow for experimentation.
Inspired by: tiktokenizer.vercel.app and github.com/dqbd/tiktokenizer. This page uses an educational approximation for visualization.
Chapter 03

Step 2 — Word Embeddings

Tokens get converted into dense vectors of numbers. Similar words have similar vectors — the model learns geometry of meaning.

// Word Embedding Space — 2D Projection 768 dimensions in practice
Sample Embedding Vectors (d_model = 8, simplified)
Token | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 | dim_6 | dim_7
Why Positional Encoding?
Without Positional Encoding:
"Dog bites man" = "Man bites dog"
The model sees the same bag of tokens and can't tell them apart.
With Positional Encoding:
Each position gets a unique sinusoidal "fingerprint" added to the token embedding.
Positional Encoding Formula
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension uses a different frequency sine/cosine wave. Low dimensions oscillate fast (capture local order); high dimensions oscillate slowly (capture global position).

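The sinusoidal formula above translates directly into a few lines of NumPy. A minimal sketch (assuming an even d_model; the function name `positional_encoding` is ours):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal PE: sin at even dimensions, cos at odd dimensions.
    Assumes d_model is even."""
    pos = np.arange(max_len)[:, None]              # positions: (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dim indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe
```

Note there are no learned parameters: position 0 always encodes to (0, 1, 0, 1, ...), and every position gets a distinct pattern across the dimensions.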
Chapter 04

The Core — Self-Attention

This is the heart of the Transformer. Every token looks at every other token and decides how much to "attend" to each one.

Query, Key, Value — The Library Analogy
Q
Query
"What am I looking for?" — You walking into the library with a request. Computed as: Input × W_Q
K
Key
"What do I have?" — Each book's label on the shelf advertising its contents. Computed as: Input × W_K
V
Value
"What do I actually give you?" — The actual content of the book once you check it out. Computed as: Input × W_V
The Scaled Dot-Product Attention Formula
Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Q · Kᵀ
Attention Scores
Dot product between every Query and every Key. High score = high relevance.
÷ √d_k
Scaling
Prevents dot products from getting too large (which causes tiny gradients). d_k = 64 in the base model.
softmax
Normalization
Converts raw scores into probabilities (0 to 1, summing to 1). These are the attention weights.
· V
Weighted Sum
Multiply weights by Values. High-attention tokens contribute more to the output context vector.
// Interactive Self-Attention — click a word to see its attention weights Sentence: "The cat sat on the mat"
Step-by-Step Attention Computation
1
Compute Q, K, V matrices
Q = X · W_Q    shape: (seq_len × d_k)
K = X · W_K    shape: (seq_len × d_k)
V = X · W_V    shape: (seq_len × d_v)
// In base model: d_k = d_v = 64, d_model = 512
2
Compute raw attention scores
scores = Q · Kᵀ    shape: (seq_len × seq_len)
// Every token attends to every other token — O(n²) complexity
3
Scale + Softmax
scaled = scores / √d_k    // √64 = 8
weights = softmax(scaled)    // sums to 1 per row
4
Weighted sum of Values
output = weights · V    shape: (seq_len × d_v)
// Each token now has a context-aware representation
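The four steps above fit in one short NumPy function. This is a sketch of the formula only (no masking, no batching, no learned projections — X is assumed to already be the Q/K/V matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step 2+3a: (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability trick
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # step 3b: rows sum to 1
    return weights @ V, weights                    # step 4: weighted sum of Values
```

Each row of `weights` is one token's attention distribution over the whole sequence — exactly what the heatmap widget visualizes.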
Chapter 05

Multi-Head Attention

Instead of one attention function, we run h=8 attention heads in parallel. Each head learns to focus on different aspects of language.

// 8 Attention Heads — each learns different patterns d_k = d_model / h = 512/8 = 64 per head
Multi-Head Formula
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wᴼ
where head_i = Attention(Q·Wᵢ_Q, K·Wᵢ_K, V·Wᵢ_V)
Why multiple heads?
A single head averages all attention — it misses nuance. Multiple heads can specialize: one for syntax, one for semantics, one for coreference resolution.
The paper's ablation study (Table 3) found 8 heads a sweet spot: a single head scored about 0.9 BLEU worse, and quality also dropped with too many heads. Too few — can't capture all relationships. Too many — dimensions per head become too small.
ABLATION: NUMBER OF HEADS (Table 3)
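The split-into-heads, attend-in-parallel, concatenate-and-project pipeline can be sketched as follows. This is our illustrative NumPy version (function names `split_heads` and `multi_head_attention` are ours; real implementations also add masking and batching):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, h):
    """(seq_len, d_model) -> (h, seq_len, d_k) with d_k = d_model // h."""
    seq_len, d_model = x.shape
    return x.reshape(seq_len, h, d_model // h).transpose(1, 0, 2)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h=8):
    # Project once, then split d_model=512 into h=8 heads of d_k=64
    Q, K, V = (split_heads(X @ W, h) for W in (Wq, Wk, Wv))
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, n, n)
    heads = weights @ V                                         # (h, n, d_k)
    # Concat heads back to (n, d_model), then apply the output projection Wᴼ
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ Wo
```

Note the total compute is about the same as one full-width head — the model dimension is split across heads, not multiplied by them.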
Chapter 06

Full Architecture

The complete Transformer: an Encoder stack and a Decoder stack, connected through cross-attention.

ENCODER — N=6 identical layers
🔍
Multi-Head Self-Attention
All positions attend to all positions
d_model=512
Add & Norm
Residual connection + LayerNorm
Feed-Forward Network
FFN(x) = max(0, xW₁+b₁)W₂+b₂
d_ff=2048
Add & Norm
Second residual + LayerNorm
DECODER — N=6 identical layers
🔒
Masked Multi-Head Self-Attention
Can't see future tokens (causal mask)
masked
Add & Norm
Residual + LayerNorm
🔗
Cross-Attention
Q from decoder, K/V from encoder output
cross
Feed-Forward Network
Same as encoder FFN
d_ff=2048
Key Hyperparameters — Base vs Big Model
Parameter       | Symbol   | Base Model | Big Model | Meaning
Layers          | N        | 6          | 6         | Stack depth
Model dimension | d_model  | 512        | 1024      | Embedding size
FFN inner dim   | d_ff     | 2048       | 4096      | 4× d_model
Attention heads | h        | 8          | 16        | Parallel heads
Key/Value dim   | d_k, d_v | 64         | 64        | d_model / h
Dropout         | P_drop   | 0.1        | 0.3       | Regularization
Parameters      |          | 65M        | 213M      | Total trainable
Chapter 07

Training — Details

The optimizer, learning rate schedule, and regularization that made the Transformer work.

Optimizer
Adam (β₁=0.9, β₂=0.98, ε=10⁻⁹)

Standard Adam with an unusual β₂=0.98 — lower than the common default of 0.999, so the second-moment estimate has a shorter effective history and adapts faster during warmup. The small ε prevents division by zero.

Learning Rate Schedule
lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5))

Increase linearly for first 4,000 steps (warmup), then decrease proportionally to inverse square root.
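The schedule is simple enough to write out directly. A minimal sketch (the function name `transformer_lr` is ours):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup to the peak at `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid step^(-0.5) blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` cross exactly at step = warmup, which is where the learning rate peaks (≈7×10⁻⁴ for the base model).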

Residual Dropout
Applied to output of each sub-layer, and to embedding+PE sums. P_drop=0.1 for base model.
Label Smoothing
ε_ls=0.1. Instead of hard 0/1 targets, the correct class gets probability 1−ε = 0.9 and the remaining 0.1 is spread across the rest of the vocabulary. Hurts perplexity but improves accuracy and BLEU.
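One common way to implement this smoothing (a sketch; the function name `smooth_labels` and the exact way the mass is distributed over non-target classes vary between implementations):

```python
import numpy as np

def smooth_labels(target, vocab_size, eps=0.1):
    """Correct class gets 1 - eps; eps is spread uniformly over the others."""
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target] = 1.0 - eps
    return dist
```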
Hardware
8× NVIDIA P100 GPUs. Base model: 100K steps, ~12 hours. Big model: 300K steps, 3.5 days.
Chapter 08

Deep Dive — How Transformers Actually Scale

This section connects the paper to modern large language model behavior: objective, inference loop, and where quality gains come from in practice.

From Training Objective to Generation
  • Pretraining objective: predict next token using cross-entropy loss over massive corpora.
  • Gradient signal: each token prediction updates shared weights, teaching syntax + semantics jointly.
  • Instruction tuning: post-training adapts base model to follow human prompts and formats.
  • Sampling at inference: temperature, top-k, and top-p convert logits into controlled generation behavior.
  • Emergent behavior: scaling data, parameters, and compute often unlocks capabilities nonlinearly.
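The sampling step in the list above can be made concrete. A hedged sketch of temperature plus top-k decoding (one common strategy among several; `sample_next_token` is our name, and real decoders add top-p, repetition penalties, etc.):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Turn next-token logits into one sampled token id."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                  # k-th highest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())                 # softmax over survivors
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Low temperature sharpens the distribution toward the top logit; top_k=1 reduces to greedy decoding.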
Inference Pipeline (Token by Token)
Step A
Prompt tokens enter the model and produce logits for the next token.
Step B
Decoding strategy selects one token, appends it, and repeats.
Step C
KV cache stores past attention states to avoid recomputing old tokens each step.
Practical implication: latency and memory become first-class constraints at long context lengths.
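The KV cache idea in Step C can be illustrated with a toy class: each generation step appends one new key/value row and attends over everything cached so far, instead of recomputing attention for the whole prefix. This is a single-head sketch with our own class name `KVCache`; production caches are batched, multi-head tensors.

```python
import numpy as np

class KVCache:
    """Toy KV cache: keep past K/V rows so each new token attends cheaply."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        # Cache this step's key/value; old rows are never recomputed.
        self.K.append(k)
        self.V.append(v)
        K, V = np.stack(self.K), np.stack(self.V)   # (t, d_k)
        scores = K @ q / np.sqrt(len(q))            # attend over all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                # context vector for this step
```

This is why long contexts stress memory: the cache grows linearly with every generated token, per layer and per head.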
Why Quality Improves with Scale (Intuition)
More Data
Expands language coverage and reduces blind spots in rare patterns.
More Parameters
Increases representational capacity for nuanced concepts and relations.
More Compute
Lets optimization run longer and deeper, improving final perplexity and utility.
Better Post-Training
Alignment and instruction tuning convert raw capability into practical behavior.
Chapter 09

LLM Lens — Missing Intuition Added

Added from modern Transformer explainers: next-token probabilities, scaling law intuition, and how pretraining transfers into downstream tasks.

// Next Token Probability Lab
Autoregressive generation (one token at a time)
Why "Large" Matters
As parameter count grows, models often cross capability thresholds. Some tasks stay near-zero, then jump sharply once scale is sufficient.
Pretraining → Fine-Tuning → Transfer
1) Pretraining learns general language statistics. 2) Fine-tuning adapts to a task. 3) Transfer learning reuses the same base model across many tasks.
Context Vectors, Not Static Tokens
A token embedding is static. Self-attention transforms it into a context vector that changes with sentence context.
Mask Before Softmax
Causal masking should set future positions to −∞ before softmax, preventing probability leakage into future tokens.
Dropout in Attention
Attention dropout regularizes by randomly dropping valid attention links during training while preserving causal constraints.
Causal Masking Lab (Before vs After Softmax)
This visual shows why masking must happen before softmax. If future tokens participate in softmax first, they steal probability mass even if you zero them later.
// Decoder Masking Simulator
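The "before vs after" argument can be checked numerically in a few lines (a self-contained demo with made-up scores):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 3.0]])  # query at position 0; positions 1, 2 are future

# Wrong order: softmax first, zero the future later.
# The future tokens already "stole" probability mass, so the row no longer sums to 1.
wrong = softmax(scores)
wrong[:, 1:] = 0.0

# Right order: mask with -inf BEFORE softmax (exp(-inf) = 0),
# so all probability mass stays on the visible tokens.
masked = scores.copy()
masked[:, 1:] = -np.inf
right = softmax(masked)
```

With these scores, `wrong` keeps only ≈0.24 of the probability mass, while `right` correctly assigns all of it to the one visible token.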
Normalization Lab: Simple vs Softmax
Based on the blog's examples, this shows why softmax creates sharper, valid probability distributions and handles negative scores safely.
// Attention Score Normalization
Simple normalize x/sum(x)
Softmax exp(x)/sum(exp(x))
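The two normalizations above differ visibly on a score vector with a negative entry (a self-contained demo with made-up scores):

```python
import numpy as np

scores = np.array([6.0, 2.0, -1.0])

# Simple normalization: breaks with negative scores — entries can leave [0, 1]
simple = scores / scores.sum()

# Softmax: always a valid distribution (positive, sums to 1), and sharper —
# exponentiation amplifies the gap between the top score and the rest
soft = np.exp(scores) / np.exp(scores).sum()
```

Here `simple` contains a negative "probability" for the −1 score, while softmax maps it to a small positive weight and concentrates ≈98% of the mass on the top score.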
Chapter 10

Practical Limits and Modern Fixes

The original Transformer is brilliant, but not perfect. Here are the major trade-offs and what modern architectures do to improve them.

Quadratic Attention Cost
Self-attention compares every token with every token, so memory and compute scale as O(n²). Long contexts become expensive.
Context Length Ceiling
Longer context windows require larger KV caches and higher latency. This affects both training cost and inference throughput.
Data + Alignment Needs
Raw scale alone is not enough. Data quality, instruction tuning, and alignment methods strongly affect usefulness and safety.
How Newer Models Address This
FlashAttention
Faster, memory-efficient exact attention kernels.
Grouped-Query Attention
Reduces KV cache size for faster generation.
Mixture of Experts
Activates only parts of model per token for better efficiency.
Long-context strategies
RoPE scaling, sliding-window attention, retrieval augmentation.
Chapter 11

Results — State of the Art

The Transformer outperformed every previous model while using a fraction of the compute.

Transformer (big) — EN→DE
28.4
BLEU Score
Previous best (GNMT Ensemble)
26.3
BLEU Score — EN→DE
Transformer (big) — EN→FR
41.8
BLEU Score
Previous best (ConvS2S Ensemble)
41.3
BLEU Score — EN→FR
BLEU Score Comparison — EN→DE (All Models)
Self-Attention vs Recurrent vs Convolutional (Table 1)
Layer Type             | Complexity/Layer | Sequential Ops | Max Path Length | Winner?
Self-Attention         | O(n² · d)        | O(1)           | O(1)            | ✓ Best
Recurrent              | O(n · d²)        | O(n)           | O(n)            |
Convolutional          | O(k · n · d²)    | O(1)           | O(log_k n)      | ~
Self-Attn (restricted) | O(r · n · d)     | O(1)           | O(n/r)          | ~

n = sequence length, d = dimension, k = kernel size, r = restriction neighborhood. Self-attention wins on path length (O(1) means direct connection!) but has quadratic complexity in sequence length.

Test Yourself

Quick Quiz

Check your understanding of the key concepts from the paper.

Reference

All Key Formulas

A complete reference of every important formula from the paper.

Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Core attention mechanism. Scale by √d_k to prevent gradient saturation. Apply softmax to get weights, multiply by values.
Multi-Head Attention
MultiHead(Q,K,V) = Concat(head₁,...,headₕ)Wᴼ
head_i = Attention(QWᵢQ, KWᵢK, VWᵢV)
h=8 attention heads run in parallel. Each projects to d_k=64 dimensions. Concat output projected back to d_model=512.
Feed-Forward Network
FFN(x) = max(0, xW₁+b₁)W₂+b₂
Two linear layers with ReLU activation. d_model=512 → d_ff=2048 → d_model=512. Applied position-wise.
Positional Encoding
PE(pos,2i) = sin(pos/10000^(2i/d_model))
PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
Sine/cosine at different frequencies. Added to token embeddings. Unique fingerprint for each position. No learned parameters.
Sub-Layer Output (Residual)
Output = LayerNorm(x + Sublayer(x))
Residual connection ensures gradient flows even through deep networks. Layer norm stabilizes training. Applied after every sub-layer.
Learning Rate Schedule
lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5))
Linear warmup for 4000 steps, then inverse square root decay. Prevents exploding gradients early in training.
Checklist: If You Understand This Page, You Can Explain
Why recurrence was removed and what was gained from parallelism.
How tokenization, embeddings, and positional encoding work together.
How Q, K, V produce attention weights and context vectors.
Why multi-head attention improves expressiveness.
How encoder-decoder cross-attention enables translation.
Where Transformers still struggle and how modern variants help.
Deployment

Publish This Site to GitHub Pages

Everything is set up for static hosting. Use the commands below to push this project and publish it with GitHub Pages.

Quick Publish Workflow
1) Initialize and push repo
Run these commands in your project folder to publish this exact repository.
git init
git add .
git commit -m "feat: product-ready attention explainer"
git branch -M main
git remote add origin https://github.com/AdilShamim8/PaperMap.git
git push -u origin main
2) Enable GitHub Pages
On GitHub: Repo Settings → Pages → Build and deployment
Set Source to Deploy from a branch, select main and /(root), then Save.
https://adilshamim8.github.io/PaperMap/
Product note: a root index.html file is included so GitHub Pages works out of the box.