Interactive Paper Explainer

Attention Is
All You Need

A visual, step-by-step guide to the most cited AI paper of all time. Understand Transformers from the ground up with practical intuition, key formulas, architecture details, and training insights.

Start Learning Read the Paper ↗
200K+
Citations
28.4
BLEU Score (EN→DE)
12hrs
Training on 8 GPUs
2017
Year Published
History

From RNNs to Transformers

The Transformer didn't appear out of nowhere. Here's the evolution of sequence modeling.

1990s
Recurrent Neural Networks (RNN)
Process sequences word by word. Forget long-range dependencies. No parallelization.
1997
Long Short-Term Memory (LSTM)
Gates control memory. Better long-range memory. Still sequential. Still slow to train.
2014
Bahdanau Attention + RNN
Attention mechanism added on top of RNNs. First time attention showed its power for translation.
2017
🚀 Transformer — No RNN at all!
Vaswani et al. remove RNNs entirely. Pure attention. Parallel. Faster. Better.
2018–now
BERT, GPT, ChatGPT, Claude...
All modern LLMs are built on the Transformer. This paper started it all.
Key Insight

The fundamental breakthrough was realizing that attention alone — without any recurrence or convolution — is sufficient for sequence-to-sequence tasks.

This enabled massive parallelization during training, which meant you could train on much larger datasets in much less time.

TRAINING TIME COMPARISON
ConvS2S (best prior model): ~9.6×10¹⁸ FLOPs
Transformer (base): 3.3×10¹⁸ FLOPs
The Transformer uses roughly 3× less training compute than the best prior model.
Chapter 01

The Problem with RNNs

Before Transformers, the best sequence models were RNNs and LSTMs. But they had a fundamental flaw.

😰
RNN / LSTM Problems
  • Sequential processing — can't parallelize
  • Long sentences cause "context bottleneck"
  • Early words are compressed into a single vector
  • Gradient vanishes over long sequences
  • Slow to train on large datasets
🚀
Transformer Solutions
  • All tokens processed in parallel
  • Every token can directly "see" every other token
  • No information bottleneck — direct connections
  • Constant path length between any two positions
  • Train faster, scale to larger datasets

// RNN Context Bottleneck — the sentence "I grew up in France, so I speak fluent..." is squeezed, hidden state by hidden state (h₁, h₂, h₃, ...), into a single vector; by the time the model must predict the next word, "France" is almost forgotten.

The word "France" (early in the sentence) has almost no influence on predicting "French" at the end. Transformers fix this.

Chapter 02

Step 1 — Tokenization

Before any neural network can process text, it must be converted into numbers. Tokenization is how we break text into pieces called tokens.

// Interactive Tokenizer — type anything below
VOCAB SIZE
50,257
METHOD
BPE
MODEL
GPT-2
How BPE Works — Step by Step
1
Start with characters. Every word is split into individual characters + end-of-word marker.
o l d </w>  +  f i n e s t </w>
2
Count all adjacent pairs. Find the most frequent character pair across the whole corpus.
es → "es" appears 13x (finest×9 + lowest×4)
3
Merge the most frequent pair. Replace all occurrences with a new token.
finest</w>
4
Repeat until vocabulary size reached. GPT-2 performs 50,000 merges to build its 50,257-token vocabulary.
finest</w>
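The merge loop described in steps 1–4 can be sketched in a few lines of Python. This is a toy illustration, not GPT-2's actual byte-level implementation; the function name `bpe_merges` and its word-count input format are made up for this example.

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Toy BPE. word_counts maps word -> frequency; each word becomes
    a tuple of characters plus an end-of-word marker."""
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Step 3: merge the most frequent pair everywhere
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges, vocab
```

Running this on a corpus with "finest" ×9 and "lowest" ×4 makes ("e", "s") the first merge, matching the 13-count example above; GPT-2 simply repeats this loop 50,000 times.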
// Tokenizer Lab (Inspired by tiktokenizer)
Model-style controls + token ids + quick copy workflow for experimentation.
Inspired by: tiktokenizer.vercel.app and github.com/dqbd/tiktokenizer. This page uses an educational approximation for visualization.
Chapter 03

Step 2 — Word Embeddings

Tokens get converted into dense vectors of numbers. Similar words have similar vectors — the model learns geometry of meaning.

// Word Embedding Space — 2D Projection 768 dimensions in practice
Sample Embedding Vectors (d_model = 8, simplified)
Token | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 | dim_6 | dim_7
Why Positional Encoding?
Without Positional Encoding:
"Dog bites man" = "Man bites dog"
The model sees the same bag of tokens and can't tell them apart.
With Positional Encoding:
Each position gets a unique sinusoidal "fingerprint" added to the token embedding.
Positional Encoding Formula
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension uses a different frequency sine/cosine wave. Low dimensions oscillate fast (capture local order); high dimensions oscillate slowly (capture global position).

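The sinusoidal formula above translates directly into a few lines of NumPy. A minimal sketch (assuming an even d_model; the function name `positional_encoding` is ours):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal PE: sin at even dimensions, cos at odd dimensions.
    Assumes d_model is even."""
    pos = np.arange(max_len)[:, None]              # positions: (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # even dim indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe
```

Note there are no learned parameters: position 0 always encodes to (0, 1, 0, 1, ...), and every position gets a distinct pattern across the dimensions.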
Chapter 04

The Core — Self-Attention

This is the heart of the Transformer. Every token looks at every other token and decides how much to "attend" to each one.

Query, Key, Value — The Library Analogy
Q
Query
"What am I looking for?" — You walking into the library with a request. Computed as: Input × W_Q
K
Key
"What do I have?" — Each book's label on the shelf advertising its contents. Computed as: Input × W_K
V
Value
"What do I actually give you?" — The actual content of the book once you check it out. Computed as: Input × W_V
The Scaled Dot-Product Attention Formula
Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Q · Kᵀ
Attention Scores
Dot product between every Query and every Key. High score = high relevance.
÷ √d_k
Scaling
Prevents dot products from getting too large (which causes tiny gradients). d_k = 64 in the base model.
softmax
Normalization
Converts raw scores into probabilities (0 to 1, summing to 1). These are the attention weights.
· V
Weighted Sum
Multiply weights by Values. High-attention tokens contribute more to the output context vector.
// Interactive Self-Attention — click a word to see its attention weights Sentence: "The cat sat on the mat"
Step-by-Step Attention Computation
1
Compute Q, K, V matrices
Q = X · W_Q    shape: (seq_len × d_k)
K = X · W_K    shape: (seq_len × d_k)
V = X · W_V    shape: (seq_len × d_v)
// In base model: d_k = d_v = 64, d_model = 512
2
Compute raw attention scores
scores = Q · Kᵀ    shape: (seq_len × seq_len)
// Every token attends to every other token — O(n²) complexity
3
Scale + Softmax
scaled = scores / √d_k    // √64 = 8
weights = softmax(scaled)    // sums to 1 per row
4
Weighted sum of Values
output = weights · V    shape: (seq_len × d_v)
// Each token now has a context-aware representation
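The four steps above fit in one short NumPy function. This is a sketch of the formula only (no masking, no batching, no learned projections — X is assumed to already be the Q/K/V matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step 2+3a: (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability trick
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # step 3b: rows sum to 1
    return weights @ V, weights                    # step 4: weighted sum of Values
```

Each row of `weights` is one token's attention distribution over the whole sequence — exactly what the heatmap widget visualizes.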
Chapter 05

Multi-Head Attention

Instead of one attention function, we run h=8 attention heads in parallel. Each head learns to focus on different aspects of language.

// 8 Attention Heads — each learns different patterns d_k = d_model / h = 512/8 = 64 per head
Multi-Head Formula
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wᴼ
where head_i = Attention(Q·Wᵢ_Q, K·Wᵢ_K, V·Wᵢ_V)
Why multiple heads?
A single head averages all attention — it misses nuance. Multiple heads can specialize: one for syntax, one for semantics, one for coreference resolution.
The paper's ablation study (Table 3) found 8 heads a sweet spot: a single head scored about 0.9 BLEU worse, and quality also dropped with too many heads. Too few — can't capture all relationships. Too many — dimensions per head become too small.
ABLATION: NUMBER OF HEADS (Table 3)
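The split-into-heads, attend-in-parallel, concatenate-and-project pipeline can be sketched as follows. This is our illustrative NumPy version (function names `split_heads` and `multi_head_attention` are ours; real implementations also add masking and batching):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, h):
    """(seq_len, d_model) -> (h, seq_len, d_k) with d_k = d_model // h."""
    seq_len, d_model = x.shape
    return x.reshape(seq_len, h, d_model // h).transpose(1, 0, 2)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h=8):
    # Project once, then split d_model=512 into h=8 heads of d_k=64
    Q, K, V = (split_heads(X @ W, h) for W in (Wq, Wk, Wv))
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, n, n)
    heads = weights @ V                                         # (h, n, d_k)
    # Concat heads back to (n, d_model), then apply the output projection Wᴼ
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ Wo
```

Note the total compute is about the same as one full-width head — the model dimension is split across heads, not multiplied by them.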
Chapter 06

Full Architecture

The complete Transformer: an Encoder stack and a Decoder stack, connected through cross-attention.

ENCODER — N=6 identical layers
🔍
Multi-Head Self-Attention
All positions attend to all positions
d_model=512
Add & Norm
Residual connection + LayerNorm
Feed-Forward Network
FFN(x) = max(0, xW₁+b₁)W₂+b₂
d_ff=2048
Add & Norm
Second residual + LayerNorm
DECODER — N=6 identical layers
🔒
Masked Multi-Head Self-Attention
Can't see future tokens (causal mask)
masked
Add & Norm
Residual + LayerNorm
🔗
Cross-Attention
Q from decoder, K/V from encoder output
cross
Feed-Forward Network
Same as encoder FFN
d_ff=2048
Key Hyperparameters — Base vs Big Model
Parameter       | Symbol   | Base Model | Big Model | Meaning
Layers          | N        | 6          | 6         | Stack depth
Model dimension | d_model  | 512        | 1024      | Embedding size
FFN inner dim   | d_ff     | 2048       | 4096      | 4× d_model
Attention heads | h        | 8          | 16        | Parallel heads
Key/Value dim   | d_k, d_v | 64         | 64        | d_model / h
Dropout         | P_drop   | 0.1        | 0.3       | Regularization
Parameters      |          | 65M        | 213M      | Total trainable
Chapter 07

Training — Details

The optimizer, learning rate schedule, and regularization that made the Transformer work.

Optimizer
Adam (β₁=0.9, β₂=0.98, ε=10⁻⁹)

Standard Adam with an unusual β₂=0.98 — lower than the common default of 0.999, so the second-moment estimate has a shorter effective history and adapts faster during warmup. The small ε prevents division by zero.

Learning Rate Schedule
lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5))

Increase linearly for first 4,000 steps (warmup), then decrease proportionally to inverse square root.
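The schedule is simple enough to write out directly. A minimal sketch (the function name `transformer_lr` is ours):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup to the peak at `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid step^(-0.5) blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` cross exactly at step = warmup, which is where the learning rate peaks (≈7×10⁻⁴ for the base model).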

Residual Dropout
Applied to output of each sub-layer, and to embedding+PE sums. P_drop=0.1 for base model.
Label Smoothing
ε_ls=0.1. Instead of hard 0/1 targets, the correct class gets probability 1−ε = 0.9 and the remaining 0.1 is spread across the rest of the vocabulary. Hurts perplexity but improves accuracy and BLEU.
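One common way to implement this smoothing (a sketch; the function name `smooth_labels` and the exact way the mass is distributed over non-target classes vary between implementations):

```python
import numpy as np

def smooth_labels(target, vocab_size, eps=0.1):
    """Correct class gets 1 - eps; eps is spread uniformly over the others."""
    dist = np.full(vocab_size, eps / (vocab_size - 1))
    dist[target] = 1.0 - eps
    return dist
```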
Hardware
8× NVIDIA P100 GPUs. Base model: 100K steps, ~12 hours. Big model: 300K steps, 3.5 days.
Chapter 08

Deep Dive — How Transformers Actually Scale

This section connects the paper to modern large language model behavior: objective, inference loop, and where quality gains come from in practice.

From Training Objective to Generation
  • Pretraining objective: predict next token using cross-entropy loss over massive corpora.
  • Gradient signal: each token prediction updates shared weights, teaching syntax + semantics jointly.
  • Instruction tuning: post-training adapts base model to follow human prompts and formats.
  • Sampling at inference: temperature, top-k, and top-p convert logits into controlled generation behavior.
  • Emergent behavior: scaling data, parameters, and compute often unlocks capabilities nonlinearly.
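The sampling step in the list above can be made concrete. A hedged sketch of temperature plus top-k decoding (one common strategy among several; `sample_next_token` is our name, and real decoders add top-p, repetition penalties, etc.):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Turn next-token logits into one sampled token id."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                  # k-th highest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())                 # softmax over survivors
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Low temperature sharpens the distribution toward the top logit; top_k=1 reduces to greedy decoding.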
Inference Pipeline (Token by Token)
Step A
Prompt tokens enter the model and produce logits for the next token.
Step B
Decoding strategy selects one token, appends it, and repeats.
Step C
KV cache stores past attention states to avoid recomputing old tokens each step.
Practical implication: latency and memory become first-class constraints at long context lengths.
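The KV cache idea in Step C can be illustrated with a toy class: each generation step appends one new key/value row and attends over everything cached so far, instead of recomputing attention for the whole prefix. This is a single-head sketch with our own class name `KVCache`; production caches are batched, multi-head tensors.

```python
import numpy as np

class KVCache:
    """Toy KV cache: keep past K/V rows so each new token attends cheaply."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        # Cache this step's key/value; old rows are never recomputed.
        self.K.append(k)
        self.V.append(v)
        K, V = np.stack(self.K), np.stack(self.V)   # (t, d_k)
        scores = K @ q / np.sqrt(len(q))            # attend over all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                # context vector for this step
```

This is why long contexts stress memory: the cache grows linearly with every generated token, per layer and per head.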
Why Quality Improves with Scale (Intuition)
More Data
Expands language coverage and reduces blind spots in rare patterns.
More Parameters
Increases representational capacity for nuanced concepts and relations.
More Compute
Lets optimization run longer and deeper, improving final perplexity and utility.
Better Post-Training
Alignment and instruction tuning convert raw capability into practical behavior.
Chapter 09

LLM Lens — Missing Intuition Added

Added from modern Transformer explainers: next-token probabilities, scaling law intuition, and how pretraining transfers into downstream tasks.

// Next Token Probability Lab
Autoregressive generation (one token at a time)
Why "Large" Matters
As parameter count grows, models often cross capability thresholds. Some tasks stay near-zero, then jump sharply once scale is sufficient.
Pretraining → Fine-Tuning → Transfer
1) Pretraining learns general language statistics. 2) Fine-tuning adapts to a task. 3) Transfer learning reuses the same base model across many tasks.
Context Vectors, Not Static Tokens
A token embedding is static. Self-attention transforms it into a context vector that changes with sentence context.
Mask Before Softmax
Causal masking should set future positions to −∞ before softmax, preventing probability leakage into future tokens.
Dropout in Attention
Attention dropout regularizes by randomly dropping valid attention links during training while preserving causal constraints.
Causal Masking Lab (Before vs After Softmax)
This visual shows why masking must happen before softmax. If future tokens participate in softmax first, they steal probability mass even if you zero them later.
// Decoder Masking Simulator
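The "before vs after" argument can be checked numerically in a few lines (a self-contained demo with made-up scores):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 3.0]])  # query at position 0; positions 1, 2 are future

# Wrong order: softmax first, zero the future later.
# The future tokens already "stole" probability mass, so the row no longer sums to 1.
wrong = softmax(scores)
wrong[:, 1:] = 0.0

# Right order: mask with -inf BEFORE softmax (exp(-inf) = 0),
# so all probability mass stays on the visible tokens.
masked = scores.copy()
masked[:, 1:] = -np.inf
right = softmax(masked)
```

With these scores, `wrong` keeps only ≈0.24 of the probability mass, while `right` correctly assigns all of it to the one visible token.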
Normalization Lab: Simple vs Softmax
Based on the blog's examples, this shows why softmax creates sharper, valid probability distributions and handles negative scores safely.
// Attention Score Normalization
Simple normalize x/sum(x)
Softmax exp(x)/sum(exp(x))
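The two normalizations above differ visibly on a score vector with a negative entry (a self-contained demo with made-up scores):

```python
import numpy as np

scores = np.array([6.0, 2.0, -1.0])

# Simple normalization: breaks with negative scores — entries can leave [0, 1]
simple = scores / scores.sum()

# Softmax: always a valid distribution (positive, sums to 1), and sharper —
# exponentiation amplifies the gap between the top score and the rest
soft = np.exp(scores) / np.exp(scores).sum()
```

Here `simple` contains a negative "probability" for the −1 score, while softmax maps it to a small positive weight and concentrates ≈98% of the mass on the top score.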
Chapter 10

Practical Limits and Modern Fixes

The original Transformer is brilliant, but not perfect. Here are the major trade-offs and what modern architectures do to improve them.

Quadratic Attention Cost
Self-attention compares every token with every token, so memory and compute scale as O(n²). Long contexts become expensive.
Context Length Ceiling
Longer context windows require larger KV caches and higher latency. This affects both training cost and inference throughput.
Data + Alignment Needs
Raw scale alone is not enough. Data quality, instruction tuning, and alignment methods strongly affect usefulness and safety.
How Newer Models Address This
FlashAttention
Faster, memory-efficient exact attention kernels.
Grouped-Query Attention
Reduces KV cache size for faster generation.
Mixture of Experts
Activates only parts of model per token for better efficiency.
Long-context strategies
RoPE scaling, sliding-window attention, retrieval augmentation.
Chapter 11

Results — State of the Art

The Transformer outperformed every previous model while using a fraction of the compute.

Transformer (big) — EN→DE
28.4
BLEU Score
Previous best (GNMT Ensemble)
26.3
BLEU Score — EN→DE
Transformer (big) — EN→FR
41.8
BLEU Score
Previous best (ConvS2S Ensemble)
41.3
BLEU Score — EN→FR
BLEU Score Comparison — EN→DE (All Models)
Self-Attention vs Recurrent vs Convolutional (Table 1)
Layer Type             | Complexity/Layer | Sequential Ops | Max Path Length | Winner?
Self-Attention         | O(n² · d)        | O(1)           | O(1)            | ✓ Best
Recurrent              | O(n · d²)        | O(n)           | O(n)            |
Convolutional          | O(k · n · d²)    | O(1)           | O(log_k n)      | ~
Self-Attn (restricted) | O(r · n · d)     | O(1)           | O(n/r)          | ~

n = sequence length, d = dimension, k = kernel size, r = restriction neighborhood. Self-attention wins on path length (O(1) means direct connection!) but has quadratic complexity in sequence length.

Test Yourself

Quick Quiz

Check your understanding of the key concepts from the paper.

Reference

All Key Formulas

A complete reference of every important formula from the paper.

Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Core attention mechanism. Scale by √d_k to prevent gradient saturation. Apply softmax to get weights, multiply by values.
Multi-Head Attention
MultiHead(Q,K,V) = Concat(head₁,...,headₕ)Wᴼ
head_i = Attention(QWᵢQ, KWᵢK, VWᵢV)
h=8 attention heads run in parallel. Each projects to d_k=64 dimensions. Concat output projected back to d_model=512.
Feed-Forward Network
FFN(x) = max(0, xW₁+b₁)W₂+b₂
Two linear layers with ReLU activation. d_model=512 → d_ff=2048 → d_model=512. Applied position-wise.
Positional Encoding
PE(pos,2i) = sin(pos/10000^(2i/d_model))
PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
Sine/cosine at different frequencies. Added to token embeddings. Unique fingerprint for each position. No learned parameters.
Sub-Layer Output (Residual)
Output = LayerNorm(x + Sublayer(x))
Residual connection ensures gradient flows even through deep networks. Layer norm stabilizes training. Applied after every sub-layer.
Learning Rate Schedule
lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup^(-1.5))
Linear warmup for 4000 steps, then inverse square root decay. Prevents exploding gradients early in training.
Checklist: If You Understand This Page, You Can Explain
Why recurrence was removed and what was gained from parallelism.
How tokenization, embeddings, and positional encoding work together.
How Q, K, V produce attention weights and context vectors.
Why multi-head attention improves expressiveness.
How encoder-decoder cross-attention enables translation.
Where Transformers still struggle and how modern variants help.
Deployment

Publish This Site to GitHub Pages

Everything is set up for static hosting. Use the commands below to push this project and publish it with GitHub Pages.

Quick Publish Workflow
1) Initialize and push repo
Run these commands in your project folder to publish this exact repository.
git init
git add .
git commit -m "feat: product-ready attention explainer"
git branch -M main
git remote add origin https://github.com/AdilShamim8/PaperMap.git
git push -u origin main
2) Enable GitHub Pages
On GitHub: Repo Settings → Pages → Build and deployment
Set Source to Deploy from a branch, select main and /(root), then Save.
https://adilshamim8.github.io/PaperMap/
Product note: a root index.html file is included so GitHub Pages works out of the box.