Attention
You are here: The Processing Core. This block repeats N times (in the original Transformer paper N = 6; in GPT-3, N = 96).
- Why repeat? Understanding deepens with depth. Layer 1 might only associate "Bank" with "Money"; by layer 96 the model captures many nuances of the financial context.
- Why "Multi-Head"? GPT-3 runs 96 attention heads in parallel (coincidentally, this model has 96 layers and 96 heads). This lets one head focus on grammar while another independently tracks earlier names or dates.
Self-Attention: Theory
The core power of the Transformer is Self-Attention. It allows the model to look at all previous words in the prompt to understand the current context.
Let's say we are processing a sequence of tokens (X), e.g., "The cat sat". Each token is a vector of size 12,288 (in GPT-3).
For each attention head, we have three distinct learned weight matrices: WQ (Query), WK (Key), and WV (Value).
We project the input X into three new vector spaces by multiplying with the weight matrices:
- Query (Q) = X · WQ — What the token is "looking for".
- Key (K) = X · WK — What defines the token (for matching).
- Value (V) = X · WV — The actual information to be passed along.
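The three projections above can be sketched in a few lines of NumPy. This is a toy illustration, not GPT-3's real weights: the dimensions are tiny and the matrices are random stand-ins for learned parameters.

```python
import numpy as np

# Toy sketch: 3 tokens ("The cat sat"), model dimension d_model=8,
# head dimension d_k=4 (GPT-3 uses d_model=12288).
rng = np.random.default_rng(0)
d_model, d_k = 8, 4

X = rng.standard_normal((3, d_model))      # the 3 input token vectors

W_Q = rng.standard_normal((d_model, d_k))  # learned in a real model
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers for matching
V = X @ W_V   # the information each token passes along

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Each token gets its own row in Q, K, and V, so the projections for the whole sequence are computed in one matrix multiply.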
The model determines relevance by taking the dot product between the Query and all Keys.
We divide the scores by the square root of the head dimension (√dk) to keep gradients stable. These scaled scores can be any real number (negative, positive, huge), so we pass them through the Softmax function.
- All values become positive (0 to 1).
- All values sum up to exactly 1 (100%).
This result is what we call the Attention Weights.
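The scoring, scaling, and Softmax steps can be sketched as follows. Q and K here are random stand-ins for the projected Query/Key matrices; the point is the shape of the computation, not the values.

```python
import numpy as np

# Scaled dot-product attention weights for 3 tokens with d_k=4.
rng = np.random.default_rng(1)
d_k = 4
Q = rng.standard_normal((3, d_k))
K = rng.standard_normal((3, d_k))

scores = Q @ K.T / np.sqrt(d_k)   # relevance of every token to every other

# Softmax row by row: all values become positive and each row sums to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.sum(axis=-1))   # each row sums to 1.0
```

Subtracting the row maximum before exponentiating is the standard numerically stable way to compute Softmax; it does not change the result.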
In GPT (Decoder-only), a word can only look back at previous words. It is forbidden from seeing future words. We achieve this by manually setting future scores to -∞ before Softmax.
Visualizing the Mask (Triangular)
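A minimal sketch of the triangular causal mask, using toy values: entries of 0 mark allowed positions, −∞ marks forbidden "future" positions, and after Softmax the −∞ entries become exactly 0.

```python
import numpy as np

# Causal mask for 4 tokens: 0 where attention is allowed,
# -inf where a token would peek at the future.
n = 4
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.ones((n, n))            # pretend all raw scores are equal
masked = scores + mask              # future positions forced to -inf
w = np.exp(masked)
w /= w.sum(axis=-1, keepdims=True)  # Softmax: exp(-inf) = 0

print(np.round(w, 2))
# Row i spreads its attention evenly over tokens 0..i and gives
# exactly 0 weight to every future token.
```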
Interactive Self-Attention
[Interactive demo: select a word (e.g., "cat"). The panel shows its three roles: the Query (Q) representing what "cat" is seeking, the Keys (K) of all words answering that query, and the resulting Attention Weights, Softmax(S_scaled) — the focus % placed on each word.]
Multi-Head Output
Combining the Heads
The process above happens h times in parallel (e.g., 96 heads).
The outputs of all heads are concatenated into one long vector and then projected back using a final output weight matrix WO: MultiHead(X) = Concat(head_1, …, head_h) · WO.
This final mixed vector is the "New Representation" that is passed on to the Feed-Forward Network (FFN).
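The concatenate-and-project step can be sketched like this. The head outputs and WO are random toy stand-ins; only the shapes matter.

```python
import numpy as np

# Combine h=2 toy heads (d_k=4 each) back into d_model=8 via W_O.
rng = np.random.default_rng(2)
h, d_k, d_model = 2, 4, 8

head_outputs = [rng.standard_normal((3, d_k)) for _ in range(h)]  # per-head results

concat = np.concatenate(head_outputs, axis=-1)   # (3, h*d_k) = (3, 8)
W_O = rng.standard_normal((h * d_k, d_model))    # learned output projection
out = concat @ W_O                               # (3, 8): one mixed vector per token

print(out.shape)  # (3, 8)
```

Note that h · d_k equals d_model, so the projection returns each token to the model's working dimension before the FFN.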
Feed Forward & Add/Norm
- Add & Norm: adds the sublayer's input back to its output (Residual Connection) to keep the signal strong, then applies Layer Normalization to stabilize it.
- Feed-Forward Neural Network: the "processing brain" of the layer. It processes each word's vector individually (position-wise) to digest the information gathered by attention.
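The two bullets above can be sketched for a single token. This is a simplified assumption-laden version: toy dimensions, random weights, ReLU instead of GPT's GELU, and the post-norm ordering of the original paper (GPT-style models actually normalize before each sublayer).

```python
import numpy as np

# One token through the position-wise FFN plus Add & Norm.
# GPT-3's real sizes are d_model=12288 and d_ff=4*d_model.
rng = np.random.default_rng(3)
d_model, d_ff = 8, 32

def layer_norm(v, eps=1e-5):
    # Normalize a single vector to zero mean and unit variance.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

x = rng.standard_normal(d_model)            # token vector entering the sublayer
W1 = rng.standard_normal((d_model, d_ff))   # expand
W2 = rng.standard_normal((d_ff, d_model))   # contract

ffn = np.maximum(0.0, x @ W1) @ W2          # two-layer MLP, applied per token
out = layer_norm(x + ffn)                   # residual connection, then normalize

print(out.shape)  # (8,)
```

Because the FFN sees one token vector at a time, it cannot mix information across positions; that mixing is attention's job.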
Researchers often interpret the Feed-Forward layers as a massive Key-Value memory.
- Attention figures out which words are relevant (context).
- Feed-Forward retrieves facts or patterns associated with those words.
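The key-value-memory reading of the FFN can be made concrete. In this sketch (random toy weights, ReLU activation), each hidden unit's input weights act as a stored "key" pattern and its output weights as the associated "value"; the FFN adds each value in proportion to how strongly the input matches its key.

```python
import numpy as np

# FFN as a key-value memory with n_memories hidden units.
rng = np.random.default_rng(4)
d_model, n_memories = 8, 4

keys   = rng.standard_normal((n_memories, d_model))  # rows of W1 (transposed)
values = rng.standard_normal((n_memories, d_model))  # rows of W2

x = rng.standard_normal(d_model)        # incoming token representation
match = np.maximum(0.0, keys @ x)       # how strongly x triggers each memory
out = match @ values                    # weighted sum of the stored values

# This is exactly the usual FFN with W1 = keys.T and W2 = values:
assert np.allclose(out, np.maximum(0.0, x @ keys.T) @ values)
print(out.shape)  # (8,)
```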
Deep Understanding: the output of this Feed-Forward network isn't the final answer yet. It becomes the input to the next layer. This cycle repeats N times, refining the "Bank" concept from a simple association with money into "Financial Institution".