Attention
You are here: The Processing Core. This block repeats N times (in the original Transformer paper N = 6; in GPT-3, N = 96).
- Why repeat? Understanding deepens with depth. Layer 1 might only associate "Bank" with "Money"; by layer 96 the model captures many nuances of the financial context.
- Why "Multi-Head"? GPT-3 runs 96 attention heads in parallel (coincidentally, this model has 96 layers and 96 heads). This lets one head focus on grammar while another independently tracks earlier names or dates.
Self-Attention: Theory
The core power of the Transformer is Self-Attention. It allows the model to look at all previous words in the prompt to understand the current context.
Let's say we are processing a sequence of tokens (X), e.g., "The cat sat". Each token is a vector of size 12,288 (in GPT-3).
For each attention head, we have three distinct learned weight matrices: WQ (Query), WK (Key), and WV (Value).
We project the input X into three new vector spaces by multiplying with the weight matrices:
- Query (Q) = X · WQ — What the token is "looking for".
- Key (K) = X · WK — What defines the token (for matching).
- Value (V) = X · WV — The actual information to be passed along.
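The three projections above can be sketched in a few lines of NumPy. This is a toy illustration, not GPT-3's real weights: the dimensions are tiny and the matrices are random stand-ins for learned parameters.

```python
import numpy as np

# Toy sketch: 3 tokens ("The cat sat"), model dimension d_model=8,
# head dimension d_k=4 (GPT-3 uses d_model=12288).
rng = np.random.default_rng(0)
d_model, d_k = 8, 4

X = rng.standard_normal((3, d_model))      # the 3 input token vectors

W_Q = rng.standard_normal((d_model, d_k))  # learned in a real model
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers for matching
V = X @ W_V   # the information each token passes along

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Each token gets its own row in Q, K, and V, so the projections for the whole sequence are computed in one matrix multiply.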
The model determines relevance by taking the dot product between the Query and all Keys.
We divide the scores by the square root of the head dimension (√dk) to keep gradients stable. These scaled scores can be any real number (negative, positive, huge), so we pass them through the Softmax function.
- All values become positive (0 to 1).
- All values sum up to exactly 1 (100%).
This result is what we call the Attention Weights.
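The scoring, scaling, and Softmax steps can be sketched as follows. Q and K here are random stand-ins for the projected Query/Key matrices; the point is the shape of the computation, not the values.

```python
import numpy as np

# Scaled dot-product attention weights for 3 tokens with d_k=4.
rng = np.random.default_rng(1)
d_k = 4
Q = rng.standard_normal((3, d_k))
K = rng.standard_normal((3, d_k))

scores = Q @ K.T / np.sqrt(d_k)   # relevance of every token to every other

# Softmax row by row: all values become positive and each row sums to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.sum(axis=-1))   # each row sums to 1.0
```

Subtracting the row maximum before exponentiating is the standard numerically stable way to compute Softmax; it does not change the result.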
In GPT (Decoder-only), a word can only look back at previous words. It is forbidden from seeing future words. We achieve this by manually setting future scores to -∞ before Softmax.
Visualizing the Mask (Triangular)
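A minimal sketch of the triangular causal mask, using toy values: entries of 0 mark allowed positions, −∞ marks forbidden "future" positions, and after Softmax the −∞ entries become exactly 0.

```python
import numpy as np

# Causal mask for 4 tokens: 0 where attention is allowed,
# -inf where a token would peek at the future.
n = 4
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.ones((n, n))            # pretend all raw scores are equal
masked = scores + mask              # future positions forced to -inf
w = np.exp(masked)
w /= w.sum(axis=-1, keepdims=True)  # Softmax: exp(-inf) = 0

print(np.round(w, 2))
# Row i spreads its attention evenly over tokens 0..i and gives
# exactly 0 weight to every future token.
```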
Interactive Self-Attention
[Interactive demo: select a word (e.g., "cat"). The panel shows its three roles: the Query (Q) representing what "cat" is seeking, the Keys (K) of all words answering that query, and the resulting Attention Weights, Softmax(S_scaled) — the focus % placed on each word.]
Multi-Head Output
Combining the Heads
The process above happens h times in parallel (e.g., 96 heads).
The outputs of all heads are concatenated into one long vector and then projected back using a final output weight matrix WO: MultiHead(X) = Concat(head_1, …, head_h) · WO.
This final mixed vector is the "New Representation" that is passed on to the Feed-Forward Network (FFN).
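The concatenate-and-project step can be sketched like this. The head outputs and WO are random toy stand-ins; only the shapes matter.

```python
import numpy as np

# Combine h=2 toy heads (d_k=4 each) back into d_model=8 via W_O.
rng = np.random.default_rng(2)
h, d_k, d_model = 2, 4, 8

head_outputs = [rng.standard_normal((3, d_k)) for _ in range(h)]  # per-head results

concat = np.concatenate(head_outputs, axis=-1)   # (3, h*d_k) = (3, 8)
W_O = rng.standard_normal((h * d_k, d_model))    # learned output projection
out = concat @ W_O                               # (3, 8): one mixed vector per token

print(out.shape)  # (3, 8)
```

Note that h · d_k equals d_model, so the projection returns each token to the model's working dimension before the FFN.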
Feed Forward & Add/Norm
- Add & Norm: adds the sublayer's input back to its output (Residual Connection) to keep the signal strong, then applies Layer Normalization to stabilize it.
- Feed-Forward Neural Network: the "processing brain" of the layer. It processes each word's vector individually (position-wise) to digest the information gathered by attention.
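The two bullets above can be sketched for a single token. This is a simplified assumption-laden version: toy dimensions, random weights, ReLU instead of GPT's GELU, and the post-norm ordering of the original paper (GPT-style models actually normalize before each sublayer).

```python
import numpy as np

# One token through the position-wise FFN plus Add & Norm.
# GPT-3's real sizes are d_model=12288 and d_ff=4*d_model.
rng = np.random.default_rng(3)
d_model, d_ff = 8, 32

def layer_norm(v, eps=1e-5):
    # Normalize a single vector to zero mean and unit variance.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

x = rng.standard_normal(d_model)            # token vector entering the sublayer
W1 = rng.standard_normal((d_model, d_ff))   # expand
W2 = rng.standard_normal((d_ff, d_model))   # contract

ffn = np.maximum(0.0, x @ W1) @ W2          # two-layer MLP, applied per token
out = layer_norm(x + ffn)                   # residual connection, then normalize

print(out.shape)  # (8,)
```

Because the FFN sees one token vector at a time, it cannot mix information across positions; that mixing is attention's job.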
Researchers often interpret the Feed-Forward layers as a massive Key-Value memory.
- Attention figures out which words are relevant (context).
- Feed-Forward retrieves facts or patterns associated with those words.
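The key-value-memory reading of the FFN can be made concrete. In this sketch (random toy weights, ReLU activation), each hidden unit's input weights act as a stored "key" pattern and its output weights as the associated "value"; the FFN adds each value in proportion to how strongly the input matches its key.

```python
import numpy as np

# FFN as a key-value memory with n_memories hidden units.
rng = np.random.default_rng(4)
d_model, n_memories = 8, 4

keys   = rng.standard_normal((n_memories, d_model))  # rows of W1 (transposed)
values = rng.standard_normal((n_memories, d_model))  # rows of W2

x = rng.standard_normal(d_model)        # incoming token representation
match = np.maximum(0.0, keys @ x)       # how strongly x triggers each memory
out = match @ values                    # weighted sum of the stored values

# This is exactly the usual FFN with W1 = keys.T and W2 = values:
assert np.allclose(out, np.maximum(0.0, x @ keys.T) @ values)
print(out.shape)  # (8,)
```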
Deep Understanding: the output of this Feed-Forward network isn't the final answer yet. It becomes the input to the next layer. This cycle repeats N times, refining the "Bank" concept from a simple association with money into "Financial Institution".