Data Shape Journey
The Transformer is a shapeshifting machine. Data doesn't just flow through it; it constantly changes form. This page visualizes exactly how the data looks at every single step of the process.
Context: Processing a Sequence
The breakdown below shows what happens to a single token.
In reality, though, the model processes the whole sequence (the context window) at once.
Example: "The cat sits on the" (Predicting: "mat")
The model pushes all 5 token vectors through the layers simultaneously. We focus on the last vector (highlighted) because its final transformation is what predicts the next word ("mat").
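A minimal sketch of that shape, using GPT-3-scale dimensions (5 tokens, 12,288 numbers each) and random values in place of real embeddings:

```python
import numpy as np

# Hypothetical illustration: one row per token of "The cat sits on the".
# The values are random; only the shapes matter here.
seq_len, d_model = 5, 12288
x = np.random.randn(seq_len, d_model)

last = x[-1]  # the highlighted last vector ("the")
print(x.shape, last.shape)  # (5, 12288) (12288,)
```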
Input: Text
We follow the last token ("the") because its output is what produces the next word.
Tokenization & Embedding
The token is converted into a vector of 12,288 numbers.
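In code, this is just a table lookup. The sketch below uses a random, untrained table and a toy vocabulary of 1,000 entries (GPT-3's is about 50,257); the token id 262 is a made-up example:

```python
import numpy as np

# Embedding = look up one row of a big table. Random stand-in table here.
vocab_size, d_model = 1000, 12288  # toy vocab; real one is ~50,257 entries
embedding_table = np.random.randn(vocab_size, d_model)

token_id = 262                      # hypothetical id for "the"
vector = embedding_table[token_id]  # one row: 12,288 numbers
print(vector.shape)                 # (12288,)
```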
Attention Heads (Split)
The big vector is sliced into 96 smaller pieces (128 numbers each) to process context in parallel.
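The slicing itself is just a reshape: 12,288 divided by 96 heads gives 128 numbers per head. (A sketch of the shape only; the full attention step also involves Q/K/V projections not shown here.)

```python
import numpy as np

# Split one 12,288-wide vector into 96 heads of 128 numbers each.
d_model, n_heads = 12288, 96
head_dim = d_model // n_heads        # 12288 / 96 = 128
vector = np.random.randn(d_model)
heads = vector.reshape(n_heads, head_dim)
print(heads.shape)                   # (96, 128)
```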
Concatenation (Merge)
The 96 independent strands of thought are merged back into one rich vector.
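The merge is the inverse reshape: 96 pieces of 128 numbers flatten back into one 12,288-wide vector.

```python
import numpy as np

# Concatenate the 96 per-head outputs (128 numbers each) back into
# a single 12,288-wide vector.
n_heads, head_dim = 96, 128
head_outputs = np.random.randn(n_heads, head_dim)
merged = head_outputs.reshape(-1)   # flatten: 96 * 128 = 12288
print(merged.shape)                 # (12288,)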
Processing (FFN + Add&Norm)
The vector is processed by the Feed Forward Network and normalized. This completes one "Block".
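A sketch of this step, with dimensions scaled down so it runs cheaply (the real model uses 12,288-wide vectors and a hidden layer roughly 4x wider); the weights are random stand-ins, and the learned scale/shift of the norm is omitted:

```python
import numpy as np

d_model, d_ff = 128, 512             # stand-ins for 12,288 and ~49,152
W1 = np.random.randn(d_model, d_ff) * 0.05
W2 = np.random.randn(d_ff, d_model) * 0.05

def layer_norm(v, eps=1e-5):
    # Normalize to zero mean / unit variance.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

x = np.random.randn(d_model)
ffn_out = np.maximum(x @ W1, 0.0) @ W2   # Linear -> ReLU -> Linear
x = layer_norm(x + ffn_out)              # Add (residual) & Norm
print(x.shape)                           # same shape in as out
```

The key point: the vector that comes out has exactly the same shape as the one that went in, which is what lets blocks stack.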
The Stack Loop (Repeat N ×)
The data shape stays the same (a vector of 12,288) but gets refined 96 times.
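That shape invariant is what makes the stack possible: each block maps a (12288,) vector to a (12288,) vector, so 96 of them can be chained. A sketch with a placeholder block:

```python
import numpy as np

d_model, n_layers = 12288, 96

def block(v):
    # Placeholder for Attention + FFN + Add&Norm; shape-preserving.
    return v + 0.0

x = np.random.randn(d_model)
for _ in range(n_layers):
    x = block(x)                     # values change, shape never does
print(x.shape)                       # (12288,)
```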
Unembedding (Linear)
The vector is expanded into a score (logit) for every word in the vocabulary.
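The unembedding is a single linear layer. Scaled down here to a toy vocabulary of 1,000 words (about 50,257 in GPT-3), with random stand-in weights:

```python
import numpy as np

# Unembedding: project the final 12,288-vector to one score per word.
d_model, vocab_size = 12288, 1000   # toy vocab; real one is ~50,257
W_unembed = np.random.randn(d_model, vocab_size) * 0.01
x = np.random.randn(d_model)
logits = x @ W_unembed
print(logits.shape)                 # (1000,): one score per word
```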
Softmax Output
The scores are converted into percentage chances that sum to 100%.
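Softmax is the function that does this conversion. A minimal sketch with three made-up scores:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])     # hypothetical raw scores
probs = softmax(scores)
print(probs.sum())                     # 1.0 (i.e. 100%)
```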
Final Selection
We pick the word with the highest chance, and the loop restarts.
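Picking the highest-probability word is an argmax over the softmax output. The vocabulary and probabilities below are made up for illustration; real systems often sample from the distribution instead of always taking the top choice:

```python
import numpy as np

# Greedy selection: take the most likely word, then the loop restarts
# with this word appended to the input sequence.
vocab = ["mat", "dog", "moon"]
probs = np.array([0.90, 0.07, 0.03])   # hypothetical % chances
next_word = vocab[int(probs.argmax())]
print(next_word)                       # mat
```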