Data Shape Journey

The Transformer is a shapeshifting machine. Data doesn't just flow through it; it constantly changes form. This page visualizes exactly how the data looks at every single step of the process.

Context: Processing a Sequence

The breakdown below shows what happens to a single token.

But in reality, the model processes a whole sequence at once: the context window.

Example: "The cat sits on the" (Predicting: "mat")

Input Matrix [5 Tokens × 12,288 Dimensions]
The
Vector 1
cat
Vector 2
sits
Vector 3
on
Vector 4
the
Vector 5

The model passes all 5 vectors through the layers simultaneously. We focus on the last vector (highlighted) because its final transformation will predict the next word ("mat").
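This can be sketched with a small array. A minimal illustration using numpy, with a toy d_model of 8 so the arrays stay readable (GPT-3's real d_model is 12,288):

```python
import numpy as np

# Toy version of the context window: 5 tokens, d_model = 8.
# (GPT-3 uses d_model = 12,288.)
d_model = 8
tokens = ["The", "cat", "sits", "on", "the"]
X = np.random.randn(len(tokens), d_model)  # input matrix [5 × d_model]

# All rows flow through the layers together; only the last row's
# final state is used to predict the next token.
last = X[-1]  # the vector for the final "the", shape [d_model]
```

Every layer preserves this [tokens × d_model] shape, which is why the same block can be stacked many times.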

1

Input: Text

String
"the"

We follow the last token ("the") because its position is the one that produces the next word.

2

Tokenization & Embedding

Vector [1 × 12,288]
0.12
-0.5
...
0.99

The token becomes a list of 12,288 numbers.
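Under the hood this is a row lookup in an embedding table. A hedged sketch with illustrative names (the token id and table are made up, not the real GPT tokenizer or weights), using a toy d_model:

```python
import numpy as np

vocab_size, d_model = 50_257, 8  # GPT-3's real d_model is 12,288
rng = np.random.default_rng(0)

# Hypothetical embedding table: one learned row per token in the vocabulary.
embedding_table = rng.standard_normal((vocab_size, d_model))

token_id = 262  # assume the tokenizer mapped "the" to some integer id
vector = embedding_table[token_id]  # the [1 × d_model] vector for that token
```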

3

Attention Heads (Split)

96 Vectors [1 × 128] (Parallel)
H1
H2
...
H96

The big vector is sliced into 96 smaller pieces to process context in parallel.
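The "slicing" is just a reshape: 12,288 = 96 × 128, so each head gets its own contiguous 128-dimensional slice. A minimal numpy sketch:

```python
import numpy as np

d_model, n_heads = 12_288, 96
d_head = d_model // n_heads  # 128 dimensions per head

x = np.random.randn(1, d_model)        # one token's vector [1 × 12,288]
heads = x.reshape(1, n_heads, d_head)  # 96 slices, each [1 × 128]
```

In a real implementation each head also applies its own learned query/key/value projections before attending; the reshape only shows where the numbers come from.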

4

Concatenation (Merge)

Vector [1 × 12,288]
H1
H2
...
H96

The 96 independent strands of thought are merged back into one rich vector.
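Merging is the inverse reshape: laying the 96 head outputs side by side reproduces a single 12,288-wide vector. A minimal sketch:

```python
import numpy as np

n_heads, d_head = 96, 128
heads = np.random.randn(1, n_heads, d_head)  # 96 head outputs of [1 × 128]
merged = heads.reshape(1, n_heads * d_head)  # one [1 × 12,288] vector
```

(In the full architecture the concatenated vector is also multiplied by a learned output projection; that step keeps the same shape.)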

5

Processing (FFN + Add&Norm)

Vector [1 × 12,288]
0.8
-0.2
...
1.1

The vector is processed by the Feed Forward Network and normalized. This completes one "Block".
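The shape-preserving trick here is the residual connection: the FFN expands, contracts, and its output is *added* back to the input before normalizing. A toy sketch with small dimensions (GPT-3's FFN expands 12,288 → 49,152 → 12,288; GPT models use GELU rather than the ReLU shown here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to mean 0, variance ~1 (scale/shift omitted).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

d_model, d_ff = 8, 32  # toy sizes; GPT-3 uses 12,288 and 4 × 12,288
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

x = rng.standard_normal((1, d_model))
h = np.maximum(x @ W1, 0) @ W2  # feed-forward: expand, nonlinearity, contract
x = layer_norm(x + h)           # Add & Norm: residual sum, then normalize
```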

6

The Stack Loop (Repeat N ×)

Standard Block Structure (× 96):
Masked Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm

The data shape stays the same (Vector of 12,288) but gets refined 96 times.
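Because every block maps a [1 × 12,288] vector to a [1 × 12,288] vector, stacking is just a loop. A sketch where `block` is a stand-in for a real attention + FFN block (any real block is shape-preserving thanks to the residual connections):

```python
import numpy as np

def block(x):
    # Stand-in for one Transformer block; the only property we
    # demonstrate is that the output shape equals the input shape.
    return x + 0.01 * np.tanh(x)

x = np.random.randn(1, 12_288)
for _ in range(96):  # GPT-3 stacks 96 such blocks
    x = block(x)
# The shape never changes; only the contents get refined.
```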

7

Unembedding (Linear)

Logits [1 × 50,257]
3.2
-1.5
...
4.1

The vector is expanded to one score (a logit) for every token in the vocabulary.
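The unembedding is a single matrix multiply against a [d_model × vocab] weight matrix. A toy sketch (random weights, small d_model):

```python
import numpy as np

d_model, vocab_size = 8, 50_257  # GPT-3's real d_model is 12,288
rng = np.random.default_rng(0)
W_unembed = rng.standard_normal((d_model, vocab_size))

x = rng.standard_normal((1, d_model))  # final vector for the last token
logits = x @ W_unembed                 # [1 × 50,257]: one score per token
```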

8

Softmax Output

Probabilities [1 × 50,257]
0.05%
0.00%
...
12.5%

Softmax converts the raw scores into probabilities that sum to 100%.
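Softmax exponentiates each score and divides by the total, so larger logits get a larger share of the probability mass. A minimal sketch using the three example scores above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([3.2, -1.5, 4.1])
probs = softmax(logits)  # non-negative, sums to 1.0
```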

9

Final Selection

String
"mat"

We pick the token with the highest probability (greedy decoding; deployed systems often sample instead), append it to the input, and the loop restarts.
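Greedy selection is just an argmax over the probability vector. A toy sketch with a made-up three-word vocabulary:

```python
import numpy as np

vocab = ["mat", "dog", "sky"]            # illustrative toy vocabulary
probs = np.array([0.7, 0.2, 0.1])        # output of the softmax step
next_word = vocab[int(np.argmax(probs))]  # greedy pick: "mat"
```

The chosen token is then fed back in as the newest element of the context window, and the whole journey repeats.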