Data Shape Journey

The Transformer is a shapeshifting machine. Data doesn't just flow through it; it constantly changes form. This page visualizes exactly how the data looks at every single step of the process.

Context: Processing a Sequence

The breakdown below shows what happens to a single token.

But in reality, the model processes a whole sequence at once: the context window.

Example: "The cat sits on the" (Predicting: "mat")

Input Matrix [5 Tokens × 12,288 Dimensions]
The
Vector 1
cat
Vector 2
sits
Vector 3
on
Vector 4
the
Vector 5

The model passes all 5 vectors through the layers simultaneously. We focus on the last vector (highlighted) because its final transformation will predict the next word ("mat").
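This can be sketched with a small array. A minimal illustration using numpy, with a toy d_model of 8 so the arrays stay readable (GPT-3's real d_model is 12,288):

```python
import numpy as np

# Toy version of the context window: 5 tokens, d_model = 8.
# (GPT-3 uses d_model = 12,288.)
d_model = 8
tokens = ["The", "cat", "sits", "on", "the"]
X = np.random.randn(len(tokens), d_model)  # input matrix [5 × d_model]

# All rows flow through the layers together; only the last row's
# final state is used to predict the next token.
last = X[-1]  # the vector for the final "the", shape [d_model]
```

Every layer preserves this [tokens × d_model] shape, which is why the same block can be stacked many times.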

1

Input: Text

String
"the"

We follow the last token ("the") because its position is the one that produces the next word.

2

Tokenization & Embedding

Vector [1 × 12,288]
0.12
-0.5
...
0.99

The token becomes a list of 12,288 numbers.
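Under the hood this is a row lookup in an embedding table. A hedged sketch with illustrative names (the token id and table are made up, not the real GPT tokenizer or weights), using a toy d_model:

```python
import numpy as np

vocab_size, d_model = 50_257, 8  # GPT-3's real d_model is 12,288
rng = np.random.default_rng(0)

# Hypothetical embedding table: one learned row per token in the vocabulary.
embedding_table = rng.standard_normal((vocab_size, d_model))

token_id = 262  # assume the tokenizer mapped "the" to some integer id
vector = embedding_table[token_id]  # the [1 × d_model] vector for that token
```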

3

Attention Heads (Split)

96 Vectors [1 × 128] (Parallel)
H1
H2
...
H96

The big vector is sliced into 96 smaller pieces to process context in parallel.
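The "slicing" is just a reshape: 12,288 = 96 × 128, so each head gets its own contiguous 128-dimensional slice. A minimal numpy sketch:

```python
import numpy as np

d_model, n_heads = 12_288, 96
d_head = d_model // n_heads  # 128 dimensions per head

x = np.random.randn(1, d_model)        # one token's vector [1 × 12,288]
heads = x.reshape(1, n_heads, d_head)  # 96 slices, each [1 × 128]
```

In a real implementation each head also applies its own learned query/key/value projections before attending; the reshape only shows where the numbers come from.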

4

Concatenation (Merge)

Vector [1 × 12,288]
H1
H2
...
H96

The 96 independent strands of thought are merged back into one rich vector.
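Merging is the inverse reshape: laying the 96 head outputs side by side reproduces a single 12,288-wide vector. A minimal sketch:

```python
import numpy as np

n_heads, d_head = 96, 128
heads = np.random.randn(1, n_heads, d_head)  # 96 head outputs of [1 × 128]
merged = heads.reshape(1, n_heads * d_head)  # one [1 × 12,288] vector
```

(In the full architecture the concatenated vector is also multiplied by a learned output projection; that step keeps the same shape.)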

5

Processing (FFN + Add&Norm)

Vector [1 × 12,288]
0.8
-0.2
...
1.1

The vector is processed by the Feed Forward Network and normalized. This completes one "Block".
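The shape-preserving trick here is the residual connection: the FFN expands, contracts, and its output is *added* back to the input before normalizing. A toy sketch with small dimensions (GPT-3's FFN expands 12,288 → 49,152 → 12,288; GPT models use GELU rather than the ReLU shown here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each vector to mean 0, variance ~1 (scale/shift omitted).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

d_model, d_ff = 8, 32  # toy sizes; GPT-3 uses 12,288 and 4 × 12,288
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

x = rng.standard_normal((1, d_model))
h = np.maximum(x @ W1, 0) @ W2  # feed-forward: expand, nonlinearity, contract
x = layer_norm(x + h)           # Add & Norm: residual sum, then normalize
```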

6

The Stack Loop (Repeat N ×)

Standard Block Structure (× 96):
Masked Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm

The data shape stays the same (Vector of 12,288) but gets refined 96 times.
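Because every block maps a [1 × 12,288] vector to a [1 × 12,288] vector, stacking is just a loop. A sketch where `block` is a stand-in for a real attention + FFN block (any real block is shape-preserving thanks to the residual connections):

```python
import numpy as np

def block(x):
    # Stand-in for one Transformer block; the only property we
    # demonstrate is that the output shape equals the input shape.
    return x + 0.01 * np.tanh(x)

x = np.random.randn(1, 12_288)
for _ in range(96):  # GPT-3 stacks 96 such blocks
    x = block(x)
# The shape never changes; only the contents get refined.
```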

7

Unembedding (Linear)

Logits [1 × 50,257]
3.2
-1.5
...
4.1

The vector is expanded to one score (a logit) for every token in the vocabulary.
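The unembedding is a single matrix multiply against a [d_model × vocab] weight matrix. A toy sketch (random weights, small d_model):

```python
import numpy as np

d_model, vocab_size = 8, 50_257  # GPT-3's real d_model is 12,288
rng = np.random.default_rng(0)
W_unembed = rng.standard_normal((d_model, vocab_size))

x = rng.standard_normal((1, d_model))  # final vector for the last token
logits = x @ W_unembed                 # [1 × 50,257]: one score per token
```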

8

Softmax Output

Probabilities [1 × 50,257]
0.05%
0.00%
...
12.5%

Softmax converts the raw scores into probabilities that sum to 100%.
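Softmax exponentiates each score and divides by the total, so larger logits get a larger share of the probability mass. A minimal sketch using the three example scores above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([3.2, -1.5, 4.1])
probs = softmax(logits)  # non-negative, sums to 1.0
```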

9

Final Selection

String
"mat"

We pick the token with the highest probability (greedy decoding; deployed systems often sample instead), append it to the input, and the loop restarts.
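Greedy selection is just an argmax over the probability vector. A toy sketch with a made-up three-word vocabulary:

```python
import numpy as np

vocab = ["mat", "dog", "sky"]            # illustrative toy vocabulary
probs = np.array([0.7, 0.2, 0.1])        # output of the softmax step
next_word = vocab[int(np.argmax(probs))]  # greedy pick: "mat"
```

The chosen token is then fed back in as the newest element of the context window, and the whole journey repeats.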