Overview

What is GPT?

G - Generative: It generates new text, one word at a time.

P - Pre-trained: Notice the colored blocks (Parameters) further down the page? These represent the learned weights (knowledge) the model acquired during training. It already "knows" what to do before we give it input.

T - Transformer: The entire process described on this page—processing sequences in parallel to understand meaning—is the Transformer architecture.

What is a Transformer?

The Transformer is a deep learning architecture introduced in 2017 that revolutionized Natural Language Processing. Unlike previous models that processed words one by one in order, Transformers can process entire sequences in parallel, paying "attention" to different parts of the sentence simultaneously.

Key Concept: It maps a sequence of inputs (Prompt) to a sequence of outputs (Answer).

The Running Example

To understand how it works, we will follow a single piece of data through the entire network. Our task is simple: Predict the next word.

The cat sat on the ?

Target Answer: mat

The whole text the model is processing at once is called the "Context window". The original GPT-3 model had a context window of only 2,048 tokens; newer models offer context windows of up to 1 million tokens, though very long contexts still come with practical limits in cost and recall quality.

When the prompt exceeds the context window, the oldest tokens at the beginning are simply cut off. That's why models eventually forget the beginning of a long conversation.
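A minimal sketch of this truncation, with a hypothetical function name and a toy window size (real systems count tokens against limits like 2,048 or more):

```python
# Toy sketch of context-window truncation (illustrative, not a real API).
# When the token list outgrows the window, the oldest tokens are dropped.
def truncate_to_window(token_ids, window=8):
    # Keep only the most recent `window` tokens.
    return token_ids[-window:]

history = list(range(12))           # pretend these are 12 token IDs
kept = truncate_to_window(history)  # the first 4 tokens are forgotten
print(kept)                         # [4, 5, 6, 7, 8, 9, 10, 11]
```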

Tokenization: Words to Numbers

Computers don't understand text. First, we must break the sentence into chunks called Tokens, and then convert each token into a unique ID from our vocabulary.

The → ID 3
cat → ID 4
sat → ID 5
on → ID 6
the → ID 3
What does a real token look like?

To the model, "cat" is just a number (e.g. ID 4). In real systems (like GPT), tokens can also be parts of words.

For example, "Tokenization" might be split into two tokens: "Token" and "ization", yielding an ID sequence like [10346, 2860].

The model doesn't know that "Token" and "ization" are connected. It just sees the numbers 10346 and 2860.
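The mapping above can be sketched as a toy word-level tokenizer (the `vocab` dictionary and `encode` helper are illustrative, not a real tokenizer API; real systems like GPT's BPE tokenizer split text into subword pieces):

```python
# A toy word-level tokenizer using the vocabulary from the running example.
vocab = {"the": 3, "cat": 4, "sat": 5, "on": 6}

def encode(text):
    # Lower-case and map each word to its ID; unknown words would need
    # subword splitting in a real tokenizer.
    return [vocab[word] for word in text.lower().split()]

print(encode("The cat sat on the"))  # [3, 4, 5, 6, 3]
```

Note that the repeated word "the" maps to the same ID both times.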

The Architecture Map

Modern LLMs like ChatGPT are Decoder-Only models.

They consist of a single stack of repeated blocks that process text to predict the next word from the bottom up.

The Stack (read bottom-up)

Input (Prompt)
→ Input Embedding
→ N × [ Masked Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm ]
→ Linear
→ Softmax
→ Predicted next word
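A rough numpy sketch of one pass through this stack, with toy dimensions and random stand-in weights (GPT-3 uses d_model = 12,288 and 96 repeated blocks). The simplified single-head attention and layer norm here are minimal stand-ins for the real sublayers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_blocks = 16, 5, 2  # toy sizes, not GPT-3's

def layer_norm(x):
    # Normalize each token's vector (the "Norm" in "Add & Norm").
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def masked_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    # Causal mask: each position may only attend to itself and the past.
    scores[np.triu_indices(len(x), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(seq_len, d_model))        # embedded input tokens
for _ in range(n_blocks):                      # the "N ×" repeated blocks
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
    x = layer_norm(x + masked_attention(x, Wq, Wk, Wv))  # Add & Norm
    W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
    W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
    x = layer_norm(x + np.maximum(x @ W1, 0) @ W2)       # FF + Add & Norm
print(x.shape)  # (5, 16): one refined vector per input token
```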

The Pre-trained "Brain" Parts

The diagram above shows the structure, but where is the "knowledge"? It is stored in the Parameters (matrices of numbers) inside these boxes.

All these numbers are learned during months of training by the big tech companies using massive datasets.

Embedding Matrix

Size: ~50,000 tokens × 12,288 dimensions

Goal: A lookup table with a vector for every word. In GPT-3, each word is a list of 12,288 numbers!
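The lookup is literally a row selection. A toy sketch with made-up sizes and random stand-in weights (GPT-3: ~50,000 × 12,288):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10, 4))  # vocab_size × d_model (toy)

token_ids = [3, 4, 5, 6, 3]            # "The cat sat on the"
vectors = embedding_matrix[token_ids]  # pick one row per token ID
print(vectors.shape)                   # (5, 4): one vector per token
```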

Attention Weight Matrices (WQ, WK, WV)

Size: 12,288 × 128 (per head)

Goal: These learn relationships. GPT-3 splits its massive 12,288 vector into 96 smaller "heads" of size 128 (12,288 ÷ 96 = 128). This allows 96 parallel streams of thought.
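The split into heads is just a reshape of the vector. A toy sketch with d_model = 12 and 3 heads (GPT-3: 12,288 and 96):

```python
import numpy as np

d_model, n_heads = 12, 3
head_dim = d_model // n_heads        # 12 / 3 = 4 (GPT-3: 12,288 / 96 = 128)

x = np.arange(float(d_model))        # one token's 12-number vector
heads = x.reshape(n_heads, head_dim) # 3 parallel streams of 4 numbers each
print(heads.shape)                   # (3, 4)
```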

Attention Output Matrix (WO)

Size: 12,288 × 12,288

Goal: The "Mixer". After all 96 heads do their work, this matrix combines their results back into a single rich representation.
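The recombination can be sketched as a concatenation followed by one matrix multiply (toy sizes; the random `W_O` stands in for the learned mixing matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim = 12, 3, 4  # toy; GPT-3: 12288, 96, 128

head_outputs = rng.normal(size=(n_heads, head_dim))  # results of 3 heads
concatenated = head_outputs.reshape(d_model)         # back to length 12
W_O = rng.normal(size=(d_model, d_model))            # the "Mixer" matrix
mixed = concatenated @ W_O                           # single rich vector
print(mixed.shape)                                   # (12,)
```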

Feed Forward Network Weights

Size: 12,288 → 49,152 → 12,288

Goal: The "processing brain". It expands the information (4x larger) to analyze complex patterns, then compresses it back.
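A sketch of the expand-then-compress shape, with a toy d_model of 8 and a ReLU-style activation standing in for the real non-linearity (GPT uses GELU):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                                   # toy; GPT-3: 12,288
W1 = rng.normal(size=(d_model, 4 * d_model))  # expand 4x: 8 → 32
W2 = rng.normal(size=(4 * d_model, d_model))  # compress back: 32 → 8

x = rng.normal(size=d_model)
hidden = np.maximum(x @ W1, 0)  # non-linearity in the expanded space
out = hidden @ W2
print(out.shape)                # (8,): same size as the input vector
```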

Final Linear Layer

Size: 12,288 inputs → ~50,000 outputs

Goal: Converts the final thought vector into a score for every single token in the vocabulary.
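A sketch of this final step plus the Softmax from the stack diagram, with toy sizes and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                  # toy; GPT-3: 12,288 → ~50,000

W_out = rng.normal(size=(d_model, vocab_size))
final_vector = rng.normal(size=d_model)      # the last token's "thought"

logits = final_vector @ W_out                # one raw score per token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax: scores → probabilities
print(probs.sum())                           # ≈ 1.0
```

The token with the highest probability is the model's prediction for the next word.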