G - Generative: It generates new text, one word at a time.
P - Pre-trained: Notice the colored blocks (Parameters) further down the page? These represent the learned weights (knowledge) the model acquired during training. It already "knows" what to do before we give it input.
T - Transformer: The entire process described on this page—processing sequences in parallel to understand meaning—is the Transformer architecture.
The Transformer is a deep learning architecture introduced in 2017 that revolutionized Natural Language Processing. Unlike previous models that processed words one by one in order, Transformers can process entire sequences in parallel, paying "attention" to different parts of the sentence simultaneously.
To understand how it works, we will follow a single piece of data through the entire network. Our task is simple: Predict the next word.
Target Answer: mat
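Concretely, "predict the next word" means the model assigns a probability to every word in its vocabulary and we pick the most likely one. A minimal sketch, with invented probabilities and assuming a prompt like "The cat sat on the ...":

```python
# Toy sketch (not a real model): the probabilities below are made up.
# A real model would compute one score per vocabulary word.
next_word_probs = {
    "mat": 0.62,
    "rug": 0.21,
    "dog": 0.04,
    "the": 0.01,
}

# Greedy decoding: take the word with the highest probability.
prediction = max(next_word_probs, key=next_word_probs.get)
print(prediction)  # -> mat
```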
The whole text the model is processing is called the "context window". The original GPT-3 model had a context window of only 2,048 tokens; newer models support up to 1 million tokens, though very long contexts still come with practical limits on cost, speed, and how reliably the model can use everything in them.
When the prompt exceeds the context window, it is simply cut off at the beginning. That's why models eventually "forget" the start of a long conversation.
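This truncation can be sketched in a few lines (`fit_to_context` is a hypothetical helper, not a real API; real chat systems use more careful strategies, such as keeping the system prompt):

```python
def fit_to_context(tokens, max_tokens=2048):
    """Sketch of context-window truncation: when the prompt is too
    long, the oldest tokens are dropped from the beginning."""
    if len(tokens) <= max_tokens:
        return tokens
    return tokens[-max_tokens:]  # keep only the most recent tokens

# A 3,000-token conversation squeezed into GPT-3's 2,048-token window:
history = list(range(3000))    # stand-in for token IDs
window = fit_to_context(history)
print(len(window), window[0])  # -> 2048 952  (tokens 0..951 are forgotten)
```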
Computers don't understand text. First, we must break the sentence into chunks called Tokens, and then convert each token into a unique ID from our vocabulary.
To the model, "cat" is just a number (e.g. ID 4). In real systems (like GPT), tokens can also be parts of words.
For example, "Tokenization" might be split into two tokens, "Token" and "ization", leading to a sequence of IDs like [10346, 2860].
The model doesn't know that "Token" and "ization" are connected. It just sees the numbers 10346 and 2860.
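At its core, this lookup is just a dictionary from text chunks to numbers. A minimal sketch (the vocabulary here is invented, reusing the example IDs above; real tokenizers such as BPE also decide *how* to split the text into chunks):

```python
# Toy vocabulary: a tiny, made-up slice of what a real ~50,000-entry
# vocabulary would contain.
vocab = {"Token": 10346, "ization": 2860, "cat": 4}

def tokenize(chunks):
    """Map each text chunk to its unique vocabulary ID."""
    return [vocab[chunk] for chunk in chunks]

print(tokenize(["Token", "ization"]))  # -> [10346, 2860]
```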
Modern LLMs like ChatGPT are Decoder-Only models.
They consist of a single stack of repeated blocks that process text to predict the next word from the bottom up.
The diagram above shows the structure, but where is the "knowledge"? It is stored in the Parameters (matrices of numbers) inside these boxes.
All of these numbers are learned during months of training on massive datasets by the big tech companies.
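Just how many numbers? A back-of-envelope count using GPT-3's published sizes (96 layers, 12,288-dimensional vectors, a ~50,000-word vocabulary) lands close to its famous 175 billion parameters. Biases and LayerNorm weights are ignored here because they are comparatively tiny:

```python
# Back-of-envelope parameter count for GPT-3.
d_model = 12_288      # width of each word vector
n_layers = 96         # repeated decoder blocks
vocab_size = 50_257   # GPT-3's vocabulary size

attention = 4 * d_model**2           # Q, K, V and output-projection matrices
mlp = 2 * (4 * d_model) * d_model    # expand to 4x the width, then compress back
per_layer = attention + mlp

embedding = vocab_size * d_model     # the lookup table of word vectors
total = n_layers * per_layer + embedding
print(f"{total / 1e9:.0f}B parameters")  # -> 175B parameters
```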
Embedding Matrix — Goal: A lookup table with a vector for every word. In GPT-3, each word is a list of 12,288 numbers!
Attention Matrices (Query, Key, Value) — Goal: These learn relationships between words. GPT-3 splits its massive 12,288-number vector into 96 smaller "heads" of size 128 (12,288 ÷ 96 = 128). This allows 96 parallel streams of thought.
Output Projection — Goal: The "Mixer". After all 96 heads do their work, this matrix combines their results back into a single rich representation.
Feed-Forward Network — Goal: The "processing brain". It expands the information (4× larger, to 49,152 numbers) to analyze complex patterns, then compresses it back down to 12,288.
Unembedding Matrix — Goal: Converts the final thought vector into a score for every single word in the vocabulary.
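The steps above can be traced as matrix shapes. This sketch uses tiny, made-up sizes so it runs instantly (GPT-3's real sizes are in the comments), and it leaves out the attention-score computation itself to focus on how the vector is split, mixed, expanded, and finally scored:

```python
import numpy as np

# Scaled-down stand-ins for GPT-3's dimensions.
d_model, n_heads, vocab = 48, 4, 100  # GPT-3: 12288, 96, ~50k
d_head = d_model // n_heads           # GPT-3: 12288 // 96 = 128
seq_len = 5                           # tokens in the context window

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))      # 1. embedded tokens

# 2. Attention: split each vector into parallel heads...
heads = x.reshape(seq_len, n_heads, d_head)  # (5, 4, 12)

# 3. ...then the "Mixer" combines the heads back into one vector.
W_out = rng.normal(size=(d_model, d_model))
mixed = heads.reshape(seq_len, d_model) @ W_out

# 4. Feed-forward: expand to 4x the width, then compress back.
W_up = rng.normal(size=(d_model, 4 * d_model))
W_down = rng.normal(size=(4 * d_model, d_model))
h = np.maximum(mixed @ W_up, 0) @ W_down     # ReLU-style nonlinearity

# 5. Unembedding: one score per vocabulary word, at every position.
W_vocab = rng.normal(size=(d_model, vocab))
scores = h @ W_vocab
print(scores.shape)  # -> (5, 100)
```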