Intro
In this post we will build a GPT model from scratch. This includes preparing the dataset, implementing the multi-head attention mechanism and layer normalisation, and putting it all together to generate sequences of text from a trained model.
The code for this project can be found in gpt1.py in philiprj's llm_playground repository on GitHub (the same repository that hosts the dataset linked below).
Our model will be a decoder-only transformer, meaning each token can only attend to tokens earlier in the sequence. This differs from bidirectional encoder models like BERT, which can attend to both prior and subsequent tokens. We will train our model on a next-character prediction task so that the fully trained model can generate new text.
In order to follow along you will need to install and import PyTorch
!pip install torch
import torch
import torch.nn as nn
from torch.nn import functional as F
We will use the following hyperparameters in training and initialising the model
batch_size = 64  # how many independent sequences will we process in parallel?
block_size = 256  # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = "cuda" if torch.cuda.is_available() else "cpu"
eval_iters = 200
n_embed = 160
n_heads = 4
n_layers = 4
dropout = 0.2
# ------------
torch.manual_seed(1337)
Data
For this model we will be using a Shakespeare dataset, roughly 1MB of Shakespeare's works as plain text. You can retrieve it from the URL below:
https://raw.githubusercontent.com/philiprj/llm_playground/refs/heads/main/data/input.txt
Once downloaded we can load this data:
with open("data/input.txt", encoding="utf-8") as f:
    text = f.read()
To represent the data in a way our model can understand, we will create an encoding where each character in the text is mapped to an integer.
# here are all the unique characters that occur in this text
chars = sorted(set(text))
vocab_size = len(chars)

# create a mapping from characters to integers
str2int = {ch: i for i, ch in enumerate(chars)}
int2str = dict(enumerate(chars))

def encode(s: str) -> list[int]:
    # encoder: take a string, output a list of integers
    return [str2int[c] for c in s]

def decode(li: list[int]) -> str:
    # decoder: take a list of integers, output a string
    return "".join([int2str[i] for i in li])
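As a quick sanity check we can round-trip a string through these functions; the exact integers depend on the dataset's vocabulary, so the output is only illustrative:

print(encode("hii there"))          # a list of integers, one per character
print(decode(encode("hii there")))  # "hii there"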
Then we can encode the data as a PyTorch Tensor
data = torch.tensor(encode(text), dtype=torch.long)
In order to evaluate our model we need to keep a held-out validation dataset. We can also evaluate against this dataset during training to check that our model is not overfitting the training data. For this example we will use a 90:10 training/validation split:
n = int(0.9 * len(data))  # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
To optimise our model training we will process chunks of data in parallel. This process is called batching:
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
The above works by picking batch_size random starting points in the input text and selecting the next block_size characters from each. These snippets are then stacked on top of each other, creating a tensor of shape (batch_size, block_size). The targets are the same sequences shifted one character to the right, so the target at each position is simply the next character.
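For example, with the hyperparameters above, a single call returns input and target tensors of shape (batch_size, block_size):

xb, yb = get_batch("train")
print(xb.shape)  # torch.Size([64, 256])
print(yb.shape)  # torch.Size([64, 256])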
Model: GPT
We will be building a decoder only model based on GPT. This model uses a Feed Forward Neural Network, a Masked Multi-Head Attention mechanism, Layer Normalisation, Residual Connections, Positional Encoding, and Output Classification Layer.
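To give a sense of how these pieces fit together before we look at each one, below is a minimal sketch of the top-level model. It assumes a Block module (masked multi-head attention plus feed forward, covered in the following sections) and the hyperparameters defined earlier; the full version lives in the linked repository.

class GPTLanguageModel(nn.Module):
    """Minimal decoder-only model: token and position embeddings, a stack of
    transformer blocks, a final layer norm, and a linear classification head."""

    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_heads) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embed)
        x = tok_emb + pos_emb     # (B, T, n_embed)
        x = self.blocks(x)        # (B, T, n_embed)
        x = self.ln(x)            # (B, T, n_embed)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = GPTLanguageModel().to(device)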
Model Training
We will train the model using the cross-entropy loss, which can be computed using PyTorch:
loss = F.cross_entropy(logits, targets)
We will also use a helper function to calculate the average loss over a set number of evaluation iterations. Running this over both the training and validation datasets helps us understand whether the model is overfitting during training and enables us to terminate training early. The @torch.no_grad() decorator ensures gradients are not tracked during evaluation:

@torch.no_grad()
def estimate_loss(model):
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            _, loss = model(x, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
We will use the AdamW optimiser for model training:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
And the full training loop can be seen below:
for iter in range(max_iters):
    # every once in a while evaluate the loss on the train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss(model)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch("train")

    # evaluate the loss
    logits, loss = model(xb, yb)

    # ensure gradients are cleared
    optimizer.zero_grad(set_to_none=True)

    # backpropagate the loss through the network and take an optimizer step
    loss.backward()
    optimizer.step()
Model Properties
Attention
Attention is a mechanism that allows transformer models to attend to different parts of a sequence when making predictions. It allows the model to understand the relationships between different words or phrases in the sequences.
A token does not necessarily need to attend to tokens in its own sequence; when it attends to tokens from another sequence this is called cross-attention. In the encoder-decoder architecture, cross-attention lets the attention heads in the decoder attend to the outputs of the encoder, while the decoder's self-attention heads attend to the outputs it has already generated.
When a token attends to other tokens in the same input sequence, this is known as self-attention. In an encoder the model can look at the entire context when making predictions; in a decoder, masking is used to ensure the model only attends to previous tokens in the sequence.
Self Attention
Every token in the sequence emits a query, a key, and a value. To compute the attention scores, we take the dot product of each token's query with the keys of all tokens in the sequence. After normalisation, these scores become the weights of the self-attention mechanism.
Attention is a communication mechanism. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
An attention head is a small network that can learn different features through its query, key, and value projections:
# Initialise the linear projections to perform the self attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Key: information about what each token has
k = key(x)  # (B, T, head_size)
# Query: information about what each token is looking for
q = query(x)  # (B, T, head_size)
# Value: information the token will pass on if another token is looking for it
v = value(x)  # (B, T, head_size)
These projections give us the queries, keys, and values for each token. The dot product of the query and key outputs is then taken:
# Compute the attention scores
wei = q @ k.transpose(-2, -1)
A softmax is then applied so that the attention weights for each token sum to 1. This gives us the weights for the self-attention mechanism.
wei = F.softmax(wei, dim=-1)
These weights are then applied to the values emitted by each node, producing the output of the attention mechanism.
# Output is the matrix product of the attention scores and the values
out = wei @ v
Note the following properties of this implementation of attention:
- There is no notion of space - attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is processed completely independently.
- In an encoder attention block all nodes are allowed to communicate with each other.
- In a decoder attention block future tokens are masked, so only the previous tokens can be seen. An efficient way to achieve this is to create a lower triangular matrix of ones, then fill the positions where it is 0 with -inf, so that after the softmax normalisation those weights become 0:
# Apply the mask to the attention scores
tril = torch.tril(torch.ones(T, T))  # create a lower triangular matrix of 1s
wei = wei.masked_fill(tril == 0, float('-inf'))  # fill the upper-triangular 0s with -inf
- An additional step is added to "scale" the attention by dividing by the square root of the head size. This means that when the inputs Q and K have unit variance, wei will have unit variance too, so the softmax stays diffuse and does not saturate. This matters because if the attention scores become highly negative or positive, the softmax produces a very peaked distribution (close to one-hot vectors):
wei = q @ k.transpose(-2, -1) * head_size ** -0.5
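Putting these steps together, a single masked self-attention head could look like the sketch below. The class name Head is an assumption chosen to match how it is used in the multi-head block in the next section, and the hyperparameters are those defined at the top of the post:

class Head(nn.Module):
    """A single head of masked (causal) self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower triangular mask, registered as a buffer so it is not a learnable parameter
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        # scaled attention scores
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # mask out future positions, then normalise
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        # weighted aggregation of the values
        out = wei @ v  # (B, T, head_size)
        return out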
Multi-Head Attention
Multi-Head Attention simply runs several single attention heads in parallel and combines their outputs. This enables each head to capture different aspects of the input data and also improves overall performance.
- Richer representation: For example, one head might focus on subject-verb agreement, while another might focus on word order.
- Improved performance: allows for parallel computation, which enhances efficiency.
- Enhanced generalisation: helps models generalise better to unseen data by learning multiple types of dependencies.
- Flexibility and adaptability: can handle words that have different meanings in different contexts.
- Reduced computational cost and memory usage: The dimensionality of each sub-vector is smaller than the original vector.
This can be simply implemented as below:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
Or for computational efficiency, it is typically implemented as just another dimension in the Attention Head Block:
# query, key, values for all heads in batch and move head forward to be the batch dim
q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
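The rest of the computation then proceeds exactly as in the single-head case, just with the extra head dimension. A hedged sketch of the continuation, assuming a registered causal-mask buffer (here called self.bias) and an output projection (here called self.c_proj), might look like:

# (B, nh, T, hs) @ (B, nh, hs, T) -> (B, nh, T, T)
att = (q @ k.transpose(-2, -1)) * k.size(-1) ** -0.5
# self.bias is an assumed causal-mask buffer of shape (1, 1, block_size, block_size)
att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
att = F.softmax(att, dim=-1)
y = att @ v  # (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
# re-assemble all head outputs side by side and project back to the embedding size
y = y.transpose(1, 2).contiguous().view(B, T, C)
y = self.c_proj(y)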
Residual Connections
Residual connections (or skip connections) help train deep networks more effectively by allowing the network to pass information from earlier layers directly to later layers, bypassing one or more layers.
Mathematically they can be seen as the input x of a layer F being added to that layer's output: y = x + F(x).
Or programmatically, the forward pass of a block becomes:
def forward(self, x):
    x = x + self.sa_heads(x)  # (B, T, C)
    x = x + self.ffwd(x)  # (B, T, C)
    return x
- Residual Connections help reduce the likelihood of vanishing gradients in large networks by providing a direct path for gradients to flow back through the network.
- Residual connections ensure that even if deeper layers do not learn useful features, earlier layers’ outputs are still passed through, preventing degradation.
- By adding the input to the output of each layer, residual connections encourage each layer to build upon the representations from previous layers, enabling more efficient feature reuse.
Layer Normalisation
Normalisation standardises a layer's activations so that they have a consistent mean and variance (zero mean and unit variance, before the learnable scale and shift). This effectively encourages the outputs to fall within a standardised range.
The benefits of this are:
- Increased stabilisation during training
- Faster training to get to optimal parameters
To do this we subtract the mean and divide by the standard deviation for our layer. Mathematically:
y = γ * (x − μ) / √(σ² + ε) + β
where μ and σ² are the mean and variance of the layer's activations, ε is a small constant for numerical stability, and γ and β are learnable parameters for each layer, so their values are updated during training.
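As an illustration of the formula (not the exact internals of nn.LayerNorm), a manual version normalising over the embedding dimension could look like:

def layer_norm(x, gamma, beta, eps=1e-5):
    # normalise each token's activations over the embedding (last) dimension
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma and beta are the learnable scale and shift parameters
    return gamma * x_hat + beta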
In a network this could be applied to each layer, or, as in the transformer architecture, after the Attention layer and Feed Forward layer, and before the output classification layer.
In code this can be simply implemented using PyTorch
self.ln = nn.LayerNorm(n_embed)

x = self.blocks(x)
x = self.ln(x)
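Combining the attention, feed-forward network, residual connections, and layer normalisation gives a full transformer block. The sketch below assumes the MultiHeadAttention class from earlier plus a simple FeedForward module, and applies layer norm before each sub-layer (the pre-norm arrangement common in GPT-style models); the repository code may place it differently.

class FeedForward(nn.Module):
    """A simple position-wise feed-forward network."""

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: masked multi-head attention followed by a feed-forward
    network, each wrapped in a residual connection with layer normalisation."""

    def __init__(self, n_embed, n_heads):
        super().__init__()
        head_size = n_embed // n_heads
        self.sa_heads = MultiHeadAttention(n_heads, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa_heads(self.ln1(x))  # normalise, attend, then add the residual
        x = x + self.ffwd(self.ln2(x))
        return x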
Language Model Head: Classification Layer
The language model head simply takes the output from the final transformer block (Attention + Layer Norm + Feed Forward) and applies a Linear Layer, outputting a value for each token in our vocabulary.
self.lm_head = nn.Linear(n_embed, vocab_size)
During text generation a Softmax activation is then applied to this, providing us with a probability distribution over the vocabulary. The next token is then chosen from this distribution.
probs = F.softmax(logits, dim=-1)  # (B, C)
# sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
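A hedged sketch of a full generation loop built around this sampling step (the function name and signature are assumptions; the repository version may implement this as a method on the model) might look like:

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    # idx is a (B, T) tensor of token indices for the current context
    for _ in range(max_new_tokens):
        # crop the context to the last block_size tokens
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)
        # keep only the logits for the final time step
        logits = logits[:, -1, :]  # (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        # sample the next token and append it to the running sequence
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
    return idx

# e.g. start from a single token (index 0) and decode the generated characters
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(generate(model, context, max_new_tokens=500)[0].tolist()))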
Appendix: Extra Model Details
Positional Encoding
In textual inputs the order of the sequence is important to understanding the meaning; if we rearrange the text we can end up with a completely different meaning. Transformers do not natively understand the concept of sequence order, which is a big benefit in terms of parallel processing, but it means we need to encode this understanding somehow.
Transformers use Positional Encoding, which commonly comes in two types:
- Absolute Positional Encoding
- Rotary Position Embedding
Absolute Position Encoding:
Sinusoidal Positional Encoding, used in the original "Attention Is All You Need" paper, is the usual choice here. It works by:
- Generating a unique vector for each position in the sequence, using a combination of sine and cosine functions.
- The encoding vector has the same dimensionality as the token embeddings, allowing them to be simply added together.
- Different dimensions in the encoding vector correspond to sinusoids of different frequencies, creating a spectrum from high to low frequencies.
- This approach allows the model to easily attend to relative positions, as the encoding for any fixed offset can be represented as a linear function of the encoding at a given position.
This approach is effective because it allows efficient computation of relative positions: the encoding for any fixed offset can be obtained by applying a simple linear transformation to the encoding of the current position, which lets the attention mechanism focus on the relative distances between tokens. It also makes position-relative patterns easy to learn, so the model can learn patterns that depend on relative rather than absolute positions.
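As a rough illustration, the sinusoidal encoding table from the paper can be computed as below (variable names are illustrative, and the result is simply added to the token embeddings):

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # one frequency per pair of dimensions: base^(-2i / d_model)
    div_term = base ** (-torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe  # (seq_len, d_model)

pe = sinusoidal_positional_encoding(block_size, n_embed)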
Rotary Positional Embedding (RoPE)
Rotary Position Embedding (RoPE), introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding", takes a different approach:
- Instead of adding separate positional encodings, RoPE rotates the query and key vectors based on their position in the sequence.
- The rotation preserves the magnitude (maintaining token similarity) while encoding positional information in the angle
This approach allows RoPE to integrate both token and positional information into a single operation, making it more efficient and potentially more effective than other positional encoding methods.
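As a rough sketch of the idea, one common "rotate pairs of dimensions" formulation applied to a query or key tensor of shape (B, T, head_size) could look like the following; real RoPE implementations differ in how they pair and interleave the dimensions:

def apply_rope(x, base=10000.0):
    # x: (B, T, D) query or key vectors; D (the head size) must be even
    B, T, D = x.shape
    half = D // 2
    # one rotation frequency per pair of dimensions: base^(-2i / D)
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    # rotate each (x1, x2) pair by its position-dependent angle
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)  # (B, T, D)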