Let's build GPT-3: Building Blocks (part 1)


Tags
AI
LLM
Published
January 11, 2025
Author
Philip Redford

Intro

This blog is the first in a three-part series on building, training, and optimising GPT. It assumes some knowledge of how attention and decoder-only Transformers work; for a more detailed dive you can check out my previous blog on building GPT here. The major differences between that implementation and this one are:
  • The layer normalisation, which was previously applied after each sub-block, is now applied to the input of each sub-block (the "pre-norm" arrangement used by GPT-2); a small sketch of the difference follows this list.
  • An additional layer normalisation is applied after the final transformer block, before the model head.
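To make the first change concrete, here is a minimal sketch of the two arrangements of the Block forward pass (the exact post-norm code from the previous blog may differ slightly):

# Post-norm (previous blog): normalise after adding the residual
def forward(self, x):
    x = self.ln_1(x + self.attn(x))
    x = self.ln_2(x + self.mlp(x))
    return x

# Pre-norm (this blog, as in GPT-2): normalise the input of each sub-block,
# leaving the residual path itself untouched
def forward(self, x):
    x = x + self.attn(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x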
We will also make use of Lambda Labs to access GPUs; you can see my guide on setting this up here.

Base Code

We will start by defining our causal self-attention block. This code is mostly copied from the previous blog, so we will not go into much detail here.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """This is multi-head attention with causal masking."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch (3 * config.n_embd)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # store config values for use in forward
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Mask
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size
            ),
        )

    def forward(self, x):
        B, T, C = x.size()  # Batch size, sequence length, and embedding dimension
        # Calculate query, key, and value for all heads in the batch and move head dimension up
        # nh = number of heads, hs = head size, C (n channels) = nh * hs
        # In GPT-2, nh = 12, hs = 64, C = 768
        qkv = self.c_attn(x)
        # Split the qkv tensor into three separate tensors: q, k, and v
        q, k, v = qkv.split(self.n_embd, dim=2)
        # Reshape the key, query, and value for multi-head attention.
        # Essentially treats nh as a batch dimension so all heads are computed in parallel
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        # Attention (materialises the large TxT matrix for all queries and keys)
        # Scaled dot product attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        # Apply the causal mask
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        # Normalise the attention scores; each row sums to 1
        att = F.softmax(att, dim=-1)
        # Get output of the attention with the value
        y = att @ v  # (B, nh, T, hs)
        # Reassemble the head outputs side by side into a 3D tensor
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # (B, T, C)
        # Output projection
        y = self.c_proj(y)
        return y
This performs the masked multi-head attention computation efficiently.
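As a quick sanity check (a minimal sketch using a hypothetical stand-in config, since GPTConfig is only defined further down), we can pass a random batch through the block and confirm the output has the same shape as the input:

from types import SimpleNamespace

import torch

# Hypothetical mini-config just for this check (the real GPTConfig comes later)
cfg = SimpleNamespace(n_embd=768, n_head=12, block_size=1024)
attn = CausalSelfAttention(cfg)

x = torch.randn(2, 8, cfg.n_embd)  # (batch, sequence length, embedding dim)
y = attn(x)
print(y.shape)  # torch.Size([2, 8, 768]) - same shape as the input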
We then need to define our multi-layer perceptron:
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Linear layer to project input to 4 times the embedding dimension, allows for more complex representations
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        # GELU activation function (tanh approximation) - the approximation is mostly redundant now,
        # but it matches the original GPT-2 implementation
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x
We will combine these in a repeatable block with residual connections:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Residual connection around each sub-layer
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
We define our model config (parameters chosen to match the original 124M-parameter GPT-2 model):
from dataclasses import dataclass


@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257  # 50,000 BPE merges + 256 byte tokens + 1 <|endoftext|> token
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
We are then ready to put this all together in a GPT model class:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.vocab_size, config.n_embd),
                wpe=nn.Embedding(config.block_size, config.n_embd),
                h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
                ln_f=nn.LayerNorm(config.n_embd),
            )
        )
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        assert (
            T <= self.config.block_size
        ), f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
        # Get token and position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        pos = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = tok_emb + pos
        for block in self.transformer.h:
            x = block(x)
        # Final Layer Norm
        x = self.transformer.ln_f(x)
        # Get the logits
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            # Compute the loss
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
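To check the wiring (a minimal sketch with random weights and made-up token ids, not a real evaluation), we can run a forward pass and confirm the logits have shape (batch, sequence length, vocab size):

import torch

config = GPTConfig()
model = GPT(config)

# A batch of 2 made-up sequences of 16 random token ids
idx = torch.randint(0, config.vocab_size, (2, 16))
targets = torch.randint(0, config.vocab_size, (2, 16))

logits, loss = model(idx, targets)
print(logits.shape)  # torch.Size([2, 16, 50257])
print(loss)          # a scalar cross-entropy loss tensor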

Loading from HuggingFace

To test our model definition above, we will load the GPT-2 model weights from HuggingFace and check that the model works as expected with these weights. To achieve this we will define a from_pretrained class method on our GPT class:
@classmethod
def from_pretrained(cls, model_type):
    """Loads pretrained GPT-2 model weights from huggingface"""
    assert model_type in {"gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"}
    from transformers import GPT2LMHeadModel

    print(f"loading weights from pretrained gpt: {model_type}")

    # n_layer, n_head and n_embd are determined from model_type
    config_args = {
        "gpt2": dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
        "gpt2-medium": dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
        "gpt2-large": dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
        "gpt2-xl": dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
    }[model_type]
    config_args["vocab_size"] = 50257  # always 50257 for GPT model checkpoints
    config_args["block_size"] = 1024  # always 1024 for GPT model checkpoints

    # create a from-scratch initialized minGPT model
    config = GPTConfig(**config_args)
    model = GPT(config)
    sd = model.state_dict()
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith(".attn.bias")]  # discard this mask / buffer, not a param

    # init a huggingface/transformers model
    model_hf = GPT2LMHeadModel.from_pretrained(model_type)
    sd_hf = model_hf.state_dict()

    # copy while ensuring all of the parameters are aligned and match in names and shapes
    sd_keys_hf = sd_hf.keys()
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith(".attn.masked_bias")]  # ignore these, just a buffer
    sd_keys_hf = [k for k in sd_keys_hf if not k.endswith(".attn.bias")]  # same, just the mask (buffer)
    transposed = ["attn.c_attn.weight", "attn.c_proj.weight", "mlp.c_fc.weight", "mlp.c_proj.weight"]
    # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
    # this means that we have to transpose these weights when we import them
    assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
    for k in sd_keys_hf:
        if any(k.endswith(w) for w in transposed):
            # special treatment for the Conv1D weights we need to transpose
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            # vanilla copy over the other parameters
            assert sd_hf[k].shape == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    return model
This method is mostly a straightforward copy of weights, although some care is needed with a handful of weight matrices that are transposed relative to what our PyTorch nn.Linear layers expect. This is because the original GPT-2 checkpoints were produced with TensorFlow and use a Conv1D-style weight layout.
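With that in place, loading the pretrained weights is a one-liner (this assumes the transformers package is installed):

model = GPT.from_pretrained("gpt2")  # downloads the 124M-parameter GPT-2 checkpoint from HuggingFace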

Testing our pre-trained model

We can test that the model generates outputs using a simple loop. First we will check whether GPU support is available, then encode an input sentence and feed it into the model to generate the next token. We feed this output back in to keep generating tokens until the max-length limit is hit. Note that the snippet below instantiates a randomly initialised model with GPT(GPTConfig()), which is why the output is gibberish; swap this for GPT.from_pretrained("gpt2") to generate with the pre-trained weights:
device = "cpu" if torch.cuda.is_available(): device = "cuda" elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): device = "mps" print(f"Using device: {device}") num_return_sequences = 5 max_length = 30 model = GPT(GPTConfig()) model.eval() model.to(device) # Prefix tokens import tiktoken enc = tiktoken.get_encoding("gpt2") tokens = enc.encode("Hello, how are you?") tokens = torch.tensor(tokens, dtype=torch.long) x = tokens.unsqueeze(0).repeat(num_return_sequences, 1).to(device) torch.manual_seed(42) # Loop and keep adding tokens to the sequence while x.size(-1) < max_length: # No need to compute gradients with torch.no_grad(): # Get logits from last column logits, _ = model(x) logits = logits[:, -1, :] # Get the probabilities over vocab probs = F.softmax(logits, dim=-1) # Get top 50 probabilities and indices topk_probs, topk_indices = torch.topk(probs, 50, dim=-1) # Sample from the top 50 probabilities ix = torch.multinomial(topk_probs, num_samples=1) # Get the sampled token xcol = torch.gather(topk_indices, dim=-1, index=ix) # Add the sampled token to the sequence x = torch.cat((x, xcol), dim=1) for i in range(num_return_sequences): tokens = x[i, :max_length].tolist() decoded = enc.decode(tokens) print(">", decoded)
# OUTPUT
Using device: cpu
Hello, how are you? resurrection Ek Prel wonderingate adjustment ranges���� viewer Fitzgerald� agendadm 1 transportsantine lectidan Minor Caldwell flora Bar ambitious Overs
Hello, how are you? Horror 2006oad SLI falls Shea FSA Detectiveolated wornAnt425 EG Rud opposes Jaune captive Olympic spar pertinent awe genius inclined exodus
Hello, how are you? cra Comfort Notes aromaticiddlesstro thrief chlor fostering packed owlylesuti faire Mou ConductPacldom orally collects��Leod equation
Hello, how are you?Kansas Hare Martian enchanted008piringurring pic Totem Kardash Consult tread KalReward Balk boneacent volcanic====== DaddyNYices 195 arrangements
Hello, how are you?tted strawberry Jazeerawolf Removequiteo paternalfaced propagate obscurityLeave Efficiencyresent out stick watch legends Barbarian Wilhelm Chronicinabasketball Bran

Data Loading

We can code up a simple data loader to tokenise and batch our dataset. This creates batches of a given sequence length, with targets that are the input sequence shifted by one token, i.e. each target is the next token in the sequence.
import logging

import tiktoken
import torch

logger = logging.getLogger(__name__)


class DataLoaderLite:
    def __init__(self, B: int, T: int):
        self.B = B
        self.T = T

        with open("data/input.txt", "r", encoding="utf-8") as file:
            text = file.read()

        enc = tiktoken.get_encoding("gpt2")
        # Note: only the first 1,000 characters are tokenised here; drop the slice to use the full dataset
        tokens = enc.encode(text[:1000])
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        logger.info(f"Loaded {len(self.tokens)} tokens")
        logger.info(f"1 Epoch: {len(self.tokens) // (self.B * self.T)} batches")

        self.current_pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_pos : self.current_pos + B * T + 1]
        x = buf[:-1].view(B, T)
        y = buf[1:].view(B, T)
        self.current_pos += B * T
        # If we've reached the end of the tokens, reset the position
        if self.current_pos + (B * T + 1) >= len(self.tokens):
            self.current_pos = 0
        return x, y
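A quick look at what one batch gives us (a minimal sketch; it assumes data/input.txt exists as in the class above, and uses the same batch sizes as the training loop below):

loader = DataLoaderLite(B=4, T=32)
x, y = loader.next_batch()

print(x.shape, y.shape)                   # torch.Size([4, 32]) torch.Size([4, 32])
print(torch.equal(x[0, 1:], y[0, :-1]))   # True - y is x shifted left by one token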

Weight Initialisation

GPT-2 uses a normal distribution with a standard deviation of 0.02 to initialise the weights of the network. This can be achieved in PyTorch:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        ...
        # Initialize the weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            # Initialize the weights with a normal distribution, STD of 0.02
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
We need to slightly adjust the initialisation because the residual connections keep adding to the activations, so the variance of the residual stream grows with depth. To compensate we scale the std of the residual-path output projections by 1/√(2 · n_layer); the factor of 2 is there because each layer adds two residual contributions, one from attention and one from the MLP. We can do this by adding a NANOGPT_SCALE_INIT flag to the output projection (c_proj) layers and updating the std as below; a fuller sketch of the wiring follows the snippet.
std = 0.02
if hasattr(module, 'NANOGPT_SCALE_INIT'):
    std *= (2 * self.config.n_layer) ** -0.5
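For completeness, here is a sketch of how the flag and the scaled std fit together, following the nanoGPT convention (the same flag is typically also set on the attention block's c_proj; the exact placement in your own code may differ):

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        # Mark the residual-path projection so _init_weights knows to scale its std
        self.c_proj.NANOGPT_SCALE_INIT = 1

# Inside the GPT class:
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, "NANOGPT_SCALE_INIT"):
            # 1/sqrt(2 * n_layer): each of the n_layer blocks adds two residual contributions
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)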
A common alternative to this approach is Xavier initialisation, which is functionally very similar. The difference is that Xavier derives the std from the layer's fan-in and fan-out (std = gain · √(2 / (fan_in + fan_out))) rather than using a fixed value:
torch.nn.init.xavier_normal_(tensor, gain=1.0)

Weight Sharing

Note that in GPT and some other transformers, the weights of the input embedding are shared with the output classification layer (the language-model head). This weight tensor is used twice in the forward pass and therefore accumulates gradients from both uses. This is a common technique in language models, used in both the GPT-2 paper and the original Attention is All You Need paper.
This weight tensor accounts for about a third of the total parameters in the model (roughly 40M of the 124M), so sharing it saves a significant number of parameters. We also expect the two matrices to end up similar anyway, as they perform functionally similar jobs: mapping between the token vocabulary and the embedding space.
It is relatively simple to implement in PyTorch as below:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        ...
        # Share the token embedding weights with the LM head
        self.transformer.wte.weight = self.lm_head.weight
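A quick way to confirm the sharing works (a minimal sketch, once the line above is added to __init__): both modules should point at the same underlying tensor, and the shared parameters should only be counted once.

model = GPT(GPTConfig())

# Same tensor object, not just equal values
print(model.transformer.wte.weight is model.lm_head.weight)  # True

# Shared parameters are only counted once by model.parameters()
print(sum(p.numel() for p in model.parameters()))  # ≈ 124M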

Training Loop

Our training loop is simple: we first create the model and send it to the GPU if one is available. We then use our data loader to loop through batches, make predictions, compute the loss, backpropagate it through the network, and update the weights with the AdamW optimiser.
device = "cpu" if torch.cuda.is_available(): device = "cuda" elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): device = "mps" print(f"Using device: {device}") torch.manual_seed(1337) if device == "cuda": torch.cuda.manual_seed(1337) # Get batches for training train_loader = DataLoaderLite(B=4, T=32) model = GPT(GPTConfig()) model.to(device) optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) for i in range(30): optimizer.zero_grad() x, y = train_loader.next_batch() x = x.to(device) y = y.to(device) logits, loss = model(x, y) loss.backward() optimizer.step() print(f"Loss: {loss.item()} | Iteration: {i}")

Summary

In this blog we built all the basic building blocks for GPT using only Python and PyTorch. These included:
  • An efficient multi-head attention mechanism with causal masking
  • A multi-layer perceptron, combined with attention into residual blocks
  • A forward pass that predicts the next token and computes the loss
  • A method to load pre-trained weights from HuggingFace
  • A simple method to generate text
  • A data loader to encode and batch our inputs and define target labels
  • An effective weight initialisation method and weight-sharing scheme
  • A training loop to train our model

Next steps

In the next article we will focus on optimising our training process using mixed precision, compiling the model, flash attention and more! You can find that post here.