Intro
Tokenisation is the process of breaking down text (or other input types) into smaller subword units, known as tokens. In this guide we will break down the fundamental principles of tokenisation and tokenisers, and implement a tokeniser from scratch using just Python. We will also explore some common libraries used for building tokenisers.
What is Tokenisation?
LLMs do not natively understand raw text, so we need to represent our input numerically. Tokenisation breaks text down into tokens, and each token is assigned a numerical representation. It is both the first and the final step of the language model: input text is converted into tokens, and the model's output tokens are converted back into text.
The overall LLM process works as below:
- Encode the input text into tokens using a tokeniser. Each token is assigned a specific number in the tokeniser’s vocabulary.
- Tokens are passed through the model, which typically includes an embedding layer and transformer blocks. The embedding layer converts the tokens into dense vectors that capture semantic meaning. The transformer blocks then process these embedding vectors to generate results.
- The output is decoded using the tokeniser back to text. This is done by mapping the tokens back to their corresponding words using the tokeniser’s vocabulary.
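As a rough sketch of this loop, the snippet below shows the three steps end to end. The `tokeniser` and `model` objects here are hypothetical stand-ins for what we build later in the guide, not a real API.

```python
prompt = "The capital of France is"

# 1. Encode text -> token ids using the tokeniser's vocabulary
input_ids = tokeniser.encode(prompt)

# 2. The model maps token ids to embeddings, runs them through
#    transformer blocks, and predicts the next token ids
output_ids = model.generate(input_ids)

# 3. Decode token ids -> text
print(tokeniser.decode(output_ids))
```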
Naive Tokenisation
The simplest tokenisers operate on word-level or character-level tokenisation. This involves a simple mapping of word/char → index in vocab.
- Find all unique characters in the dataset.
- Create character-to-integer and integer-to-character lookup tables for converting strings into tokens.
- Convert each integer token into a vectorised representation using an embedding table. This embedding table will have rows equal in number to the vocabulary size.
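As a concrete illustration, here is a minimal character-level tokeniser sketch in plain Python. The toy corpus and the `stoi`/`itos` names are just for illustration.

```python
text = "hello world"                          # toy training corpus
chars = sorted(set(text))                     # all unique characters
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hello"))           # [3, 2, 4, 4, 5]
print(decode(encode("hello")))   # hello
```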
The problem with this approach is that it does not account for languages other than the one for which the vocabulary lookup tables were created. Additionally, because the vocabulary is small, the tokenised sequences become much longer. As a result, transformers process less text at a time due to their fixed context window.
Similarly, word-level tokenisation suffers from a very large vocabulary size, meaning the embedding table and the language model head have more rows, increasing the computational complexity of the model.
Efficient Tokenisation With Byte-Pair Encoding
Modern tokenisers use subword-level tokenisation, which allows for a richer text representation. This is typically performed using Byte-Pair Encoding (BPE). A subword representation can help the model handle words it has never seen before, and makes computation more efficient than a character-level representation.
The BPE algorithm works by finding the most frequent pair of bytes in the training data and merging them into a single token. This process is repeated until the desired vocabulary size is reached.
BPE Example:
Suppose the data to be encoded is

```
aaabdaaabac
```

The byte pair `aa` occurs most often, so it will be replaced by a byte that is not used in the data, such as `Z`. Now there is the following data and replacement table:

```
ZabdZabac
Z=aa
```

Then the process is repeated with byte pair `ab`, replacing it with `Y`:

```
ZYdZYac
Y=ab
Z=aa
```

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing `ZY` with `X`:

```
XdXac
X=ZY
Y=ab
Z=aa
```

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
How to build a Tokeniser
When building a tokeniser, we want to ensure the tokeniser is efficient and handles other languages and special characters.
Unicode for tokenisation
We will use the Unicode standard as the starting point for our tokeniser. It contains a list of all characters in the world (~150k) and their corresponding code points. Using Unicode code points directly is not efficient for LLMs: the vocabulary is too large, requiring more computation in the embedding and output layers, and the standard is continuously updated.
We can use Python's built-in `ord` function to convert any character of any language into its integer Unicode code point.

```python
print([ord(x) for x in "안녕하세요 👋 (hello in Korean!)"])
# OUTPUT
# [50504, 45397, 54616, 49464, 50836, 32, 128075, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33, 41]
```
We can use the string's `encode` method to convert it to a list of integers based on the UTF-8 encoding. UTF-8 is generally considered the best encoding for this purpose because it offers backward compatibility with ASCII and a simpler character encoding scheme.

```python
print(list("안녕하세요 👋 (hello in Korean!)".encode("utf-8")))
# OUTPUT
# [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184, 236, 154, 148, 32, 240, 159, 145, 139, 32, 40, 104, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114, 101, 97, 110, 33, 41]
```
However, UTF-8 encoding for tokenisation presents significant challenges:
- Limited Vocabulary Size: The vocabulary size would be restricted to 256, corresponding to the number of possible byte values.
- Long Tokenised Sequences: This limited vocabulary size leads to much longer tokenised sequences.
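To make the second point concrete, compare the number of characters with the number of UTF-8 bytes for the example string above: 26 characters become 39 byte-level tokens.

```python
s = "안녕하세요 👋 (hello in Korean!)"
print(len(s))                   # 26 characters (Unicode code points)
print(len(s.encode("utf-8")))   # 39 bytes
```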
We can use the BPE algorithm above to increase the vocab size while reducing the sequence lengths at the same time.
BPE Implementation
The first step in the implementation is to get the counts for each pair of bytes in the sequence.
```python
from collections import defaultdict

def get_stats(ids):
    # Count how often each consecutive pair of tokens occurs
    counts = defaultdict(int)
    for pair in zip(ids, ids[1:]):
        counts[pair] += 1
    return counts
```
We can then take the most frequent byte pair and assign it a new token id.
```python
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)  # most frequent pair
idx = 256 + i                         # new token id (the i-th merge)
```
Next we replace all occurrences of this byte pair with the new token id.
```python
def merge_tokens(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the new token `idx`
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids
```
Now we can repeat this process. The more merges we perform, the larger the vocabulary and the shorter the sequences. Initially the vocab size is 256 (one token per possible byte value). The target vocabulary size is a tuneable hyperparameter we can optimise during training.
```python
vocab_size = 276
num_merges = vocab_size - 256  # 20 merges

ids = list(tokens)
merges = {}  # (int, int) -> int

# Loop through to get the desired vocab size
for i in range(num_merges):
    # Get the most frequent pair
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)
    idx = 256 + i
    print(f"merging {top_pair} into {idx}")
    # Replace most frequent pair with new token
    ids = merge_tokens(ids, top_pair, idx)
    merges[top_pair] = idx

print(f"Tokens length: {len(tokens)}")
print(f"IDs length: {len(ids)}")
print(f"Compression ratio: {len(tokens) / len(ids):.2f}")
```
```
# OUTPUT
merging (101, 32) into 256
merging (105, 110) into 257
merging (115, 32) into 258
merging (116, 104) into 259
merging (101, 114) into 260
merging (99, 111) into 261
merging (116, 32) into 262
merging (226, 128) into 263
merging (44, 32) into 264
merging (97, 110) into 265
merging (111, 114) into 266
merging (100, 32) into 267
merging (97, 114) into 268
merging (101, 110) into 269
merging (257, 103) into 270
merging (261, 100) into 271
merging (121, 32) into 272
merging (46, 32) into 273
merging (97, 108) into 274
merging (259, 256) into 275
Tokens length: 24597
IDs length: 19438
Compression ratio: 1.27
```
Encoding
Given an input string, encoding is the process of converting it to a sequence of integers in the range `[0, vocab_size)`.

```python
def encode(text):
    # Convert the string to raw UTF-8 bytes
    tokens = list(text.encode("utf-8"))
    # While byte pairs can still be merged
    while len(tokens) >= 2:
        # Get the byte pair frequencies
        stats = get_stats(tokens)
        # Find the pair with the lowest merge rank (earliest learned merge)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        # If this pair is not in merges then we are done
        if pair not in merges:
            break
        # Replace the pair with its merged token
        tokens = merge_tokens(tokens, pair, merges[pair])
    return tokens
```
encode("Hello, world!")
[72, 101, 108, 108, 111, 264, 119, 266, 108, 100, 33]
Decoding
Decoding is the reverse process of going from integer values back to a string representation.
```python
# Initialise a byte mapping for the 256 raw byte values
vocab = {i: bytes([i]) for i in range(256)}
# Append our merged byte pairs to the mapping
# (iterating in the order the merges were learned)
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    # Map the integers to bytes and join into a byte string
    tokens = b"".join(vocab[x] for x in ids)
    # Decode bytes to string
    text = tokens.decode("utf-8", errors="replace")
    return text
```
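As a quick sanity check (assuming the `encode` function and `merges` table defined above), decoding should round-trip the original string:

```python
print(decode(encode("Hello, world!")))  # Hello, world!
```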
Extra Considerations
Regex Patterns to Enforce Splits
Sometimes we want to enforce rules that stop certain types of tokens being merged together. For example, for the tokens 'Dog?', 'Dog!' and 'Dog.', we can imagine 'Dog' + '?' being merged into a single token 'Dog?'. This is inefficient when we have a limited vocab size; we would rather keep the tokens 'Dog' and '?' separate. We can therefore enforce merging constraints using regular expressions that prevent tokens from being merged if they match certain patterns.
```python
import regex as re  # the `regex` module supports \p{...} character classes

gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))
# ['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']
```
- `'s|'t|'re|'ve|'m|'ll|'d` matches common word contractions. This stops `Hello` + `'ve` merging to form `Hello've`.
- ` ?\p{L}+` captures sequences of letters (optionally preceded by a single space). This stops `world` + `123` merging to form `world123`.
- ` ?\p{N}+` captures sequences of numbers. This stops `123` merging with the whitespace that follows it.
- ` ?[^\s\p{L}\p{N}]+` captures sequences of punctuation or other special characters. This stops `!!!?` merging with the preceding word.
- `\s+(?!\S)` ensures that runs of whitespace are not merged together with the following non-whitespace characters. So a string with wide whitespace, e.g. `Hello    World`, would be tokenised to [`Hello`, `   `, ` World`]: the run of whitespace becomes its own token, except for the final space, which attaches to the following word. Note that a single whitespace precedes the word token, but no whitespace follows it. This is one reason LLMs struggle with trailing whitespace: tokenisers do not encode it the way users expect.
- `\s+` is a catch-all that handles any remaining whitespace characters, mainly trailing whitespace, not caught by the patterns above.
The reason this approach works is that we first split up the input sequence using the regex pattern to determine where the splits should occur. Each element of the resulting list is processed independently by the tokeniser and the results are then concatenated. Thus we would never see the first two elements of the above array, `Hello` and `'ve`, merged together, as they are processed separately. An improved version would compile the pattern with `re.IGNORECASE` so that tokenisation is consistent across lower and upper case. A drawback of this approach is that it is very English-specific (for example the hard-coded contractions), so it does not transfer cleanly to other languages.
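Putting this together, here is a sketch of how the split pattern could be combined with the byte-level `encode` function from above. The helper name `encode_with_splits` is just for illustration, and the toy `merges` table was trained without these splits, so this only demonstrates the idea rather than reproducing GPT-2's tokeniser.

```python
import regex as re

def encode_with_splits(text, pattern=gpt2pat):
    # Split the text into chunks first, so merges never cross chunk boundaries
    chunks = re.findall(pattern, text)
    ids = []
    for chunk in chunks:
        ids.extend(encode(chunk))  # BPE-encode each chunk independently
    return ids

print(encode_with_splits("Hello've world123"))
```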
In the GPT-2 tokeniser an additional rule was added so that whitespace characters are never merged together: a run of spaces is kept as individual space tokens. This can cause problems for programming languages such as Python, which rely on indentation, so the amount of whitespace carries meaning. With newer tokenisers (e.g. GPT-4's) this was updated so that whitespace can be merged. GPT-4 also ensures that at most three digits are ever merged together into a single number token.
TikToken
TikToken is a library created by OpenAI for GPT tokenisers. It can be used for inference but it cannot be used to train a new tokeniser. The library can be used as below for efficient tokenisation:
```python
import tiktoken

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!!"))
# [220, 220, 220, 23748, 995, 10185]

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("    hello world!!!"))
# [262, 24748, 1917, 12340]
```
Special Tokens
Special tokens are tokens used to represent special markers or sequences in the text. For example, the `<|endoftext|>` token is used to mark the end of a text and to separate documents within the training data. This tells the model that any following information is not relevant to the previous text, and vice-versa. The LLM is expected to learn this meaning from examples, rather than the token having any predefined meaning.

Special tokens do not go through the BPE algorithm, i.e. `<|endoftext|>` is represented as a single token, whereas a slight variation such as `<|endoftex` would be represented as multiple tokens derived from BPE.

It is possible to extend the tokeniser with new special tokens:
```python
cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name.
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)
```
When we add tokens, we also need to update our LLM. For example, we need to extend the embedding matrix by adding a row for each new token, and add a corresponding output unit to the layer before the softmax.
SentencePiece
SentencePiece is commonly used because (unlike tiktoken) it can efficiently both train and run inference for BPE tokenisers. It is used in both the Llama and Mistral series.
The big difference: SentencePiece runs BPE on the Unicode code points directly. It then has an option, `character_coverage`, that controls what happens to very rare code points that appear only a few times: they are either mapped to an `UNK` (unknown) token or, if `byte_fallback` is turned on, encoded with UTF-8 and represented as raw bytes instead.

```python
import sentencepiece as spm
import os

options = dict(  # noqa: C408
    # input spec
    input="toy.txt",
    input_format="text",
    # output spec
    model_prefix="tok400",  # output filename prefix
    # algorithm spec
    # BPE alg
    model_type="bpe",
    vocab_size=400,
    # normalization
    normalization_rule_name="identity",  # ew, turn off normalization
    remove_extra_whitespaces=False,
    input_sentence_size=200000000,  # max number of training sentences
    max_sentence_length=4192,  # max number of bytes per sentence
    seed_sentencepiece_size=1000000,
    shuffle_input_sentence=True,
    # rare word treatment
    character_coverage=0.99995,
    byte_fallback=True,
    # merge rules
    split_digits=True,
    split_by_unicode_script=True,
    split_by_whitespace=True,
    split_by_number=True,
    max_sentencepiece_length=16,
    add_dummy_prefix=True,
    allow_whitespace_only_pieces=True,
    # special tokens
    unk_id=0,  # the UNK token MUST exist
    bos_id=1,  # the others are optional, set to -1 to turn off
    eos_id=2,
    pad_id=-1,
    # systems
    num_threads=os.cpu_count(),  # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)
```
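Once training has finished, the resulting model file (here `tok400.model`, from the `model_prefix` above) can be loaded for inference. A minimal sketch:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tok400.model")  # produced by the training call above

ids = sp.encode_as_ids("hello 안녕하세요")
pieces = sp.encode_as_pieces("hello 안녕하세요")
print(ids)
print(pieces)
print(sp.decode_ids(ids))  # round-trips back to the original text
```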
One of the main drawbacks of SentencePiece is the large number of parameters that need to be set for it to work effectively. A good rule of thumb is to use parameters similar to those used for Llama 2 or Llama 3. Additionally, SentencePiece is a slightly older library with some outdated concepts, so it has the potential to go wrong.
Considerations
- Vocab size is a hyperparameter
- So what should the vocab size be?
- How can I increase the vocab size?
When we grow our vocab, the linear layer and embedding matrix grow in size, increasing the number of parameters in the model and its computational cost.
```python
class GPT1(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_heads=n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
```
If we want to increase the vocab size, we need to increase the vocab size parameter in the model and then train it again. Often people load pre-trained model weights, freeze all parameters except the embedding matrix and the linear (output) layer, and then train on a new dataset with the new vocab size so that the model learns the new tokens.
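A rough PyTorch sketch of that procedure is below. It assumes the `GPT1` module and the globals `vocab_size` and `n_embed` from the snippet above, plus a hypothetical `num_new_tokens`; it is an illustration of the idea, not a definitive recipe.

```python
import torch
import torch.nn as nn

num_new_tokens = 2                             # e.g. two new special tokens
new_vocab_size = vocab_size + num_new_tokens

model = GPT1()
# ... load pre-trained weights here, e.g. model.load_state_dict(torch.load("gpt1.pt"))

# Swap in a larger embedding table and output head, copying over the pre-trained rows
old_emb, old_head = model.token_embedding_table, model.lm_head
model.token_embedding_table = nn.Embedding(new_vocab_size, n_embed)
model.lm_head = nn.Linear(n_embed, new_vocab_size)
with torch.no_grad():
    model.token_embedding_table.weight[:vocab_size] = old_emb.weight
    model.lm_head.weight[:vocab_size] = old_head.weight
    model.lm_head.bias[:vocab_size] = old_head.bias

# Freeze everything except the embedding matrix and the output head
for p in model.parameters():
    p.requires_grad = False
for p in model.token_embedding_table.parameters():
    p.requires_grad = True
for p in model.lm_head.parameters():
    p.requires_grad = True
```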
Looking Forward
Multimodal Domain Adaptation
An approach that is seeing a lot of adoption for multimodal models is to represent other modalities as tokens. This means we tokenise our audio, image and video inputs and then feed them into the LLM just as we would regular text input.
Newer models like SORA use this approach. SORA creates 'visual patches' in the same way that GPT has text tokens. These patches can then be fed into the attention mechanism to learn relationships.