I am creating a GPT-like model from scratch to improve my understanding of tokenization, encoding, and the architecture itself. As part of that, I first wrote my own, fairly naïve BPE encoder using plain for-loops, but as you might guess, it was terribly slow for large vocabularies with many merges, since it re-counts pairs and re-applies the best merge over the whole corpus on every iteration. The sketch below shows roughly what I mean (simplified, not my actual code):
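from collections import Counter

def naive_bpe_merges(words, num_merges):
    # corpus as a multiset of words, each word a tuple of single-character symbols
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair across the whole corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # re-tokenize every word with the new merge (full rescan on every iteration)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

Because of that, I decided to use the Huggingface Tokenizer class from the tokenizers library instead, with the following function to create a tokenizer: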
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def create_tokenizer(corpus_path, tokenizer_path, vocab_size):
    # Initialize a tokenizer with BPE model
    tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
    tokenizer.pre_tokenizer = Whitespace()
    # Initialize the trainer
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=['<UNK>'])
    # List of files to train on
    files = [corpus_path]
    # Train the tokenizer
    tokenizer.train(files, trainer)
    # Saving the tokenizer
    tokenizer.save(tokenizer_path)
    print(f"Successfully saved tokenizer at {tokenizer_path}")
    return tokenizer
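For completeness, I call it roughly like this (the corpus path and vocab size below are just placeholders; the tokenizer path is the real one that shows up in the output further down):

tokenizer = create_tokenizer(
    corpus_path='tiny-llm/corpora/nature.txt',         # placeholder corpus path
    tokenizer_path='tiny-llm/tokenizers/nature.json',  # same path as loaded below
    vocab_size=5000,                                    # placeholder vocab size
)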
This actually works well to create a tokenizer, and the resulting .json file also looks good:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      (......)
    "vocab": {
      "<UNK>": 0,
      "'": 1,
      ",": 2,
      "-": 3,
      ".": 4,
      "_": 5,
      "a": 6,
      "b": 7,
      "c": 8,
      "d": 9,
      "e": 10,
      "f": 11,
      "g": 12,
      "h": 13,
      "i": 14,
Then, when I try to load it using the following method:
import torch
from tokenizers import Tokenizer
from tokenizers.models import BPE

import gpt  # my own module that defines the GPT model class

def load_model(model_class, config):
    model = model_class(**config['Hyperparameters'])
    tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
    tokenizer.from_file(config['Files']['tokenizer'])
    # DEBUG
    print("Loaded tokenizer from file:", config['Files']['tokenizer'])
    vocab = tokenizer.get_vocab()
    print("Vocabulary size:", len(vocab))
    print("Sample entries from vocabulary:", dict(list(vocab.items())[:10]))
    # GUBED
    model.set_tokenizer(tokenizer, config)
    model.load_state_dict(torch.load(config['Files']['model'], map_location=config['Hyperparameters']['device']))
    model.eval()
    return model
Calling it like this:

trained_model = load_model(gpt.GPT, config)

gives the following output:
Loaded tokenizer from file: tiny-llm/tokenizers/nature.json
Vocabulary size: 0
Sample entries from vocabulary: {}
Tokenizer succesfully set
So the vocabulary comes back empty, as if the tokenizer were not instantiated correctly, yet I do not receive any errors. I checked the Huggingface quicktour for tokenizers, but to no avail, and I can't spot any issue myself. Does anyone know what may cause my vocabulary to be empty?
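In case it helps to narrow things down, this is the loading step in isolation, with the model code stripped out (the same calls as in load_model, just with the tokenizer path hard-coded):

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
tokenizer.from_file('tiny-llm/tokenizers/nature.json')

print("Vocabulary size:", len(tokenizer.get_vocab()))  # reports 0, as in the output above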