I am creating a GPT-like model from scratch to improve my understanding of tokenization, encoding, and the architecture itself. As part of that, I first wrote my own, fairly naïve BPE encoder using plain for-loops, but as you might guess, it was terribly slow for large vocabularies with many merges, since it re-counts pairs and re-applies the best merge over the whole corpus on every iteration. The sketch below shows roughly what I mean (simplified, not my actual code):
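from collections import Counter

def naive_bpe_merges(words, num_merges):
    # corpus as a multiset of words, each word a tuple of single-character symbols
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair across the whole corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # re-tokenize every word with the new merge (full rescan on every iteration)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

Because of that, I decided to use the Huggingface Tokenizer class from the tokenizers library instead, with the following function to create a tokenizer: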
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def create_tokenizer(corpus_path, tokenizer_path, vocab_size):
    # Initialize a tokenizer with BPE model
    tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
    tokenizer.pre_tokenizer = Whitespace()
    # Initialize the trainer
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=['<UNK>'])
    # List of files to train on
    files = [corpus_path]
    # Train the tokenizer
    tokenizer.train(files, trainer)
    # Saving the tokenizer
    tokenizer.save(tokenizer_path)
    print(f"Successfully saved tokenizer at {tokenizer_path}")
    return tokenizer
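For completeness, I call it roughly like this (the corpus path and vocab size below are just placeholders; the tokenizer path is the real one that shows up in the output further down):

tokenizer = create_tokenizer(
    corpus_path='tiny-llm/corpora/nature.txt',         # placeholder corpus path
    tokenizer_path='tiny-llm/tokenizers/nature.json',  # same path as loaded below
    vocab_size=5000,                                    # placeholder vocab size
)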
This actually works well to create a tokenizer, and the resulting .json file also looks good:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      (......)
    "vocab": {
      "<UNK>": 0,
      "'": 1,
      ",": 2,
      "-": 3,
      ".": 4,
      "_": 5,
      "a": 6,
      "b": 7,
      "c": 8,
      "d": 9,
      "e": 10,
      "f": 11,
      "g": 12,
      "h": 13,
      "i": 14,
Then, when I try to load it using the following method:
import torch
from tokenizers import Tokenizer
from tokenizers.models import BPE

import gpt  # my own module that defines the GPT model class

def load_model(model_class, config):
    model = model_class(**config['Hyperparameters'])
    tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
    tokenizer.from_file(config['Files']['tokenizer'])
    # DEBUG
    print("Loaded tokenizer from file:", config['Files']['tokenizer'])
    vocab = tokenizer.get_vocab()
    print("Vocabulary size:", len(vocab))
    print("Sample entries from vocabulary:", dict(list(vocab.items())[:10]))
    # GUBED
    model.set_tokenizer(tokenizer, config)
    model.load_state_dict(torch.load(config['Files']['model'], map_location=config['Hyperparameters']['device']))
    model.eval()
    return model
Calling it like this:

trained_model = load_model(gpt.GPT, config)

gives the following output:
Loaded tokenizer from file: tiny-llm/tokenizers/nature.json
Vocabulary size: 0
Sample entries from vocabulary: {}
Tokenizer succesfully set
So the vocabulary comes back empty, as if the tokenizer were not instantiated correctly, yet I do not receive any errors. I checked the Huggingface quicktour for tokenizers, but to no avail, and I can't spot any issue myself. Does anyone know what may cause my vocabulary to be empty?
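In case it helps to narrow things down, this is the loading step in isolation, with the model code stripped out (the same calls as in load_model, just with the tokenizer path hard-coded):

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token='<UNK>'))
tokenizer.from_file('tiny-llm/tokenizers/nature.json')

print("Vocabulary size:", len(tokenizer.get_vocab()))  # reports 0, as in the output above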