Load custom pretrained tokenizer

jbmaxwell · October 28, 2021, 1:36am

I’m trying to run BigBird on my own dataset, but I hit an error when loading my custom, saved tokenizer.
I train the tokenizer using:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer that pre-splits on whitespace
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

# Train on the raw text files
paths = ['./content/test.txt', './content/train.txt']
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=paths, trainer=trainer)

# Save files to disk
tokenizer.model.save("./tokenizer/bigbird_full")
tokenizer.save("./tokenizer/bigbird_full/config.json")

Then I try to run run_mlm.py with:

import os

# out_dir (the output folder name) is set earlier in my script
os.system(
    f"python run_mlm.py \
       --model_type 'big_bird' \
       --config_name './models/bigbird_full/config.json' \
       --tokenizer_name './tokenizer/bigbird_full' \
       --train_file './content/train.txt' \
       --output_dir './{out_dir}' \
       --do_train \
       --num_train_epochs 1 \
       --overwrite_output_dir"
)
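For anyone reproducing this, the same invocation can also be written as an argument list rather than one shell string, which avoids the nested quoting. This is only a sketch: the helper name build_mlm_command and the "output" folder are placeholders of mine, and the flags are exactly the ones from the command above.

```python
import shlex
import sys


def build_mlm_command(out_dir):
    """Return the run_mlm.py invocation as a list of arguments.

    build_mlm_command is a hypothetical helper; the flags and paths
    are copied from the os.system call in the post.
    """
    return [
        sys.executable, "run_mlm.py",
        "--model_type", "big_bird",
        "--config_name", "./models/bigbird_full/config.json",
        "--tokenizer_name", "./tokenizer/bigbird_full",
        "--train_file", "./content/train.txt",
        "--output_dir", f"./{out_dir}",
        "--do_train",
        "--num_train_epochs", "1",
        "--overwrite_output_dir",
    ]


# "output" is a placeholder for whatever out_dir holds in the real script
cmd = build_mlm_command("output")

# Show the command as a copy-pasteable shell line
print(shlex.join(cmd))

# To actually launch it (needs run_mlm.py in the working directory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

With check=True, subprocess.run raises if run_mlm.py exits with an error, whereas os.system only returns the exit status.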

It fails with the error:

OSError: Can't load tokenizer for './tokenizer/bigbird_full'. Make sure that:

- './tokenizer/bigbird_full' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure './tokenizer/bigbird_full' is not a path to a local directory with something else, in that case)

- or './tokenizer/bigbird_full' is the correct path to a directory containing relevant tokenizer files

- or 'main' is a valid git identifier (branch name, a tag name, or a commit id) that exists for this model name as listed on its model page on 'https://huggingface.co/models'

My tokenizer’s directory contains: config.json, merges.txt, vocab.json

What is the error trying to tell me? Or rather, what are “relevant tokenizer files” if not these?

PS – It’s worth noting that, for some reason, I have to manually add "model_type":"big_bird" to my saved config.json, otherwise I get an error telling me basically exactly that… (i.e., it needs that key/val).
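For reference, one approach that is often suggested for this situation is to wrap the trained tokenizer in transformers' PreTrainedTokenizerFast and call save_pretrained, so that the directory holds the files from_pretrained looks for. The sketch below is not a confirmed fix for the setup above: the tiny corpus, the temporary directory, and the train_and_save helper are all made up for illustration, and the special tokens are the ones from the training code.

```python
import tempfile
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import AutoTokenizer, PreTrainedTokenizerFast

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]


def train_and_save(corpus_files, out_dir):
    """Train a BPE tokenizer and save it in the transformers layout.

    train_and_save is a hypothetical helper written for this sketch.
    """
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=SPECIAL_TOKENS)
    tokenizer.train(files=corpus_files, trainer=trainer)

    # Wrap the raw tokenizer so transformers knows the special tokens
    wrapped = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="[UNK]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        mask_token="[MASK]",
    )
    # Writes tokenizer.json and tokenizer_config.json into out_dir
    wrapped.save_pretrained(out_dir)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    # Toy corpus standing in for ./content/train.txt
    corpus = Path(tmp) / "train.txt"
    corpus.write_text("hello world\nhello tokenizer\n", encoding="utf-8")

    saved = train_and_save([str(corpus)], str(Path(tmp) / "tok"))
    saved_files = sorted(p.name for p in Path(saved).iterdir())
    print(saved_files)

    # Reload through the same entry point run_mlm.py uses
    reloaded = AutoTokenizer.from_pretrained(saved)
    ids = reloaded("hello world")["input_ids"]
    print(ids)
```

The resulting directory could then be passed as --tokenizer_name. Note that this layout uses tokenizer.json rather than a config.json inside the tokenizer folder.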
