I’m trying to run BigBird on my dataset, but I’m hitting an error when loading my custom, saved tokenizer.
I train the tokenizer using:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
# Build an untrained BPE tokenizer that pre-splits text into words and punctuation
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
# Train on the raw text files, reserving BERT-style special tokens
paths = ['./content/test.txt', './content/train.txt']
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=paths, trainer=trainer)
# Save files to disk
tokenizer.model.save("./tokenizer/bigbird_full")  # writes vocab.json and merges.txt
tokenizer.save("./tokenizer/bigbird_full/config.json")  # writes the full serialized tokenizer to this path
Then I try to run run_mlm.py with:
import os  # out_dir is defined earlier in my script
os.system(
f"python run_mlm.py \
--model_type 'big_bird' \
--config_name './models/bigbird_full/config.json' \
--tokenizer_name './tokenizer/bigbird_full' \
--train_file './content/train.txt' \
--output_dir './{out_dir}' \
--do_train \
--num_train_epochs 1 \
--overwrite_output_dir"
)
It fails with the error:
OSError: Can't load tokenizer for './tokenizer/bigbird_full'. Make sure that:
- './tokenizer/bigbird_full' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure './tokenizer/bigbird_full' is not a path to a local directory with something else, in that case)
- or './tokenizer/bigbird_full' is the correct path to a directory containing relevant tokenizer files
- or 'main' is a valid git identifier (branch name, a tag name, or a commit id) that exists for this model name as listed on its model page on 'https://huggingface.co/models'
My tokenizer’s directory contains: config.json, merges.txt, vocab.json
What is the error trying to tell me? Or rather, what are “relevant tokenizer files” if not these?
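To make the question concrete, here is a minimal sketch that reports which candidate files exist in the tokenizer directory. The helper `report_tokenizer_files` is my own hypothetical function, not part of transformers. The list of file names is an assumption on my part about what the loader might look for: the error message does not name them.

```python
from pathlib import Path

# Hypothetical helper, not part of transformers or tokenizers.
# ASSUMPTION: these are file names a tokenizer loader might look for;
# the error message itself does not list them.
CANDIDATE_FILES = (
    "tokenizer.json",           # conventional name for a serialized `tokenizers` Tokenizer
    "tokenizer_config.json",    # tokenizer settings written by save_pretrained
    "special_tokens_map.json",  # special-token mapping written by save_pretrained
    "spiece.model",             # SentencePiece model (assumed for BigBird's own tokenizer)
    "vocab.json",               # BPE vocabulary (present in my directory)
    "merges.txt",               # BPE merge rules (present in my directory)
)


def report_tokenizer_files(directory):
    """Return a dict mapping each candidate file name to whether it exists."""
    root = Path(directory)
    return {name: (root / name).is_file() for name in CANDIDATE_FILES}


def print_report(directory):
    """Print one line per candidate file, marked as found or missing."""
    for name, present in report_tokenizer_files(directory).items():
        status = "found" if present else "missing"
        print(f"{status:8s}{name}")
```

Calling `print_report("./tokenizer/bigbird_full")` on my setup should show vocab.json and merges.txt as found and the rest as missing, since the only other file in the directory is config.json. If the loader does expect one of the missing names, that would explain the error, but I have not confirmed that.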
PS: It’s worth noting that, for some reason, I have to manually add "model_type":"big_bird" to my saved config.json; otherwise I get an error telling me basically exactly that (i.e., it needs that key/value).