I have trained a tokenizer from scratch, using:
```python
tokenizer.train(
    files=[pth],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```
I save the tokenizer, I use it to train a BERT model from scratch, and later I want to test this model using:
```python
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
```
But it complains that the tokenizer is unrecognized:
"[…] Should have a `model_type` key in its config.json"
How can I save the tokenizer so that a `model_type` key is present in its config.json?
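For context, here is roughly how I train and save the tokenizer. This is a minimal, self-contained sketch: the original doesn't show the tokenizer class, so I'm assuming `ByteLevelBPETokenizer` from the `tokenizers` library (which matches the `<s>`/`</s>`/`<mask>` special tokens), and the corpus file and save paths are throwaway placeholders.

```python
import tempfile
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Throwaway corpus so the sketch runs on its own; in my real setup,
# `pth` points at the actual training text file.
workdir = Path(tempfile.mkdtemp())
pth = workdir / "corpus.txt"
pth.write_text("hello world\n" * 100 + "training a tokenizer from scratch\n" * 100)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[str(pth)],
    vocab_size=52_000,  # upper bound; a tiny corpus yields far fewer merges
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes vocab.json + merges.txt into a directory;
# save() writes a single tokenizer.json file.
save_dir = workdir / "tokenizer"
save_dir.mkdir()
tokenizer.save_model(str(save_dir))
tokenizer.save(str(workdir / "tokenizer.json"))
```

Neither save method emits a config.json with a `model_type` key, which is presumably why the pipeline rejects it.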