I have trained a tokenizer from scratch, using:
```python
tokenizer.train(
    files=[pth],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
```
I save the tokenizer, I use it to train a BERT model from scratch, and later I want to test this model using:
```python
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
```
But it complains that the tokenizer is unrecognized:
"[…] Should have a `model_type` key in its config.json"
How can I save the tokenizer so that a `model_type` key is present in its config.json?
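For context, here is roughly how I train and save the tokenizer. This is a minimal, self-contained sketch: the original doesn't show the tokenizer class, so I'm assuming `ByteLevelBPETokenizer` from the `tokenizers` library (which matches the `<s>`/`</s>`/`<mask>` special tokens), and the corpus file and save paths are throwaway placeholders.

```python
import tempfile
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Throwaway corpus so the sketch runs on its own; in my real setup,
# `pth` points at the actual training text file.
workdir = Path(tempfile.mkdtemp())
pth = workdir / "corpus.txt"
pth.write_text("hello world\n" * 100 + "training a tokenizer from scratch\n" * 100)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[str(pth)],
    vocab_size=52_000,  # upper bound; a tiny corpus yields far fewer merges
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes vocab.json + merges.txt into a directory;
# save() writes a single tokenizer.json file.
save_dir = workdir / "tokenizer"
save_dir.mkdir()
tokenizer.save_model(str(save_dir))
tokenizer.save(str(workdir / "tokenizer.json"))
```

Neither save method emits a config.json with a `model_type` key, which is presumably why the pipeline rejects it.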