I am currently fine-tuning the Wav2Vec2 XLS-R model. I created a tokenizer and stored it in one of my Hugging Face repos. Here is how I push the tokenizer to the Hub:
```python
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "./", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
tokenizer.push_to_hub(repo_name)
```
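For context, `Wav2Vec2CTCTokenizer.from_pretrained("./", ...)` reads its vocabulary from a local vocab.json in the current directory. A minimal sketch of how such a file could be built (the exact character set here is hypothetical, not my actual vocabulary):

```python
import json
import string

# Hypothetical character-level vocabulary: 26 lowercase letters,
# an apostrophe, the word delimiter, and the two special tokens.
chars = list(string.ascii_lowercase) + ["'", "|", "[UNK]", "[PAD]"]
vocab = {token: idx for idx, token in enumerate(chars)}

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

print(len(vocab))  # 30 entries in this sketch
```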
Later on, I want to reuse the tokenizer for the same task, so I load it with the following code:

```python
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(repo_name)
```
With that code, all the files the tokenizer needs are downloaded: vocab.json, tokenizer_config.json, and special_tokens_map.json. According to vocab.json, my vocabulary should contain 30 tokens.
But here is something strange: when I load the tokenizer, I get two additional tokens on top of the ones already in my vocabulary. The image shows the information for the tokenizer object I loaded from the repo.
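My guess at the mechanism, sketched in plain Python (this is not the real transformers code; it assumes the loader appends default special tokens such as `<s>` and `</s>` when they are missing from vocab.json, which is an assumption on my part):

```python
# Sketch: a loader with default special tokens that are missing from the
# stored vocab might append them with fresh ids past the end of the vocab.
vocab = {tok: i for i, tok in enumerate(
    list("abcdefghijklmnopqrstuvwxyz") + ["'", "|", "[UNK]", "[PAD]"]
)}  # 30 entries, like my vocab.json

default_specials = ["<s>", "</s>"]  # hypothetical defaults
added_tokens = {}
for tok in default_specials:
    if tok not in vocab:
        added_tokens[tok] = len(vocab) + len(added_tokens)

print(len(vocab), len(added_tokens))  # 30 2 -> total length would be 32
```

If something like this is happening, it would explain a tokenizer reporting two more tokens than the 30 entries in vocab.json.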
How did this happen? Did I load the tokenizer incorrectly?