I am currently fine-tuning the Wav2Vec2 XLS-R model. I created a tokenizer and stored it in one of my Hugging Face repos. Here is how I push the tokenizer to the Hub:
```python
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "./", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
tokenizer.push_to_hub(repo_name)
```
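For context, `Wav2Vec2CTCTokenizer.from_pretrained("./", ...)` reads its vocabulary from a local vocab.json in the current directory. A minimal sketch of how such a file could be built (the exact character set here is hypothetical, not my actual vocabulary):

```python
import json
import string

# Hypothetical character-level vocabulary: 26 lowercase letters,
# an apostrophe, the word delimiter, and the two special tokens.
chars = list(string.ascii_lowercase) + ["'", "|", "[UNK]", "[PAD]"]
vocab = {token: idx for idx, token in enumerate(chars)}

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

print(len(vocab))  # 30 entries in this sketch
```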
Later on, I want to reuse the tokenizer for the same task, so I load it with the following code:

```python
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(repo_name)
```
With that code, all the files the tokenizer needs are downloaded: vocab.json, tokenizer_config.json, and special_tokens_map.json. According to vocab.json, my vocabulary should contain 30 tokens.
But here is something strange: when I load the tokenizer, I get two additional tokens on top of the ones already in my vocabulary. The image shows the information for the tokenizer object I loaded from the repo.
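My guess at the mechanism, sketched in plain Python (this is not the real transformers code; it assumes the loader appends default special tokens such as `<s>` and `</s>` when they are missing from vocab.json, which is an assumption on my part):

```python
# Sketch: a loader with default special tokens that are missing from the
# stored vocab might append them with fresh ids past the end of the vocab.
vocab = {tok: i for i, tok in enumerate(
    list("abcdefghijklmnopqrstuvwxyz") + ["'", "|", "[UNK]", "[PAD]"]
)}  # 30 entries, like my vocab.json

default_specials = ["<s>", "</s>"]  # hypothetical defaults
added_tokens = {}
for tok in default_specials:
    if tok not in vocab:
        added_tokens[tok] = len(vocab) + len(added_tokens)

print(len(vocab), len(added_tokens))  # 30 2 -> total length would be 32
```

If something like this is happening, it would explain a tokenizer reporting two more tokens than the 30 entries in vocab.json.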
How did this happen? Did I load the tokenizer incorrectly?