Fine-tuned BERT tokenizer taking too long to load

I fine-tuned a pre-trained BERT model on my custom dataset for the LM task, in order to introduce new vocabulary (around 40k new tokens) from my dataset. Now that I am trying to further fine-tune the trained model on a classification task, I have been unable to properly load the tokenizer with the added vocabulary.
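
For context, this is roughly how the tokens were added during the LM stage (a simplified sketch; the checkpoint name and token list are placeholders):

```python
from transformers import BertTokenizer, BertForMaskedLM

# Load the base pre-trained checkpoint (checkpoint name is a placeholder)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# In my case this is a list of ~40k domain-specific terms; these two are placeholders
new_tokens = ["domain_term_1", "domain_term_2"]
tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the model has rows for the new tokens
model.resize_token_embeddings(len(tokenizer))

# ... LM fine-tuning on the custom dataset ...
```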

  1. I tried loading it with BertTokenizer; encoding/tokenizing each sentence with encode_plus takes about 1m 23s. That’s too long, considering I have over 200k sentences for classification in my training data alone. I know I could also use batch_encode_plus with parallelization, but even then it would take forever to encode just my training data (see the sketch after this list).
  2. I also tried loading it with BertTokenizerFast and AutoTokenizer, but they take forever to load.
  3. I ran the same script with the pre-trained BERT tokenizer without my added tokens, and it takes a fraction of a second (994 µs) to encode the entire batch. So the problem is definitely with my own tokenizer, which has the newly added tokens.
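
Roughly what the classification-side script looks like (simplified; the directory path and sentences are placeholders):

```python
import time
from transformers import BertTokenizer

# Directory path is a placeholder for where the fine-tuned tokenizer was saved
tokenizer = BertTokenizer.from_pretrained("./my-finetuned-bert")

# In reality this is ~200k training sentences; two placeholders shown here
sentences = ["first example sentence", "second example sentence"]

start = time.time()
encoded = [
    tokenizer.encode_plus(
        s,
        max_length=128,
        padding="max_length",
        truncation=True,
    )
    for s in sentences
]
print(f"encode_plus: {time.time() - start:.3f}s for {len(sentences)} sentences")
```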

Has anyone encountered a similar problem before? While pre-training, I saved the tokenizer with AutoTokenizer's save_pretrained function. When I check the tokenizer after loading it with BertTokenizer, I can see all the newly added tokens via get_vocab(), so it’s unlikely that something went wrong while saving it.
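
This is the sanity check I mean (directory path and token name are placeholders):

```python
from transformers import BertTokenizer

# Reload the saved tokenizer (directory path is a placeholder)
reloaded = BertTokenizer.from_pretrained("./my-finetuned-bert")

vocab = reloaded.get_vocab()
print(len(vocab))                # base vocab plus the ~40k added tokens
print("domain_term_1" in vocab)  # the added tokens are all present -> True
```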