I finetuned a pre-trained BERT model on my custom dataset for the LM task, to introduce new vocabulary (around 40k new tokens) from my dataset. Now that I am trying to further finetune the trained model on another classification task, I have been unable to load the pre-trained tokenizer with the added vocabulary properly.
- I tried loading it up using BertTokenizer; encoding/tokenizing each sentence with encode_plus takes 1 min 23 s. That's far too slow considering I have over 200k sentences for classification just in my training data. I know I can also use batch_encode_plus with parallelization, but even then it would take forever to encode just the training set.
- I also tried loading it with BertTokenizerFast and AutoTokenizer, but those take forever just to load.
- I tried running the same script with the pre-trained BERT tokenizer without my added tokens, and it takes a fraction of a second (994 µs) to encode the entire batch. So the problem is definitely with my own pre-trained tokenizer, which has the newly added tokens.
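For what it's worth, one plausible cause (an assumption about the slow tokenizer's internals, not something I've confirmed in the source) is that the pure-Python tokenizer isolates added tokens by splitting the input on each added token one at a time, so every encode call scans all ~40k additions. A minimal stdlib sketch of that splitting strategy, which would scale linearly with the added vocabulary:

```python
def split_on_added_tokens(text, added_tokens):
    """Naively split `text` on each added token in turn, keeping matched
    tokens as standalone segments. Each call does one pass per added
    token, so cost grows linearly with len(added_tokens)."""
    segments = [text]
    for tok in added_tokens:
        next_segments = []
        for seg in segments:
            if seg in added_tokens:  # already isolated; don't re-split
                next_segments.append(seg)
                continue
            parts = seg.split(tok)
            for i, part in enumerate(parts):
                if i:
                    next_segments.append(tok)
                if part:
                    next_segments.append(part)
        segments = next_segments
    return segments

print(split_on_added_tokens("the gpu kernel ran", ["gpu", "kernel"]))
# → ['the ', 'gpu', ' ', 'kernel', ' ran']
```

With 40k added tokens, even a short sentence pays for 40k such passes, which would match the per-sentence timings above.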
Has anyone encountered a similar problem before? When saving after pretraining, I used AutoTokenizer's save_pretrained function. When I inspect the tokenizer after loading it with BertTokenizer, I can see all the newly added tokens via get_vocab(). So it's unlikely that something went wrong while saving it.
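As a sanity check on the saved files themselves: in the transformers versions I've used, save_pretrained writes the base vocabulary to vocab.txt and the extra tokens to added_tokens.json, so you can count the serialized additions without loading the tokenizer at all. A small stdlib helper (the directory name is a placeholder):

```python
import json
from pathlib import Path

def count_added_tokens(tokenizer_dir):
    """Count entries in the added_tokens.json written by save_pretrained;
    returns 0 if the file is absent."""
    path = Path(tokenizer_dir) / "added_tokens.json"
    if not path.exists():
        return 0
    return len(json.loads(path.read_text(encoding="utf-8")))

# e.g. count_added_tokens("my_saved_tokenizer")  # should report ~40k here
```

If this reports the expected ~40k, the save step is fine and the slowdown is purely in how the slow tokenizer consumes those added tokens at encode time.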