After extending the vocabulary, the tokenizer keeps running and never finishes

I am training a simple binary classification model in PyTorch, using a RoBERTa model from Hugging Face Transformers.

keyword_lst contains 20k new tokens, which I add to the tokenizer.

I initialize the embedding of each new token as the mean of the embeddings of its subword pieces under the original tokenizer.

I am training this model on 400,000 data points.

Here is the code:

import torch
import transformers as tr

# original tokenizer (kept to look up subword pieces) and a copy that receives the new tokens
tok_orig = tr.RobertaTokenizer.from_pretrained("../models/unitary_roberta/tokenizer")
tokenizer = tr.RobertaTokenizer.from_pretrained("../models/unitary_roberta/tokenizer")
tokenizer.add_tokens(keyword_lst)


# do tokenization
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
    
# make datasets
train_data = HateDataset(train_encodings, train_labels)
val_data = HateDataset(val_encodings, val_labels)
    
# load model
model = tr.RobertaForSequenceClassification.from_pretrained("../models/unitary_roberta/model", 
                                                             num_labels=2)
# add embedding params for new vocab words
model.resize_token_embeddings(len(tokenizer))
weights = model.roberta.embeddings.word_embeddings.weight
    
# initialize new embedding weights as mean of original tokens
with torch.no_grad():
    emb = []
    for word in keyword_lst:
        # first & last ids are the <s> / </s> special tokens; drop them
        tok_ids = tok_orig(word)["input_ids"][1:-1]
        tok_weights = weights[tok_ids]

        # average over the subword tokens of the original tokenization
        emb.append(torch.mean(tok_weights, dim=0))

    # assumes every entry of keyword_lst was actually added as a new token
    weights[-len(keyword_lst):, :] = torch.vstack(emb)
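For reference, HateDataset is just a thin Dataset wrapper around the encodings and labels. The exact class isn't shown above; this is a minimal sketch of the kind of implementation assumed here:

import torch

class HateDataset(torch.utils.data.Dataset):
    """Minimal sketch: wraps tokenizer output and labels for a DataLoader / Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # encodings were created with return_tensors="pt", so each value is already a tensor
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)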

After adding the new tokens, the tokenizer just keeps running and never finishes. :frowning:
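To make "keeps on running" concrete, here is a sketch of how the slowdown could be measured on a small slice of train_texts, assuming the slow part is the tokenizer(...) call rather than add_tokens itself (sample_texts and the slice size are just for illustration):

import time

# hypothetical small sample; train_texts is the full 400k-example list from above
sample_texts = train_texts[:1000]

start = time.time()
tok_orig(sample_texts, truncation=True, padding=True, max_length=512)
print("original tokenizer:", time.time() - start, "s")

start = time.time()
tokenizer(sample_texts, truncation=True, padding=True, max_length=512)
print("tokenizer with 20k added tokens:", time.time() - start, "s")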