I am training a simple binary classification model in PyTorch using a Hugging Face model (BERT-family; concretely RoBERTa). keyword_lst contains 20k new tokens, which I add to the tokenizer, and I initialize each new token's embedding as the mean of the embeddings of its sub-tokens under the original tokenizer. I am training this model on 400,000 data points.
Here is the code:
import torch
import transformers as tr

# keep a copy of the original tokenizer so the new keywords can still be
# split into their original sub-tokens
tok_orig = tr.RobertaTokenizer.from_pretrained("../models/unitary_roberta/tokenizer")
tokenizer = tr.RobertaTokenizer.from_pretrained("../models/unitary_roberta/tokenizer")
tokenizer.add_tokens(keyword_lst)

# do tokenization
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

# make datasets
train_data = HateDataset(train_encodings, train_labels)
val_data = HateDataset(val_encodings, val_labels)

# load model
model = tr.RobertaForSequenceClassification.from_pretrained("../models/unitary_roberta/model",
                                                            num_labels=2)

# add embedding rows for the new vocab words
model.resize_token_embeddings(len(tokenizer))
weights = model.roberta.embeddings.word_embeddings.weight

# initialize each new embedding row as the mean of the embeddings of the
# word's original sub-tokens
with torch.no_grad():
    emb = []
    for word in keyword_lst:
        # first & last ids are the <s> / </s> special tokens; drop them
        tok_ids = tok_orig(word)["input_ids"][1:-1]
        tok_weights = weights[tok_ids]
        # average over the sub-tokens of the original tokenization
        weight_mean = torch.mean(tok_weights, dim=0)
        emb.append(weight_mean)
    # the new rows sit at the end of the resized embedding matrix
    weights[-len(keyword_lst):, :] = torch.vstack(emb)
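For completeness, HateDataset is just the usual encodings-plus-labels PyTorch Dataset wrapper; a minimal sketch of what it looks like (the exact class may differ slightly):

import torch

class HateDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one dict of input_ids / attention_mask tensors plus the label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)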
The problem is that the tokenization step never finishes: the tokenizer just keeps on running. How can I speed it up?
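To make the slowdown easy to measure, tokenization can be timed on a small slice of the data with the original tokenizer versus the one with the 20k added tokens; a rough sketch (the slice size of 1,000 is arbitrary):

import time

sample = train_texts[:1000]  # arbitrary small slice, just for timing

for name, tok in [("original", tok_orig), ("with 20k added tokens", tokenizer)]:
    start = time.time()
    tok(sample, truncation=True, padding=True, max_length=512)
    print(f"{name}: {time.time() - start:.1f}s")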