Hello!
I’m working on a project and want to check whether this is the right way to use BERT.
The training data has two words (A and B) as variables and one label (0, 1, 2, or 3); basically, the similarity between word A and word B can be used to predict this label. Here is what I did:
1. Tokenize the words
train_encodings = tokenizer(
    train_texts_balanced['A'].values.tolist(),
    train_texts_balanced['B'].values.tolist(),
    truncation=True, padding=True, max_length=15,
)
output (input_ids for one A/B pair): [101, 31624, 11435, 102, 11435, 32033, 102]
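If it helps, this is what the pair encoding looks like on a toy example (sketch only; the words and checkpoint here are placeholders): passing A as the first argument and B as the second gives one sequence [CLS] A [SEP] B [SEP], with token_type_ids marking which word each piece belongs to.

from transformers import AutoTokenizer

# sketch with placeholder words, just to show the structure of the pair encoding
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(["apple"], ["banana"], truncation=True, padding=True, max_length=15)
print(enc["input_ids"])       # [CLS] apple pieces [SEP] banana pieces [SEP]
print(enc["token_type_ids"])  # 0s for the A segment, 1s for the B segment
print(enc["attention_mask"])  # 1s for real tokens, 0s for padding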
2. Create the torch dataset
import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one encoded A/B pair plus its label
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels.iloc[idx, 0]).double()
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels_balanced_new)
val_dataset = NewsGroupsDataset(val_encodings, val_labels_balanced_new)
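A quick sanity check on one item from the dataset (as I understand it, the stock single-label cross-entropy loss expects integer/long labels, so the .double() above presumably only works because MyTrainer overrides the loss):

item = train_dataset[0]
for k, v in item.items():
    # input_ids / token_type_ids / attention_mask are 1-D tensors of length <= 15,
    # labels is a 0-dim tensor holding 0, 1, 2, or 3
    print(k, v.shape, v.dtype)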
3. Load the model and train
trainer1 = MyTrainer(
    model=model,                      # the instantiated Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=val_dataset,         # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
trainer1.train()
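For completeness, model, training_args, and compute_metrics were set up roughly like this (the checkpoint name, hyperparameters, and metric below are placeholders, not my exact values):

import numpy as np
from transformers import AutoModelForSequenceClassification, TrainingArguments

# placeholder checkpoint; 4 output classes for labels 0-3
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# placeholder hyperparameters
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
)

def compute_metrics(eval_pred):
    # simple accuracy over the evaluation set
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}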
I did get some results, but I’m not sure this is the right way to do it, especially the way I tokenize the input variables. Maybe I should tokenize them separately and feed them into the BERT model as two separate inputs (see the sketch below). Can you please advise?
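By "separately" I mean roughly this (sketch only, not something I have run):

# encode A and B independently instead of as one [CLS] A [SEP] B [SEP] pair
enc_a = tokenizer(train_texts_balanced['A'].values.tolist(),
                  truncation=True, padding=True, max_length=15)
enc_b = tokenizer(train_texts_balanced['B'].values.tolist(),
                  truncation=True, padding=True, max_length=15)
# ...then run BERT on each input and combine the two [CLS] embeddings
# (e.g. concatenate them) before a small classification head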
Thanks!