Can I fine-tune BERT for a project where I have multiple text inputs and one label as output?

Hello!

I’m working on a project and want to check whether this is the right way to use BERT.
The training data consists of two words (A and B) as input variables and one label (0, 1, 2, or 3) as output. Basically, the similarity between word A and word B can be used to predict this label. Here is what I did:

1. tokenize the words

tokens = tokenizer(
    train_texts_balanced['A'].values.tolist(),  # word A for each example
    train_texts_balanced['B'].values.tolist(),  # word B, passed as text_pair
    truncation=True,
    padding=True,
    max_length=15,
)

Output (the input_ids for one example): [101, 31624, 11435, 102, 11435, 32033, 102]
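If I decode those ids back to tokens, it looks like the tokenizer builds one sequence per pair in the standard [CLS] A [SEP] B [SEP] format, with token_type_ids marking which segment each word belongs to. A quick check (assuming a bert-base-uncased tokenizer; the example words here are made up):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# encode one A/B pair the same way as above
enc = tokenizer("apple", "banana", truncation=True, max_length=15)

# map the ids back to tokens to see the pair layout
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'apple', '[SEP]', 'banana', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 1, 1] -> segment A vs. segment B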
2. create torch dataset

import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one encoded example: input_ids, token_type_ids, attention_mask
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # the labels are class ids (0-3), so they need to be long (int64)
        # for the cross-entropy loss, not double
        item["labels"] = torch.tensor(self.labels.iloc[idx, 0]).long()
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels_balanced_new)
val_dataset = NewsGroupsDataset(val_encodings, val_labels_balanced_new)
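To double-check that the dataset returns what the Trainer expects, I peek at one item (just a sanity check):

# one example should contain input_ids, token_type_ids,
# attention_mask, and a scalar labels tensor
sample = train_dataset[0]
print({k: v.shape for k, v in sample.items()})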
3. load the model and train
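(For completeness: model and training_args are not shown above; this is roughly how I set them up. The checkpoint and hyperparameters below are placeholders, not exactly what I used.)

from transformers import BertForSequenceClassification, TrainingArguments

# four classes (labels 0-3), so num_labels=4
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4
)

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    num_train_epochs=3,              # placeholder hyperparameters
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",     # evaluate once per epoch
)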
trainer1 = MyTrainer(
    model=model,                      # the instantiated Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=val_dataset,         # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
trainer1.train()

I did get some results, but I’m not sure this is the right way to do it, especially how I tokenize the input variables. Maybe I should instead tokenize the words separately and feed them into the BERT model as two separate inputs, along the lines of the sketch below. Can you please advise?
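Something like this, where A and B are each encoded separately and the two pooled embeddings are combined before a classification head (just a sketch of the idea, not code I have run; the pooling and head are placeholders):

import torch
import torch.nn as nn
from transformers import BertModel

class TwoWordClassifier(nn.Module):
    def __init__(self, num_labels=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        # classify from [emb_a, emb_b, |emb_a - emb_b|]
        self.head = nn.Linear(hidden * 3, num_labels)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        emb_a = self.bert(input_ids=ids_a, attention_mask=mask_a).pooler_output
        emb_b = self.bert(input_ids=ids_b, attention_mask=mask_b).pooler_output
        feats = torch.cat([emb_a, emb_b, (emb_a - emb_b).abs()], dim=-1)
        return self.head(feats)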

Thanks!
