Hello!
I’m working on a project and want to check whether this is the right way to use BERT.
The training data has two words (A and B) as variables and one label (0, 1, 2, or 3); basically, the similarity between word A and word B can be used to predict this label. Here is what I did:
1. Tokenize the words
train_encodings = tokenizer(
    train_texts_balanced['A'].values.tolist(),
    train_texts_balanced['B'].values.tolist(),
    truncation=True, padding=True, max_length=15,
)
output (input_ids for one A/B pair): [101, 31624, 11435, 102, 11435, 32033, 102]
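If it helps, this is what the pair encoding looks like on a toy example (sketch only; the words and checkpoint here are placeholders): passing A as the first argument and B as the second gives one sequence [CLS] A [SEP] B [SEP], with token_type_ids marking which word each piece belongs to.

from transformers import AutoTokenizer

# sketch with placeholder words, just to show the structure of the pair encoding
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(["apple"], ["banana"], truncation=True, padding=True, max_length=15)
print(enc["input_ids"])       # [CLS] apple pieces [SEP] banana pieces [SEP]
print(enc["token_type_ids"])  # 0s for the A segment, 1s for the B segment
print(enc["attention_mask"])  # 1s for real tokens, 0s for padding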
2. Create the torch dataset
import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one encoded A/B pair plus its label
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels.iloc[idx, 0]).double()
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels_balanced_new)
val_dataset = NewsGroupsDataset(val_encodings, val_labels_balanced_new)
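A quick sanity check on one item from the dataset (as I understand it, the stock single-label cross-entropy loss expects integer/long labels, so the .double() above presumably only works because MyTrainer overrides the loss):

item = train_dataset[0]
for k, v in item.items():
    # input_ids / token_type_ids / attention_mask are 1-D tensors of length <= 15,
    # labels is a 0-dim tensor holding 0, 1, 2, or 3
    print(k, v.shape, v.dtype)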
3. Load the model and train
trainer1 = MyTrainer(
    model=model,                      # the instantiated Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=val_dataset,         # evaluation dataset
    compute_metrics=compute_metrics,  # the callback that computes metrics of interest
)
trainer1.train()
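For completeness, model, training_args, and compute_metrics were set up roughly like this (the checkpoint name, hyperparameters, and metric below are placeholders, not my exact values):

import numpy as np
from transformers import AutoModelForSequenceClassification, TrainingArguments

# placeholder checkpoint; 4 output classes for labels 0-3
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

# placeholder hyperparameters
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
)

def compute_metrics(eval_pred):
    # simple accuracy over the evaluation set
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}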
I did get some results, but I’m not sure this is the right way to do it, especially the way I tokenize the input variables. Maybe I should tokenize them separately and feed them into the BERT model as two separate inputs (see the sketch below). Can you please advise?
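By "separately" I mean roughly this (sketch only, not something I have run):

# encode A and B independently instead of as one [CLS] A [SEP] B [SEP] pair
enc_a = tokenizer(train_texts_balanced['A'].values.tolist(),
                  truncation=True, padding=True, max_length=15)
enc_b = tokenizer(train_texts_balanced['B'].values.tolist(),
                  truncation=True, padding=True, max_length=15)
# ...then run BERT on each input and combine the two [CLS] embeddings
# (e.g. concatenate them) before a small classification head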
Thanks!