Data format for BertForSequenceClassification with num_labels > 2

maxpower · March 4, 2021, 5:03pm

Hi,
I have a multilabel task (num_labels=8) and I want to train BertForSequenceClassification with Trainer.

But I get the following error:

ValueError: Expected input batch_size (8) to match target batch_size (64).

I assume that the problem is the data format of the labels. Currently, each label is an 8-dim list (e.g., [1,0,0,0,0,1,0,0]).

What is the right format for the label data?

Here is my code:

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)


class EmotionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


MODEL_NAME = 'dbmdz/bert-base-german-uncased'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=8)

# tokenize data (df_train is a pandas DataFrame with 'text' and 'label' columns)
dataset_train = Dataset.from_pandas(df_train)
train_encodings = tokenizer(dataset_train['text'], truncation=True, padding=True)
train_dataset = EmotionDataset(train_encodings, dataset_train['label'])

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                     # the instantiated Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_dataset,     # training dataset
    eval_dataset=test_dataset,       # evaluation dataset (built the same way as train_dataset)
)

_ = trainer.train()
trainer.evaluate()

Thanks,
Max

lewtun · March 4, 2021, 5:29pm

Hi @maxpower, I think the format of your dataset is fine, but you need to change the model's loss function so that it applies a sigmoid instead of a softmax to the logits (i.e. use BCEWithLogitsLoss). You can see a skeleton + hacky Colab in this thread: Fine-Tune for MultiClass or MultiLabel-MultiClass - #8 by lewtun
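
To make the shape problem concrete, here is a minimal sketch in plain PyTorch with random tensors and no BERT involved. It assumes the single-label loss path flattens the labels before applying cross-entropy, which would be consistent with the 8 vs. 64 in the error above. The tensor names are made up for illustration.

```python
import torch
import torch.nn as nn

batch_size, num_labels = 8, 8

# Stand-ins for the classification head's output and the multi-hot labels.
logits = torch.randn(batch_size, num_labels)
labels = torch.randint(0, 2, (batch_size, num_labels))

# Cross-entropy expects one class index per example. Flattening the
# multi-hot labels yields 64 targets for only 8 rows of logits.
shape_error = None
try:
    nn.CrossEntropyLoss()(logits, labels.view(-1))
except (ValueError, RuntimeError) as err:
    shape_error = err
print("cross-entropy failed:", shape_error)

# BCEWithLogitsLoss applies a sigmoid to each logit independently and
# expects float targets with the same shape as the logits.
bce_loss = nn.BCEWithLogitsLoss()(logits, labels.float())
print("BCE loss:", bce_loss.item())
```

The cast to float matters: BCEWithLogitsLoss treats each of the 8 positions as its own binary target, so integer labels need converting first.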

maxpower · March 5, 2021, 9:46am

Perfect, it works. Thanks so much!

lewtun · March 5, 2021, 10:13am

FYI, I just posted a more elegant solution in the thread that simply subclasses Trainer and overrides the compute_loss function (you can see it in action in the Colab notebook too).
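
For readers who don't want to click through, a subclass along those lines could look roughly like the sketch below. It is not copied from the linked post: the class name MultilabelTrainer is illustrative, and the extra **kwargs is an assumption added so the override tolerates additional arguments that some Trainer versions pass to compute_loss.

```python
import torch
from transformers import Trainer


class MultilabelTrainer(Trainer):
    """Trainer that scores multi-hot labels with a per-label sigmoid loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Remove the labels so the model does not compute its own loss.
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # shape: (batch_size, num_labels)

        # BCEWithLogitsLoss needs float targets shaped like the logits.
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits, labels.float().view_as(logits))
        return (loss, outputs) if return_outputs else loss
```

It would be used in place of Trainer with the same arguments, e.g. MultilabelTrainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset).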

opey · July 21, 2021, 5:01pm

Hi lewtun,
Thanks for your help so far.
But I’m having issues getting it to work for multiclass classification.
The custom metric in the notebook only works for multilabel classification.
Is there anything I need to do please?

lewtun · August 2, 2021, 6:14pm

hey @opey for ordinary multiclass classification you can follow the official tutorial here or just run one of the scripts in the examples here

hope that helps!
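
As a rough illustration of the difference: in ordinary multiclass classification each example has a single integer class id rather than a multi-hot vector, so a metric can take the argmax over the logits. The function below is a sketch, not taken from the tutorial or the notebook. It assumes the (logits, labels) pair that Trainer passes to compute_metrics, and the accuracy key name is arbitrary.

```python
import numpy as np


def compute_metrics(eval_pred):
    # eval_pred unpacks into (logits, labels); labels are integer class ids.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((preds == labels).mean())}


# Tiny hand-made check: 2 examples, 3 classes, one correct prediction.
example_logits = np.array([[0.1, 0.9, 0.0], [2.0, 0.1, 0.3]])
example_labels = np.array([1, 2])
print(compute_metrics((example_logits, example_labels)))
```

It would be passed to the stock Trainer via compute_metrics=compute_metrics, with no custom loss needed.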
