Multilabel sequence classification with RoBERTa: ValueError, expected input batch_size to match target batch_size

Trying to tune a multilabel (4 labels) model based on roberta-base. I’ve followed the examples in https://huggingface.co/transformers/custom_datasets.html.

Trying to debug this ValueError:

    Traceback (most recent call last):
      trainer.train()
      File "transformers/trainer.py", line 762, in train
        tr_loss += self.training_step(model, inputs)
      File "transformers/trainer.py", line 1112, in training_step
        loss = self.compute_loss(model, inputs)
      File "transformers/trainer.py", line 1136, in compute_loss
        outputs = model(**inputs)
      File "torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "transformers/modeling_roberta.py", line 1015, in forward
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
      File "torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "torch/nn/modules/loss.py", line 916, in forward
        ignore_index=self.ignore_index, reduction=self.reduction)
      File "torch/nn/functional.py", line 2021, in cross_entropy
        return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
      File "torch/nn/functional.py", line 1836, in nll_loss
        .format(input.size(0), target.size(0)))
    ValueError: Expected input batch_size (16) to match target batch_size (64).

This is what I see in modeling_roberta at the point of the error. It looks like the labels for all examples in the batch have been flattened into a single tensor of 64 values, while the batch holds the labels separately for each of the 16 examples. This seems to be the cause of the ValueError, but I’m not sure, and I don’t know where the labels would have been flattened. Any ideas?

    tensor([[ 0.1793, 0.1338, -0.2123, -0.0945],
            [ 0.0498, 0.0472, -0.1983, -0.0353],
            [ 0.1932, 0.1970, -0.2003, -0.0471],
            [ 0.0913, 0.1411, -0.1835, -0.1387],
            [ 0.0770, -0.0101, -0.1017, -0.0149],
            [ 0.1980, 0.0772, -0.1894, -0.0487],
            [ 0.0161, 0.0107, -0.0100, 0.0067],
            [ 0.1063, 0.1120, -0.1842, -0.0567],
            [ 0.1610, 0.0769, -0.1609, -0.0883],
            [ 0.1866, 0.0182, -0.1137, -0.1047],
            [ 0.1132, 0.0587, -0.2452, -0.0698],
            [ 0.1680, -0.0125, -0.2019, -0.0674],
            [-0.0282, 0.1099, -0.1637, -0.1112],
            [ 0.1620, 0.1197, -0.2099, 0.0236],
            [ 0.1197, 0.1232, -0.2318, -0.0955],
            [ 0.3232, 0.1935, -0.3226, -0.0547]], device='cuda:0',
           grad_fn=)
    labels view
    tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
            0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0], device='cuda:0')
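For anyone who wants to reproduce the shape mismatch without a model or a GPU, here is a minimal sketch. It assumes random CPU logits and made-up multi-hot labels (neither comes from the post) and applies the same two `view` calls that appear in the `modeling_roberta.py` line of the traceback.

```python
import torch

batch_size, num_labels = 16, 4

# Stand-in logits: one row per example, one column per label (random values).
logits = torch.randn(batch_size, num_labels)

# Made-up multi-hot labels with the same shape as the logits.
labels = torch.zeros(batch_size, num_labels, dtype=torch.long)
labels[:, 3] = 1

# The same reshaping as in the traceback's loss_fct(...) call.
flat_logits = logits.view(-1, num_labels)  # stays (16, 4)
flat_labels = labels.view(-1)              # becomes (64,)

loss_fct = torch.nn.CrossEntropyLoss()
try:
    loss_fct(flat_logits, flat_labels)
    error_message = None
except (ValueError, RuntimeError) as exc:
    # Cross-entropy wants one class index per row of logits, so 16 rows
    # cannot be paired with 64 targets.
    error_message = str(exc)

print(flat_logits.shape, flat_labels.shape)
print(error_message)
```

So the flattening happens inside the model's own loss computation, not in the dataset or the data collator.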

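I had the same problem. The error is raised in nll_loss because the model’s default loss here is cross-entropy (see the loss_fct line in the traceback), which expects a single class index per example. With multi-hot labels, labels.view(-1) turns the 16 x 4 label tensor into 64 targets, which no longer match the 16 rows of logits. For multilabel problems, BCEWithLogitsLoss is the most common choice, I think. You can subclass Trainer and override the compute_loss function in your custom trainer to make things work. This worked for me: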


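    import torch
    from transformers import Trainer


    class CustomTrainer(Trainer):
        # **kwargs absorbs extra keyword arguments that some Trainer versions pass.
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            outputs = model(
                input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                # RoBERTa tokenizers usually do not return token_type_ids,
                # so fall back to None instead of raising a KeyError.
                token_type_ids=inputs.get('token_type_ids'),
            )
            # BCEWithLogitsLoss needs float targets shaped like the logits.
            loss = torch.nn.BCEWithLogitsLoss()(outputs['logits'],
                                                inputs['labels'].float())
            return (loss, outputs) if return_outputs else loss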
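To see why this loss fits the multilabel case, here is a small standalone sketch. It assumes random logits and random multi-hot labels (illustrative values only) and checks that BCEWithLogitsLoss accepts a 16 x 4 target directly, treating each of the four labels as an independent yes/no decision.

```python
import torch

batch_size, num_labels = 16, 4

# Illustrative inputs: random logits and random 0/1 multi-hot labels.
logits = torch.randn(batch_size, num_labels)
labels = torch.randint(0, 2, (batch_size, num_labels))

# No flattening needed: targets keep the logits' shape but are cast to float.
loss_fct = torch.nn.BCEWithLogitsLoss()
loss = loss_fct(logits, labels.float())

# The same quantity written out by hand: a sigmoid per label, then the
# binary cross-entropy averaged over all batch_size * num_labels entries.
probs = torch.sigmoid(logits)
manual = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()

print(loss.item(), manual.item())
```

The two printed values should agree up to floating-point error. The built-in loss is still preferable in practice because it works on the logits directly, which is more numerically stable than taking the log of a sigmoid.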