Multilabel sequence classification with RoBERTa: ValueError "Expected input batch_size to match target batch_size"

I'm trying to fine-tune a multilabel (4 labels) model based on roberta-base. I've followed the examples in https://huggingface.co/transformers/custom_datasets.html.

I'm trying to debug this ValueError:
Traceback (most recent call last):
    trainer.train()
  File "transformers/trainer.py", line 762, in train
    tr_loss += self.training_step(model, inputs)
  File "transformers/trainer.py", line 1112, in training_step
    loss = self.compute_loss(model, inputs)
  File "transformers/trainer.py", line 1136, in compute_loss
    outputs = model(**inputs)
  File "torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "transformers/modeling_roberta.py", line 1015, in forward
    loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
  File "torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "torch/nn/functional.py", line 2021, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "torch/nn/functional.py", line 1836, in nll_loss
    .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (16) to match target batch_size (64).

This is what I see in modeling_roberta at the point of the error: first the logits view, then the labels view. It looks like the labels for all of the examples in the batch have been flattened into a single tensor of 64 values, while the batch holds the labels separately for each of the 16 examples. This seems like it might be the cause of the ValueError, but I'm not sure, and I don't know where the labels would have been flattened. Any ideas?
tensor([[ 0.1793, 0.1338, -0.2123, -0.0945],
[ 0.0498, 0.0472, -0.1983, -0.0353],
[ 0.1932, 0.1970, -0.2003, -0.0471],
[ 0.0913, 0.1411, -0.1835, -0.1387],
[ 0.0770, -0.0101, -0.1017, -0.0149],
[ 0.1980, 0.0772, -0.1894, -0.0487],
[ 0.0161, 0.0107, -0.0100, 0.0067],
[ 0.1063, 0.1120, -0.1842, -0.0567],
[ 0.1610, 0.0769, -0.1609, -0.0883],
[ 0.1866, 0.0182, -0.1137, -0.1047],
[ 0.1132, 0.0587, -0.2452, -0.0698],
[ 0.1680, -0.0125, -0.2019, -0.0674],
[-0.0282, 0.1099, -0.1637, -0.1112],
[ 0.1620, 0.1197, -0.2099, 0.0236],
[ 0.1197, 0.1232, -0.2318, -0.0955],
[ 0.3232, 0.1935, -0.3226, -0.0547]], device='cuda:0',
grad_fn=)
labels view
tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0], device='cuda:0')
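
For what it's worth, the mismatch can be reproduced outside the Trainer with plain PyTorch. The sketch below assumes the labels reach the model as one multi-hot row of 4 values per example, i.e. shape (16, 4), which is only inferred from 16 * 4 = 64. The tensors are random stand-ins, not the data above. The last two lines show BCEWithLogitsLoss, a loss commonly used for multilabel targets; it is not what modeling_roberta.py calls in the traceback, so treat it as one possible direction rather than a confirmed fix.

```python
import torch
import torch.nn as nn

batch_size, num_labels = 16, 4

# Stand-ins for the tensors in the post (assumed shapes, random values).
logits = torch.randn(batch_size, num_labels)
labels = torch.zeros(batch_size, num_labels, dtype=torch.long)
labels[:, 3] = 1  # arbitrary multi-hot pattern, for illustration only

# Same reshaping as line 1015 of modeling_roberta.py in the traceback.
flat_logits = logits.view(-1, num_labels)  # shape (16, 4)
flat_labels = labels.view(-1)              # shape (64,)

# CrossEntropyLoss expects one class index per row of logits, so
# 16 rows against 64 targets is rejected.
try:
    nn.CrossEntropyLoss()(flat_logits, flat_labels)
    error_message = None
except (ValueError, RuntimeError) as err:
    error_message = str(err)
print(error_message)

# Multilabel alternative (an assumption, not taken from the traceback):
# BCEWithLogitsLoss takes float targets with the same shape as the logits.
bce_loss = nn.BCEWithLogitsLoss()(logits, labels.float())
print(bce_loss.item())
```

If the assumption about the label shape holds, nothing upstream is flattening the labels by accident: labels.view(-1) in the model's forward turns the (16, 4) label tensor into 64 values, and the cross-entropy loss then compares that against 16 rows of logits.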