Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))

Hello all,

I am aware of the solutions previously discussed here for this same problem, but I have had no luck with them.

I’m trying to implement a binary classifier. I’m using a custom dataset with one text column containing German text and a label column with two classes, 0 and 1.

I’m using the deepset/gbert-base model with the number of labels set to 2.
I have followed the official Hugging Face tutorial, “A full training” (Hugging Face NLP Course).
Everything matches the tutorial up to this step:

outputs = model(**batch)

I have tried the following workarounds suggested in this forum and on other coding forums:

  1. I checked the PyTorch version (online forums suggested updating PyTorch if it is below version 2). I’m using the following:

2.0.0+cu118

  2. The labels are of float type and do not contain any null values (online forums suggested checking that the labels are floats, as the model expects them in that format).

  3. I also tried changing the label shape from [0] and [1] to [1,0] for class 0 and [0,1] for class 1, because the error says the input from the model to the loss function has size [16,2] while the target (the labels) has size [16]. This did not solve the problem either.

  4. I tried the Trainer API, following the official Hugging Face tutorial “Fine-tuning a model with the Trainer API” (Hugging Face NLP Course), and tried to swap the loss function from binary_cross_entropy_with_logits to nn.CrossEntropyLoss(), just to see whether the code would run, but ended up with the same error.

Code:

> from transformers import AutoTokenizer, DataCollatorWithPadding
> tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
> 
> def tokenize_function(examples):
>     return tokenizer(examples["text1"], truncation=True)
> 
> tokenized_datasets = final_dataset_dict.map(tokenize_function, batched=True)
> data_collator= DataCollatorWithPadding(tokenizer)
> tokenized_datasets = tokenized_datasets.remove_columns(["text1"])
> tokenized_datasets["train"].column_names
> tokenized_datasets.set_format("torch")
> 
> from torch.utils.data import DataLoader
> 
> train_dataloader = DataLoader(tokenized_datasets["train"], shuffle = True, batch_size = 8, collate_fn = data_collator)
> # The DatasetDict shown below only has train/test/validation splits, so use "validation" here.
> eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size = 8, collate_fn = data_collator)
> 
> for batch in train_dataloader:
>   break
> print({k: v.shape for k, v in batch.items()})
> #print(batch)
> 
> from transformers import AutoModelForSequenceClassification
> checkpoint = "deepset/gbert-base"
> model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels =2)
> 
> outputs = model(**batch)
> print(outputs.loss, outputs.logits.shape)

After tokenization, my data looks like this:

> DatasetDict({
>     train: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 2512
>     })
>     test: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 1255
>     })
>     validation: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 1255
>     })
> })

The batch items in the train_dataloader look like this:

{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69])}

The detailed error is as follows:

> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-36-b84c8f6552ab> in <cell line: 1>()
> ----> 1 outputs = model(**batch)
>       2 #print(outputs.shape)
>       3 print(outputs.loss, outputs.logits.shape)
> 
> 4 frames
> /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
>    3161 
>    3162     if not (target.size() == input.size()):
> -> 3163         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
>    3164 
>    3165     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
> 
> ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))

Any lead on this problem would be very much appreciated. :pray: :nerd_face:

Can anybody please give me an idea why this error is happening?
I also tried other German BERT models but got the same error:

  1. nlptown/bert-base-multilingual-uncased-sentiment
  2. papluca/xlm-roberta-base-language-detection
  3. oliverguhr/german-sentiment-bert

Any lead will be a big help.

Changing the label datatype to integer solved the problem.

df['labels'] = df['labels'].astype(int)
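
A likely explanation, for anyone landing here later: in the transformers versions I have looked at, when config.problem_type is not set and num_labels is greater than 1, the sequence classification head picks its loss from the dtype of the labels. Integer labels give single-label classification (cross-entropy), while float labels fall through to multi-label classification (binary_cross_entropy_with_logits), which requires the target to have the same shape as the logits. The sketch below reproduces that behaviour with plain PyTorch. Note that pick_problem_type is a hypothetical helper written for illustration, not a transformers function, and the label values are made up.

```python
import torch
import torch.nn.functional as F

# Shapes taken from the traceback: logits are [8, 2], labels are [8].
logits = torch.randn(8, 2)
float_labels = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0])  # float32
int_labels = float_labels.long()                                        # int64


def pick_problem_type(labels, num_labels=2):
    """Hypothetical helper imitating the dtype-based choice described above."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels.dtype in (torch.long, torch.int):
        return "single_label_classification"
    return "multi_label_classification"


# Float labels: the multi-label path is chosen, and BCE-with-logits
# rejects a [8] target against [8, 2] logits.
try:
    F.binary_cross_entropy_with_logits(logits, float_labels)
    bce_error = None
except ValueError as err:
    bce_error = str(err)

# Integer labels: cross-entropy accepts class indices of shape [8].
ce_loss = F.cross_entropy(logits, int_labels)

print(pick_problem_type(float_labels), "->", bce_error)
print(pick_problem_type(int_labels), "-> loss", ce_loss.item())
```

If the labels are already tensors rather than a DataFrame column, calling .long() on them before they reach the model should have the same effect as the astype(int) line above.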

Thank you very much for the solution! I was getting the same error and spent 2 hours debugging before finally finding this fix. Changing the label datatype worked for me :smiley:

Thanks for sharing this! It also solved my issue! I just had to convert the labels from float to int!

What if the labels are already typed as int64?
I get the same error for sequence classification with a BERT model, even though my labels start from 0. Here’s what my training dataset looks like:

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 393
})