Hello all,
I'm aware of the solutions previously discussed here for this problem, but I've had no luck with any of them.
I'm trying to implement a binary classifier. I'm using a custom dataset with a single text column containing German text; the label column has two classes, 0 and 1.
I'm using the deepset/gbert-base model with the number of labels set to 2.
I followed the official Hugging Face tutorial A full training - Hugging Face NLP Course.
Everything runs the same as in the tutorial until this step:
outputs = model(**batch)
I have tried the workarounds suggested in this forum and other coding forums, listed below:
- I checked the PyTorch version (online forums suggest updating PyTorch versions below 2); I'm using 2.0.0+cu118.
- The labels are of float type and do not contain any null values (online forums suggest checking that the labels' dtype is float, since the model expects them in that format). The check I ran is sketched below this list.
- I tried changing the label shape from [0] and [1] to [1, 0] for class 0 and [0, 1] for class 1, because the error says the model's input to the loss function has size [16, 2] while the targets (the labels here) have size [16]. Changing the shape this way did not solve the problem either; a rough sketch of that attempt is also below.
- I tried the Trainer API, following the official Hugging Face tutorial Fine-tuning a model with the Trainer API - Hugging Face NLP Course, and tried swapping the loss function from binary_cross_entropy_with_logits to nn.CrossEntropyLoss(), just to see whether the code would run, but I ended up with the same error. That attempt is also sketched below.
Here is my full code:
> from transformers import AutoTokenizer, DataCollatorWithPadding
> tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")
>
> def tokenize_function(examples):
>     return tokenizer(examples["text1"], truncation=True)
>
> tokenized_datasets = final_dataset_dict.map(tokenize_function, batched=True)
> data_collator= DataCollatorWithPadding(tokenizer)
> tokenized_datasets = tokenized_datasets.remove_columns(["text1"])
> tokenized_datasets["train"].column_names
> tokenized_datasets.set_format("torch")
>
> from torch.utils.data import DataLoader
>
> train_dataloader = DataLoader(tokenized_datasets["train"], shuffle = True, batch_size = 8, collate_fn = data_collator)
> eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size = 8, collate_fn = data_collator)
>
> for batch in train_dataloader:
>     break
> print({k: v.shape for k, v in batch.items()})
> #print(batch)
>
> from transformers import AutoModelForSequenceClassification
> checkpoint = "deepset/gbert-base"
> model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels =2)
>
> outputs = model(**batch)
> print(outputs.loss, outputs.logits.shape)
After tokenization, my data looks like this:
> DatasetDict({
>     train: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 2512
>     })
>     test: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 1255
>     })
>     validation: Dataset({
>         features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
>         num_rows: 1255
>     })
> })
The batch items in the train_dataloader look like this:
{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69])}
The detailed error is as follows:
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-36-b84c8f6552ab> in <cell line: 1>()
> ----> 1 outputs = model(**batch)
> 2 #print(outputs.shape)
> 3 print(outputs.loss, outputs.logits.shape)
>
> 4 frames
> /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
> 3161
> 3162 if not (target.size() == input.size()):
> -> 3163 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
> 3164
> 3165 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
>
> ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))
Any lead on this problem would be very much appreciated.