Dimension issue when fine tuning blenderbot

Hi everyone, I’m new to Deep Learning. I came across the wonderful Transformers library last week and am attempting to fine-tune the Blenderbot model. I work at a charity that helps people with cancer and wanted to see whether I could use a chatbot on our online cancer forum to answer basic questions for our users.

Just to test whether I can get the fine-tuning process to work, I’m initially loading a simple dataset from a CSV file with numeric labels and corresponding strings using the load_dataset function, e.g.

1, Acute lymphoblastic leukaemia is a type of blood cancer it starts from white blood cells called lymphocytes in the bone marrow.
1, acute lymphoblastic leukaemia usually develops quickly over days or weeks
1, To understand how and why leukaemia affects you as it does, it helps to know how you make blood cells.

I followed the fine-tuning documentation from the Hugging Face course, which is nicely written and seemed fairly straightforward:

from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
import torch

checkpoint = "facebook/blenderbot-400M-distill"

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = BlenderbotTokenizer.from_pretrained(checkpoint)
model = BlenderbotForConditionalGeneration.from_pretrained(checkpoint).to(device)

I then tokenized the texts with train_encodings = tokenizer(train_texts, truncation=True, padding=True), created my dataset class, and passed in the resulting encodings:

import torch

class CancerDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: a tensor per tokenizer field, plus the label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Each label is a single integer, so a batch of labels is 1-D
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CancerDataset(train_encodings, train_labels)
val_dataset = CancerDataset(val_encodings, val_labels)

and then tried to train the model

from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = BlenderbotForConditionalGeneration.from_pretrained(checkpoint).to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # This is the call that raises the IndexError
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

but I’m getting an error message saying "too many indices for tensor of dimension 1". I suspected this might be an issue with the shape of the labels, which have a shape of torch.Size([8]), whereas the input_ids have a shape of torch.Size([8, 88]). I looked online for ways to reshape a PyTorch tensor and came across the unsqueeze(0) method for adding a dimension, but applying it to the label tensor didn’t work.

IndexError                                Traceback (most recent call last)
<ipython-input-57-42b54a5f7d1d> in <module>
     25         print(f" ---- 3b ----  labels shape {labels.shape}")
---> 27         outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
     28         loss = outputs[0]
     29         loss.backward()

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/blenderbot/modeling_blenderbot.py in shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id)
     67     """
     68     shifted_input_ids = input_ids.new_zeros(input_ids.shape)
---> 69     shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
     70     shifted_input_ids[:, 0] = decoder_start_token_id

IndexError: too many indices for tensor of dimension 1

Any advice on where I might fix this would be really appreciated, thank you.
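
For anyone hitting the same traceback, here is a minimal sketch that reproduces the failing line outside the model. The helper shift_right_sketch is a hypothetical stand-in that copies only the three lines visible in the traceback above; it is not the library function, and the token ids are made up. It suggests that the model expects labels shaped (batch, seq_len), i.e. token ids of a target text, rather than one integer per example.

```python
import torch

def shift_right_sketch(input_ids, decoder_start_token_id):
    # Mirrors the three lines shown in the traceback; not the library function itself.
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

# Integer class labels: one number per example, shape (batch,)
class_labels = torch.tensor([1, 1, 1])
try:
    shift_right_sketch(class_labels, decoder_start_token_id=1)
    error_message = None
except IndexError as err:
    # 2-D indexing on a 1-D tensor fails here
    error_message = str(err)

# Tokenized target text: shape (batch, seq_len). These ids are invented for illustration.
token_labels = torch.tensor([[5, 6, 7, 2], [8, 9, 2, 0]])
shifted = shift_right_sketch(token_labels, decoder_start_token_id=1)
```

Calling unsqueeze(0) on the integer labels would make the tensor 2-D, but the values would presumably still be read as a single sequence of token ids rather than as target text, which may be why it did not help.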

I managed to get around this issue by tokenizing the labels as well, rather than passing them in as integers. I’m not sure whether there’s a better approach.

class CancerDataset(torch.utils.data.Dataset):
    def __init__(self, tokenizer, data):
        tokenizer.pad_token = tokenizer.eos_token
        # Tokenize the source texts (encoder inputs)
        self.encodings = tokenizer(data['text'], padding=True, truncation=True)
        # Tokenize the targets too, so each label is a sequence of token ids
        with tokenizer.as_target_tokenizer():
            self.targets = tokenizer(data['label'], padding=True, truncation=True)
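
The rest of such a dataset class might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this post: the class name Seq2SeqTextDataset is hypothetical, the token ids are invented stand-ins for real tokenizer output, and 0 is assumed to be the pad id. It returns tokenized targets as labels and replaces padding positions with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
import torch

class Seq2SeqTextDataset(torch.utils.data.Dataset):
    """Pairs tokenized source texts with tokenized target texts (hypothetical name)."""

    def __init__(self, encodings, target_ids, pad_token_id):
        self.encodings = encodings        # dict with "input_ids" and "attention_mask"
        self.target_ids = target_ids      # padded lists of target token ids
        self.pad_token_id = pad_token_id

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        labels = torch.tensor(self.target_ids[idx])
        # Replace padding with -100 so the loss skips those positions
        labels[labels == self.pad_token_id] = -100
        item["labels"] = labels
        return item

    def __len__(self):
        return len(self.target_ids)

# Made-up token ids standing in for tokenizer output; 0 plays the pad token here.
encodings = {
    "input_ids": [[5, 6, 2], [7, 2, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}
target_ids = [[8, 9, 2, 0], [10, 11, 12, 2]]

dataset = Seq2SeqTextDataset(encodings, target_ids, pad_token_id=0)
loader = torch.utils.data.DataLoader(dataset, batch_size=2)
batch = next(iter(loader))  # batch["labels"] is now 2-D: (batch, seq_len)
```

One caveat with masking by id: if the pad token is set to the EOS token, as in the snippet above, this approach would also mask the real EOS token in each target. Newer Transformers releases also deprecate as_target_tokenizer in favor of passing text_target= to the tokenizer call, so that may be worth checking against your installed version.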