Training an EncoderDecoderModel with different encoder and decoder checkpoints

I am trying to train an EncoderDecoderModel with “roberta-base” as the encoder and “gpt2” as the decoder on the SQuAD dataset.
I have preprocessed the dataset so that input_ids and attention_mask are produced by the RoBERTa tokenizer and the labels by the GPT-2 tokenizer, with the padding positions in the labels replaced by -100.
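
For context, the model and the preprocessing look roughly like this (a sketch rather than the exact code: the max lengths, the use of only the first answer, and the pad-token handling for GPT-2 are assumptions):

from datasets import load_dataset
from transformers import EncoderDecoderModel, RobertaTokenizerFast, GPT2TokenizerFast

encoder_tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
decoder_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # gpt2 has no pad token by default

model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "gpt2")
model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id

def preprocess(example):
    # encoder side: question + context tokenized with the RoBERTa tokenizer
    enc = encoder_tokenizer(example["question"], example["context"],
                            truncation=True, padding="max_length", max_length=512)
    # decoder side: answer tokenized with the GPT-2 tokenizer
    dec = decoder_tokenizer(example["answers"]["text"][0],
                            truncation=True, padding="max_length", max_length=64)
    # replace padding positions with -100 so they are ignored by the loss
    enc["labels"] = [t if t != decoder_tokenizer.pad_token_id else -100
                     for t in dec["input_ids"]]
    return enc

squad = load_dataset("squad")
train_dataset = squad["train"].map(preprocess, remove_columns=squad["train"].column_names)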

The following is my training loop.

import torch
import numpy as np
from tqdm.notebook import tqdm

model.to(device)  # model, device, the dataloaders and compute_metrics are defined in earlier cells

epochs = 3

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(epochs):  # loop over the dataset multiple times
    print(f"------------ EPOCH:{epoch+1} ------------")

    # train + evaluate on training data
    losses = []
    k = 0  # number of batches skipped because the forward pass raised an error
    for i, batch in enumerate(tqdm(train_dataloader)):
        model.train()
        # get the inputs
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        # decoder_attention_mask = batch["decoder_attention_mask"].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        try:
            outputs = model(input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=labels)
                            # decoder_attention_mask=decoder_attention_mask)
        except Exception:
            # skip batches whose forward pass fails (note: this hides the underlying exception)
            k += 1
            continue
        loss = outputs.loss
        losses.append(loss.item())
        # if i % 50 == 0:
        print("\rLoss:", np.mean(losses), end='')
        loss.backward()
        optimizer.step()

    # evaluate (batch generation)
    model.eval()
    print('\nEVALUATING...')
    val_f1 = []
    for eval_batch in tqdm(val_dataloader):
        outputs = model.generate(eval_batch["input_ids"].to(device))
        # compute metrics
        metrics = compute_metrics(pred_ids=outputs, labels_ids=eval_batch["labels"])
        val_f1.append(metrics)

    print("\nVal F1:", np.mean(val_f1), "\nN Fails:", k)

It seems to work and the loss goes down for a few batches, until it fails with the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[42], line 20
     18 model.train()
     19 # get the inputs; 
---> 20 input_ids = batch["input_ids"].to(device)
     21 attention_mask = batch["attention_mask"].to(device)
     22 labels = batch["labels"].to(device)

RuntimeError: CUDA error: device-side assert triggered
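
(Since CUDA errors are reported asynchronously, the line shown in the traceback is not necessarily where the assert was actually triggered; re-running with CUDA_LAUNCH_BLOCKING=1 set before any CUDA work, or running on CPU, should surface the real failing operation. For example:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

)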

I had a look around, and this error usually seems to come from either (1) an inconsistency between the label/class ids and the number of output units, or (2) an incorrect input to the loss function. Neither makes sense to me, though, since training works for a while and the error is then raised on the line input_ids = batch["input_ids"].to(device).
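
If an out-of-range id really were the cause, a quick sanity check over the training batches should catch it. A sketch, assuming the model and dataloader defined above:

# check that every encoder input id and every non-ignored label id is inside
# the corresponding vocabulary; -100 positions are ignored by the loss
enc_vocab = model.config.encoder.vocab_size
dec_vocab = model.config.decoder.vocab_size

for batch in train_dataloader:
    input_ids, labels = batch["input_ids"], batch["labels"]
    assert int(input_ids.max()) < enc_vocab, "encoder input id out of range"
    valid = (labels == -100) | ((labels >= 0) & (labels < dec_vocab))
    assert bool(valid.all()), "label id out of range for the decoder vocabulary"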

Has anyone encountered the same issue?