Loss is "nan" when fine-tuning NLI model (both RoBERTa/BART)

I’m trying to fine-tune ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli on a dataset of around 276,000 hypothesis-premise pairs. I’m following the instructions from the docs here and here. The fine-tuning seems to work (it runs the training and saves the checkpoints), but trainer.train() and trainer.evaluate() return “nan” for the loss.

What I’ve tried:

  • I tried both ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli and facebook/bart-large-mnli to make sure the problem is not linked to a specific model, but I get the issue with both models.
  • I tried following the advice in this related GitHub issue, but adding num_labels=3 to the config file does not solve the issue. (I think my issue is different because, in my case, the models are already fine-tuned on NLI.)
  • I tried changing the class XDataset(torch.utils.data.Dataset) (which I mostly copied from the docs), because I suspected that there could be an issue with my input data, but I couldn’t solve it that way either.
    => Does anyone know where this issue comes from? See my code below.

Thanks a lot in advance for any suggestion!

Here is my code:

### load model & tokenize
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

max_length = 256
hg_model_hub_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
# also tried: hg_model_hub_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(hg_model_hub_name)
model = AutoModelForSequenceClassification.from_pretrained(hg_model_hub_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
  model = model.half()

#... some data preprocessing

encodings_train = tokenizer(premise_train, hypothesis_train, return_tensors="pt", max_length=max_length,
                            return_token_type_ids=True, truncation=False, padding=True)
encodings_val = tokenizer(premise_val, hypothesis_val, return_tensors="pt", max_length=max_length,
                          return_token_type_ids=True, truncation=False, padding=True)
encodings_test = tokenizer(premise_test, hypothesis_test, return_tensors="pt", max_length=max_length,
                           return_token_type_ids=True, truncation=False, padding=True)

### create pytorch dataset object
class XDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        #item = {key: torch.as_tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        #item['labels'] = self.labels[idx]
        return item
    def __len__(self):
        return len(self.labels)

dataset_train = XDataset(encodings_train, label_train)
dataset_val = XDataset(encodings_val, label_val)
dataset_test = XDataset(encodings_test, label_test)

## training
from transformers import Trainer, TrainingArguments

# https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val             # evaluation dataset
)

trainer.train()
# output: TrainOutput(global_step=181, training_loss=nan)
trainer.evaluate()
# output: {'epoch': 1.0, 'eval_loss': nan}
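
For anyone who wants to rule out the input data first (the suspicion mentioned in the list above), here is a minimal sketch of a sanity check on the labels and texts. The helper names `find_bad_labels` and `find_empty_texts` and the toy values are hypothetical and only for illustration; the sketch assumes the labels are plain Python ints in the range 0 to 2 (NumPy integers would need `int(label)` first) and that the premises and hypotheses are lists of strings.

```python
def find_bad_labels(labels, num_labels=3):
    """Return (index, label) pairs for labels that are not ints in [0, num_labels)."""
    bad = []
    for i, label in enumerate(labels):
        # bool is a subclass of int, so exclude it explicitly
        is_int = isinstance(label, int) and not isinstance(label, bool)
        if not is_int or not 0 <= label < num_labels:
            bad.append((i, label))
    return bad


def find_empty_texts(texts):
    """Return indices of entries that are not non-empty strings (None, NaN, '')."""
    return [i for i, t in enumerate(texts)
            if not isinstance(t, str) or not t.strip()]


# Toy data standing in for label_train / premise_train (hypothetical values)
toy_labels = [0, 2, 1, float("nan"), 3, None]
toy_premises = ["A man sleeps.", "", None, "Two dogs run.", float("nan"), "ok"]

print(find_bad_labels(toy_labels))     # flags indices 3, 4 and 5
print(find_empty_texts(toy_premises))  # flags indices 1, 2 and 4
```

If both helpers return empty lists on the real label_train, premise_train and hypothesis_train, the labels and texts are probably not the cause, and the NaN would have to come from somewhere else in the setup.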

Update: I spent several hours trying to solve this and opened a GitHub issue with a detailed description of the problem here: https://github.com/huggingface/transformers/issues/9160