ModernBERT MaskedLM NaN training loss

I have been trying to run pre-training on a FineWeb subset with ModernBERT…

First, I tokenize my dataset:

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("answerdotai/ModernBERT-base")

def tokenize_function(examples):
    return hf_tokenizer(examples["text"], truncation=True)

tokenized_dataset = ds_select.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
)

Then, I initialize a ModernBERT model:

from transformers import ModernBertConfig, ModernBertForMaskedLM

bert_config = ModernBertConfig(
    global_rope_theta=10000,
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
    cls_token_id=hf_tokenizer.cls_token_id,
    sep_token_id=hf_tokenizer.sep_token_id,
)
model = ModernBertForMaskedLM(bert_config)

I set up a DataCollator with the recommended mlm_probability:

from transformers import DataCollatorForLanguageModeling

# 30% masking, the rate recommended for ModernBERT pre-training.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.3
)

and start the training:

# LoggingTrainer is a custom Trainer subclass that logs per-sample losses
# and the offending inputs (see output below); training_args is defined elsewhere.
trainer = LoggingTrainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"].shuffle(),
    eval_dataset=split_datasets["test"].shuffle(),
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()

Right from the first batch I get a NaN loss:

Loss:  tensor([10.8572,     nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281,   510,  6146,  ...,  7355, 50284, 50282],
        [50281,   510, 34461,  ..., 50283, 50283, 50283]], device='cuda:0')
attention_mask: tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')
labels: tensor([[-100, -100, -100,  ..., -100,   15, -100],
        [-100, -100, -100,  ..., -100, -100, -100]], device='cuda:0')
Loss:  tensor([nan, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 25897,    13,  ..., 50283, 50283, 50283],
        [50281,   510,   941,  ..., 50284,    15, 50282]], device='cuda:0')
attention_mask: tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')
labels: tensor([[-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., 2774, -100, -100]], device='cuda:0')

Notice how the labels don’t seem to be aligned (50284 vs. 15)? What am I doing wrong here? I have done pre-training with other models using the transformers library and haven’t run into this kind of problem before. I played around with different optimizer parameters but got the same outcome. I would be grateful for any guidance.

There seems to be a phenomenon where NaN losses occur with fp16, but it is unclear whether this is related to your issue.
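
If you want to rule that out, it’s worth double-checking that mixed precision is disabled in your training arguments. A minimal sketch, assuming a standard TrainingArguments object (your actual training_args may differ):

from transformers import TrainingArguments

# Sketch only: keep everything in fp32 to rule out precision-related NaNs.
# On Ampere or newer GPUs, bf16=True is a common alternative if fp32 is too slow.
training_args = TrainingArguments(
    output_dir="modernbert-mlm",  # hypothetical output directory
    fp16=False,
    bf16=False,
)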

Thanks for the note, but I’m running this with fp32 right now.

What does token 50283 correspond to? I don’t know for sure, but maybe the padding isn’t working as expected? Having three instances of 50283 in a row looks suspicious.

50283 is [PAD]. I’m not sure whether the ModernBERT implementation is complete yet (a collator that supports dynamic padding, global/local attention, etc.); I haven’t seen a successful pre-training script so far.
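
For what it’s worth, you can check what those IDs map to directly from the tokenizer; a quick sanity check using the hf_tokenizer from the original post:

# Map the suspicious IDs back to their token strings.
print(hf_tokenizer.convert_ids_to_tokens([50281, 50282, 50283, 50284]))
# Compare against the IDs the tokenizer itself reports for its special tokens;
# 50283 should come back as the [PAD] id if padding is configured as expected.
print(hf_tokenizer.cls_token_id, hf_tokenizer.sep_token_id,
      hf_tokenizer.pad_token_id, hf_tokenizer.mask_token_id)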

Try manually extracting samples from your dataset, detokenizing them with the tokenizer, and inspecting each (token, string, label) tuple to see if it matches what you expect. If you can identify the faulty inputs, you’ll have something to go on.
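
Something along these lines, reusing split_datasets, hf_tokenizer, and data_collator from the original post (a rough sketch, untested against your exact setup):

# Take a couple of tokenized samples, run them through the MLM collator,
# and print (token id, decoded string, label) for every position.
features = [
    {k: split_datasets["train"][i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(features)  # applies dynamic padding and random masking

for ids, labels in zip(batch["input_ids"], batch["labels"]):
    for tok_id, label in zip(ids.tolist(), labels.tolist()):
        print(tok_id, repr(hf_tokenizer.decode([tok_id])), label)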

I used a similar approach to train a tiny model, but I also trained my own tokenizer, and training completed successfully. The only other difference is that I used the Trainer class directly.
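
Roughly what I mean by a tiny model, reusing the objects from the original post; the sizes below are illustrative, not the exact configuration I used:

from transformers import ModernBertConfig, ModernBertForMaskedLM, Trainer

# Illustrative small configuration (hidden_size must stay divisible by the head count).
tiny_config = ModernBertConfig(
    vocab_size=len(hf_tokenizer),  # match this to whatever tokenizer you train
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    pad_token_id=hf_tokenizer.pad_token_id,
)
tiny_model = ModernBertForMaskedLM(tiny_config)

# Plain Trainer instead of the custom LoggingTrainer from the original post.
trainer = Trainer(
    model=tiny_model,
    args=training_args,
    train_dataset=split_datasets["train"],
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()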
