ModernBERT MaskedLM NaN training loss

I have been trying to run pre-training on a FineWeb subset with ModernBERT…

First, I tokenize my dataset:

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("answerdotai/ModernBERT-base")

def tokenize_function(examples):
    return hf_tokenizer(examples["text"], truncation=True)

tokenized_dataset = ds_select.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
)

Then, I initialize a ModernBERT model:

from transformers import ModernBertConfig, ModernBertForMaskedLM

bert_config = ModernBertConfig(
    global_rope_theta=10000,
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
    cls_token_id=hf_tokenizer.cls_token_id,
    sep_token_id=hf_tokenizer.sep_token_id,
)
model = ModernBertForMaskedLM(bert_config)

I set up a DataCollator with the recommended mlm_probability:

from transformers import DataCollatorForLanguageModeling

# 30% masking, the rate recommended for ModernBERT pre-training.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.3
)

and start the training:

# LoggingTrainer is a custom Trainer subclass that logs per-sample losses
# and the offending inputs (see output below); training_args is defined elsewhere.
trainer = LoggingTrainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"].shuffle(),
    eval_dataset=split_datasets["test"].shuffle(),
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()

Right from the first batch I get a NaN loss:

Loss:  tensor([10.8572,     nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281,   510,  6146,  ...,  7355, 50284, 50282],
        [50281,   510, 34461,  ..., 50283, 50283, 50283]], device='cuda:0')
attention_mask: tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')
labels: tensor([[-100, -100, -100,  ..., -100,   15, -100],
        [-100, -100, -100,  ..., -100, -100, -100]], device='cuda:0')
Loss:  tensor([nan, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 25897,    13,  ..., 50283, 50283, 50283],
        [50281,   510,   941,  ..., 50284,    15, 50282]], device='cuda:0')
attention_mask: tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0')
labels: tensor([[-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., 2774, -100, -100]], device='cuda:0')

Notice how the labels don’t seem to be aligned (50284 vs. 15)? What am I doing wrong here? I have done pre-training with other models using the transformers library and haven’t run into this kind of problem before. I played around with different optimizer parameters but got the same outcome. I would be grateful for any guidance.

There seems to be a phenomenon where NaN losses occur with fp16, but it is unclear whether this is related to your issue.
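
If you want to rule that out, it’s worth double-checking that mixed precision is disabled in your training arguments. A minimal sketch, assuming a standard TrainingArguments object (your actual training_args may differ):

from transformers import TrainingArguments

# Sketch only: keep everything in fp32 to rule out precision-related NaNs.
# On Ampere or newer GPUs, bf16=True is a common alternative if fp32 is too slow.
training_args = TrainingArguments(
    output_dir="modernbert-mlm",  # hypothetical output directory
    fp16=False,
    bf16=False,
)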

Thanks for the note, but I’m running this with fp32 right now.

What does token 50283 correspond to? I don’t know for sure, but maybe the padding isn’t working as expected? Having three instances of 50283 in a row looks suspicious.

50283 is [PAD]. I’m not sure whether the ModernBERT implementation is complete yet (a collator that supports dynamic padding, global/local attention, etc.); I haven’t seen a successful pre-training script so far.
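
For what it’s worth, you can check what those IDs map to directly from the tokenizer; a quick sanity check using the hf_tokenizer from the original post:

# Map the suspicious IDs back to their token strings.
print(hf_tokenizer.convert_ids_to_tokens([50281, 50282, 50283, 50284]))
# Compare against the IDs the tokenizer itself reports for its special tokens;
# 50283 should come back as the [PAD] id if padding is configured as expected.
print(hf_tokenizer.cls_token_id, hf_tokenizer.sep_token_id,
      hf_tokenizer.pad_token_id, hf_tokenizer.mask_token_id)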

Try manually extracting samples from your dataset, detokenizing them with the tokenizer, and inspecting each (token, string, label) tuple to see if it matches what you expect. If you can identify the faulty inputs, you’ll have something to go on.
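
Something along these lines, reusing split_datasets, hf_tokenizer, and data_collator from the original post (a rough sketch, untested against your exact setup):

# Take a couple of tokenized samples, run them through the MLM collator,
# and print (token id, decoded string, label) for every position.
features = [
    {k: split_datasets["train"][i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(features)  # applies dynamic padding and random masking

for ids, labels in zip(batch["input_ids"], batch["labels"]):
    for tok_id, label in zip(ids.tolist(), labels.tolist()):
        print(tok_id, repr(hf_tokenizer.decode([tok_id])), label)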

I used a similar approach to train a tiny model, but I also trained my own tokenizer, and training completed successfully. The only other difference is that I used the Trainer class directly.
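
Roughly what I mean by a tiny model, reusing the objects from the original post; the sizes below are illustrative, not the exact configuration I used:

from transformers import ModernBertConfig, ModernBertForMaskedLM, Trainer

# Illustrative small configuration (hidden_size must stay divisible by the head count).
tiny_config = ModernBertConfig(
    vocab_size=len(hf_tokenizer),  # match this to whatever tokenizer you train
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    pad_token_id=hf_tokenizer.pad_token_id,
)
tiny_model = ModernBertForMaskedLM(tiny_config)

# Plain Trainer instead of the custom LoggingTrainer from the original post.
trainer = Trainer(
    model=tiny_model,
    args=training_args,
    train_dataset=split_datasets["train"],
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()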
