I have been trying to run ModernBERT pre-training on a FineWeb subset…
First, I tokenize my dataset:
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("answerdotai/ModernBERT-base")

def tokenize_function(examples):
    return hf_tokenizer(examples["text"], truncation=True)
tokenized_dataset = ds_select.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
)
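For reference, this is how I spot-check a single tokenized example (just a diagnostic sketch, not part of the training script):

# Inspect one tokenized example: length after truncation and the special
# tokens at the start and end of the sequence
sample = tokenized_dataset[0]
print(len(sample["input_ids"]))
print(hf_tokenizer.convert_ids_to_tokens(sample["input_ids"][:3]))
print(hf_tokenizer.convert_ids_to_tokens(sample["input_ids"][-3:]))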
Then, I initialize a ModernBERT model:
from transformers import ModernBertConfig, ModernBertForMaskedLM

bert_config = ModernBertConfig(
    global_rope_theta=10000,
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
    cls_token_id=hf_tokenizer.cls_token_id,
    sep_token_id=hf_tokenizer.sep_token_id,
)
model = ModernBertForMaskedLM(bert_config)
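Everything else in the config stays at its defaults. A quick way to compare those defaults against the tokenizer (again a sketch for context, not part of the training run):

# Compare the config defaults with the tokenizer's vocabulary and special tokens
print(bert_config.vocab_size)          # default vocab size from ModernBertConfig
print(len(hf_tokenizer))               # tokenizer vocabulary incl. special tokens
print(hf_tokenizer.mask_token_id, hf_tokenizer.pad_token_id)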
I set up a DataCollator with the recommended mlm_probability:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.3
)
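A single collated batch can be inspected like this (just a sketch; I keep only the token-level fields, since the collator cannot tensorize the raw text column):

# Build one small batch by hand and look at what the collator produces:
# roughly 30% of the non-special tokens should be masked, and labels should
# be -100 everywhere the loss is supposed to be ignored.
features = [
    {k: tokenized_dataset[i][k] for k in ("input_ids", "attention_mask")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
print((batch["labels"] != -100).sum())   # positions that actually contribute to the loss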
and start the training:
trainer = LoggingTrainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"].shuffle(),
    eval_dataset=split_datasets["test"].shuffle(),
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()
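LoggingTrainer is just a thin Trainer subclass I use to log the per-sample loss and dump the batch when something non-finite shows up; roughly, it does the following (simplified sketch):

import torch
from transformers import Trainer

class LoggingTrainer(Trainer):
    # Simplified sketch: compute a per-sample masked-LM loss from the logits
    # and print the offending batch whenever a loss value is non-finite.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)
        logits = outputs.logits
        labels = inputs["labels"]
        # CrossEntropyLoss ignores label -100 by default (ignore_index=-100)
        loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
        per_token = loss_fct(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        ).reshape(labels.size())
        mask = (labels != -100).float()
        per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        if not torch.isfinite(per_sample).all():
            print("Loss:", per_sample)
            print("Faulty inputs detected:")
            for key in ("input_ids", "attention_mask", "labels"):
                print(f"{key}: {inputs[key]}")
        loss = per_sample.mean()
        return (loss, outputs) if return_outputs else loss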
Right from the first batch I get a nan loss:
Loss: tensor([10.8572, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 510, 6146, ..., 7355, 50284, 50282],
[50281, 510, 34461, ..., 50283, 50283, 50283]], device='cuda:0')
attention_mask: tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]], device='cuda:0')
labels: tensor([[-100, -100, -100, ..., -100, 15, -100],
[-100, -100, -100, ..., -100, -100, -100]], device='cuda:0')
Loss: tensor([nan, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 25897, 13, ..., 50283, 50283, 50283],
[50281, 510, 941, ..., 50284, 15, 50282]], device='cuda:0')
attention_mask: tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:0')
labels: tensor([[-100, -100, -100, ..., -100, -100, -100],
[-100, -100, -100, ..., 2774, -100, -100]], device='cuda:0')
Notice how the labels don’t seem to be aligned (50284 vs. 15)? What am I doing wrong here? I have done pre-training with other models using the transformers library and haven’t run into this kind of problem before. I played around with different optimizer parameters but got the same outcome. I would be grateful for any guidance.