BertForMaskedLM training from scratch not converging

Hello,

I am a researcher at ETH Zürich trying to use BertForMaskedLM for chemistry, an approach that has already been published as the “molecular transformer”.

Unfortunately, the training does not seem to converge. Each training epoch finishes in only a few seconds, whereas the same training with the same hyperparameters takes approximately 24 h using SimpleTransformers (see Here and Here).

What I am doing right now:

My train and eval datasets are already tokenized using the encode_plus method of a slightly modified BertTokenizer:

>>> eval_dataset.__getitem__(0)
{'input_ids': tensor([ 3, 25,  7, ...       0]), 'token_type_ids': tensor([0, 0, 0, 0, ...       0]), 'attention_mask': tensor([1, 1, 1, 1, ...       0])}
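For reference, here is roughly how these datasets are built. This is a minimal sketch; the class name SmilesDataset, the max_length of 128, and the padding/truncation settings are placeholders for illustration, not the exact code I run:

import torch
from torch.utils.data import Dataset

class SmilesDataset(Dataset):
    """Wraps a list of SMILES strings as tokenized tensors."""

    def __init__(self, smiles_list, tokenizer, max_length=128):
        self.examples = [
            tokenizer.encode_plus(
                s,
                max_length=max_length,
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )
            for s in smiles_list
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        # squeeze the batch dimension added by return_tensors="pt"
        return {k: v.squeeze(0) for k, v in self.examples[idx].items()}

The training code itself looks like this: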
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=self.tokenizer,        # the slightly modified BertTokenizer
    mlm=True,
    mlm_probability=self.mask_prob,  # masking probability for MLM
)



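Since the tokenizer is modified, one thing I can sanity-check is that the mask token is defined and that the collator produces non-trivial labels; if mask_token is missing or mapped to the wrong id, the MLM loss would be computed on garbage. A quick diagnostic (not part of the training code, and assuming tokenizer and train_dataset are available as above):

# The collator needs a valid [MASK] token to build MLM labels
print(tokenizer.mask_token, tokenizer.mask_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# Inspect one collated batch: labels should be -100 everywhere
# except at the positions that were masked (about mask_prob of them)
batch = data_collator([train_dataset[i] for i in range(4)])
print(batch["input_ids"].shape)
print((batch["labels"] != -100).float().mean())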
training_args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    num_train_epochs=50,
    report_to="wandb",
    learning_rate=0.00005,
)

bert_config = BertConfig(
    vocab_size=120,  # this should be correct for our vocabulary
    num_attention_heads=4,
    hidden_size=256,
    intermediate_size=512,
)

model = BertForMaskedLM(bert_config)
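If the config's vocab_size does not match the tokenizer, training can silently misbehave (or crash on out-of-range ids), so it may be worth asserting that they agree. A small check, again assuming the tokenizer object from above:

# The embedding table must cover every id the tokenizer can emit
assert bert_config.vocab_size >= len(tokenizer), (
    f"config vocab_size={bert_config.vocab_size} < tokenizer size={len(tokenizer)}"
)
print(model.num_parameters())  # rough sanity check on model size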

# SmilesTrainer is our custom subclass of the Trainer imported above
trainer = SmilesTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
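Regarding the epochs that finish in a few seconds: with the default per_device_train_batch_size of 8, the number of optimizer steps per epoch should be roughly len(train_dataset) / (8 * number of GPUs), so one thing I can compare against the Trainer's logged step count is the dataset length itself (dataset and argument names as above):

# If __len__ is wrong (e.g. returns the number of files instead of
# the number of examples), each "epoch" is only a handful of steps
print("train examples:", len(train_dataset))
print("eval examples:", len(eval_dataset))
steps_per_epoch = len(train_dataset) // training_args.per_device_train_batch_size
print("approx. optimizer steps per epoch:", steps_per_epoch)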

Resulting eval metrics:

Does anybody know why it does not converge?

Best,

G. Sulpizio.