Training a Chinese and English masked language modeling (MLM) BERT model
--num_train_epochs 500
--per_device_train_batch_size 8
--learning_rate 1e-4
--warmup_steps 5000
--max_seq_length 512
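In case it helps anyone landing here: a minimal sketch of how those flags would map onto `TrainingArguments` if you drive the Trainer from your own script instead of the `run_mlm.py` example. The output path below is a placeholder I made up, and note that `max_seq_length` is a flag of the example script's tokenization step, not a `TrainingArguments` field.

```python
# Minimal sketch: the posted flags expressed as TrainingArguments.
# "./mlm-bert-zh-en" is a placeholder output path, not from the post.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mlm-bert-zh-en",
    num_train_epochs=500,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    warmup_steps=5000,
    # max_seq_length has no TrainingArguments equivalent; in run_mlm.py it
    # controls tokenizer truncation, e.g. tokenizer(..., max_length=512).
)
```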
Hey! This is probably too late, but this is most likely due to a high learning rate. You could reduce the learning rate and/or use gradient clipping. The latter will not significantly change performance, but it will stop a bad mini-batch from derailing your model's training. Good luck!
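To make the two suggestions concrete, here is a hedged sketch. With the Trainer, clipping is controlled by `max_grad_norm` (it already defaults to 1.0); in a hand-written PyTorch loop you clip the gradient norm just before `optimizer.step()`. The tiny linear model and random batch below are only stand-ins so the snippet runs.

```python
# Sketch of a lower learning rate plus gradient-norm clipping in a manual
# PyTorch training step. The linear model and random batch are stand-ins.
import torch

model = torch.nn.Linear(512, 2)                              # stand-in for BERT
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)   # reduced from 1e-4

batch = torch.randn(8, 512)                                  # fake mini-batch
loss = model(batch).pow(2).mean()                            # stand-in loss
loss.backward()

# Clip the global gradient norm so one bad mini-batch cannot blow up the step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()

# With the Trainer instead, the equivalent knob is:
# TrainingArguments(..., learning_rate=5e-5, max_grad_norm=1.0)
```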