Pretraining T5 from scratch using MLM

Hello everyone, I am new to pretraining language models and have a question about pretraining T5 with MLM. I used around 700 thousand tokens (roughly 20,000 rows of the wikipedia.simple dataset) to pretrain t5-small with an MLM objective, but my loss never drops below 4.5, even after 15 epochs. Inference is also poor when I try to predict a masked token. Can someone please help me figure out what the issue is? It would mean a lot.
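For reference, my understanding is that T5 is pretrained with span corruption rather than BERT-style single-token MLM: contiguous spans are replaced with sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) in the input, and the target lists each sentinel followed by the dropped span. A rough sketch of that format (the 15% corruption rate and mean span length of 3 are the T5 paper defaults, not necessarily what my script uses):

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Return (input_tokens, target_tokens) in T5 span-corruption format."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))

    # Randomly mark contiguous spans for masking until enough tokens are covered.
    masked = [False] * len(tokens)
    n_masked = 0
    while n_masked < n_to_mask:
        span_len = max(1, int(rng.gauss(mean_span_len, 1)))
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            if not masked[i]:
                masked[i] = True
                n_masked += 1

    # Replace each masked span with one sentinel in the input;
    # the target is the sentinels followed by the dropped tokens.
    inp, tgt, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if masked[i]:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and masked[i]:
                tgt.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

example = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(example)
print(inp)
print(tgt)
```

If the data collator instead feeds T5 single `[MASK]`-style targets, that mismatch with the pretraining format could be one reason the loss stays high.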

My code is provided below.

Most of the code is adapted from the transformers pretraining examples.