Pretraining T5 from scratch using MLM

Hello everyone, I am new to pretraining language models and have a question about pretraining T5 with MLM. I used around 700 thousand tokens (roughly 20,000 rows of the wikipedia.simple dataset) to pretrain t5-small with an MLM objective, but my loss never drops below 4.5, even after 15 epochs. Inference is also poor when I try to predict a masked token. Can someone please help me figure out what the issue is? It would mean a lot.
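For reference, my understanding is that T5 is pretrained with span corruption rather than BERT-style single-token MLM: contiguous spans are replaced with sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) in the input, and the target lists each sentinel followed by the dropped span. A rough sketch of that format (the 15% corruption rate and mean span length of 3 are the T5 paper defaults, not necessarily what my script uses):

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Return (input_tokens, target_tokens) in T5 span-corruption format."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))

    # Randomly mark contiguous spans for masking until enough tokens are covered.
    masked = [False] * len(tokens)
    n_masked = 0
    while n_masked < n_to_mask:
        span_len = max(1, int(rng.gauss(mean_span_len, 1)))
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            if not masked[i]:
                masked[i] = True
                n_masked += 1

    # Replace each masked span with one sentinel in the input;
    # the target is the sentinels followed by the dropped tokens.
    inp, tgt, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if masked[i]:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and masked[i]:
                tgt.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

example = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(example)
print(inp)
print(tgt)
```

If the data collator instead feeds T5 single `[MASK]`-style targets, that mismatch with the pretraining format could be one reason the loss stays high.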

My code is provided below.

Most of the code is adapted from the transformers pretraining examples.