How can I reduce memory usage at inference time for models trained from scratch?

I am training language models from scratch on my own custom dataset/vocabulary for a scientific domain, using both the original BERT and DistilBERT architectures.

After training, I ran some experiments comparing the inference memory usage of my models against SciBERT, but SciBERT clearly outperforms mine, especially at larger batch_size values and longer input sequences.
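
For reference, this is roughly how I measure peak GPU memory at inference time (a minimal sketch; the sample text, sequence length, and batch sizes are just placeholders for my actual setup):

```python
import torch
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda")
model_name = "allenai/scibert_scivocab_uncased"  # or the path to my own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device).eval()

text = "A representative sentence from my scientific corpus."  # placeholder input
for batch_size in (1, 8, 32):
    inputs = tokenizer(
        [text] * batch_size,
        return_tensors="pt",
        padding="max_length",
        max_length=256,
        truncation=True,
    ).to(device)

    torch.cuda.reset_peak_memory_stats(device)  # start a fresh peak-memory measurement
    with torch.no_grad():
        model(**inputs)
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print(f"batch_size={batch_size}: peak {peak_mb:.1f} MB")
```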

DistilBERT actually has fewer layers than SciBERT, so I assumed it would use less memory, but that is not the case.
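
This is how I sanity-check the layer and parameter counts of the two models (the local checkpoint path is a placeholder for mine):

```python
from transformers import AutoConfig, AutoModel

for name in ("allenai/scibert_scivocab_uncased", "./my-distilbert-from-scratch"):
    config = AutoConfig.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # DistilBERT configs store the layer count as n_layers but also expose num_hidden_layers.
    n_layers = getattr(config, "num_hidden_layers", getattr(config, "n_layers", None))
    print(f"{name}: {n_layers} layers, {n_params / 1e6:.1f}M parameters")
```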

Can anyone please advise me on how to reduce the memory usage at inference time for a language model trained from scratch?