I am trying to fine-tune an XLM-RoBERTa model for masked language modeling on a dataset of 9000 lemmatized sentences. I am using the XLMRobertaForMaskedLM
class from the Hugging Face Transformers library and training with a batch size of 8. However, even after training for 3 epochs, the model performs poorly even on the training data. As a sanity check, I tried training the model on a single sentence for 50 epochs: the loss decreases to around 10^-10, yet the model still doesn't predict the masked tokens correctly. I'm a beginner, so please help me understand why the model cannot learn to predict even a single sentence correctly after 50 training epochs.
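For reference, here is roughly what my single-sentence test looks like. The sentence, the mask position, and the learning rate below are placeholders for illustration, not my exact values, and I am loading the standard xlm-roberta-base checkpoint:

```python
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaForMaskedLM

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

sentence = "The quick brown fox jumps over the lazy dog."  # placeholder sentence
inputs = tokenizer(sentence, return_tensors="pt")

# Keep the original token ids as labels, then mask one position in the input.
labels = inputs["input_ids"].clone()
masked_input_ids = inputs["input_ids"].clone()
mask_pos = 4  # arbitrary position inside the sentence, for illustration
masked_input_ids[0, mask_pos] = tokenizer.mask_token_id

# Set labels to -100 everywhere except the masked position, so the loss is
# computed only on the token the model has to fill in.
labels[masked_input_ids != tokenizer.mask_token_id] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(50):
    outputs = model(
        input_ids=masked_input_ids,
        attention_mask=inputs["attention_mask"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After training, check what the model predicts at the masked position.
model.eval()
with torch.no_grad():
    logits = model(
        input_ids=masked_input_ids,
        attention_mask=inputs["attention_mask"],
    ).logits
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode(predicted_id))
```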