Low Accuracy When Fine-Tuning RoBERTa on a Custom Masked Token Prediction Dataset

I am attempting to fine-tune RoBERTa on a custom dataset for masked token prediction. The dataset uses a highly domain-specific vocabulary of 127 unique tokens and contains 73,000 tokens in total, distributed across 2,500 lines.

I preprocessed the data with a custom tokenizer and ran a grid search for hyperparameter optimization. Despite these efforts, the model's accuracy consistently plateaus between 50% and 60% on both the training and test sets.
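For context, my pipeline looks roughly like the sketch below. It is a simplified version using the Hugging Face tokenizers/transformers/datasets stack; the file name corpus.txt, the roberta-base starting checkpoint, and the hyperparameter values are illustrative placeholders rather than my exact settings:

```python
# Simplified shape of my setup (paths and hyperparameters are placeholders).
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import (
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 1. Build a word-level tokenizer over the small domain vocabulary.
tok = Tokenizer(WordLevel(unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()
tok.train(files=["corpus.txt"], trainer=WordLevelTrainer(
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"]))
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="<s>", eos_token="</s>", pad_token="<pad>",
    unk_token="<unk>", mask_token="<mask>",
)

# 2. Tokenize the corpus, one sequence per line.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# 3. Fine-tune with the standard 15% masking objective.
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.resize_token_embeddings(len(tokenizer))  # match the custom vocab size
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=20,
                           per_device_train_batch_size=32,
                           learning_rate=5e-5),
    train_dataset=dataset["train"],
    data_collator=collator,
).train()
```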

What puzzles me most is that the model cannot reach satisfactory accuracy even on the training set itself. Could anyone offer insights into possible causes?

P.S.: I’ve experimented with alternative transformer models but ran into the same problem.

Since the custom tokenizer and the hyperparameter search haven’t moved the needle, and accuracy is stuck at 50–60% even on the training set, have you looked closely at the dataset itself? The data needs enough variation and complexity for the model to learn from; if it doesn’t, no amount of hyperparameter tuning will help.
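As a concrete first check, something along these lines shows how skewed the token distribution is. This is just a sketch; it assumes the corpus is a whitespace-tokenized text file, here called corpus.txt. With only 127 unique tokens, if a few of them carry most of the mass, a model can sit at 50–60% accuracy simply by favoring frequent tokens:

```python
# Quick diagnostic: how skewed is the token distribution?
from collections import Counter

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
print(f"{len(counts)} unique tokens, {total} total")

# If the head of the distribution is heavy, "guess a frequent token"
# already explains a 50-60% plateau under random masking.
top10 = counts.most_common(10)
for token, n in top10:
    print(f"{token!r}: {n} ({n / total:.1%})")
print(f"top-10 cumulative share: {sum(n for _, n in top10) / total:.1%}")
```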

Yes, I’ve put a lot of effort into enriching the dataset, deliberately including a wide variety of token sequences to capture the complexity of the data. That said, roughly 15% of the vocabulary is quite repetitive due to the specific nature of the domain.
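For what it’s worth, that repetitiveness can be quantified at the sequence level with a quick check like this (again a sketch, assuming one sequence per line in a file called corpus.txt):

```python
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# Share of line occurrences that belong to an exactly repeated sequence.
line_counts = Counter(lines)
dup_lines = sum(n for n in line_counts.values() if n > 1)
print(f"{len(lines)} lines, {len(line_counts)} unique; "
      f"{dup_lines / len(lines):.1%} of lines belong to a repeated sequence")

# Repeated trigrams give a finer-grained view of local repetitiveness.
trigrams = Counter()
for line in lines:
    toks = line.split()
    trigrams.update(zip(toks, toks[1:], toks[2:]))
total = sum(trigrams.values())
if total:
    repeats = sum(n for n in trigrams.values() if n > 1)
    print(f"{repeats / total:.1%} of trigram occurrences are repeats")
```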