Low Accuracy When Fine-Tuning RoBERTa on a Custom Masked Token Prediction Dataset

I am attempting to fine-tune RoBERTa on a custom dataset for masked token prediction. The dataset uses a highly domain-specific vocabulary of 127 unique tokens and contains 73,000 tokens in total, distributed across 2,500 lines.

I preprocessed the data with a custom tokenizer and ran a grid search for hyperparameter optimization. Despite these efforts, the model's accuracy consistently plateaus between 50% and 60% on both the training and test sets.
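For context, my pipeline looks roughly like the sketch below. It is a simplified version using the Hugging Face tokenizers/transformers/datasets stack; the file name corpus.txt, the roberta-base starting checkpoint, and the hyperparameter values are illustrative placeholders rather than my exact settings:

```python
# Simplified shape of my setup (paths and hyperparameters are placeholders).
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import (
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# 1. Build a word-level tokenizer over the small domain vocabulary.
tok = Tokenizer(WordLevel(unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()
tok.train(files=["corpus.txt"], trainer=WordLevelTrainer(
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"]))
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="<s>", eos_token="</s>", pad_token="<pad>",
    unk_token="<unk>", mask_token="<mask>",
)

# 2. Tokenize the corpus, one sequence per line.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# 3. Fine-tune with the standard 15% masking objective.
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.resize_token_embeddings(len(tokenizer))  # match the custom vocab size
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=20,
                           per_device_train_batch_size=32,
                           learning_rate=5e-5),
    train_dataset=dataset["train"],
    data_collator=collator,
).train()
```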

What puzzles me most is that the model cannot reach satisfactory accuracy even on the training set itself. Could anyone offer insights into possible causes?

P.S.: I’ve experimented with alternative transformer models but ran into the same problem.

Since the custom tokenizer and the hyperparameter search haven’t moved the needle, and accuracy is stuck at 50–60% even on the training set, have you looked closely at the dataset itself? The data needs enough variation and complexity for the model to learn from; if it doesn’t, no amount of hyperparameter tuning will help.
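As a concrete first check, something along these lines shows how skewed the token distribution is. This is just a sketch; it assumes the corpus is a whitespace-tokenized text file, here called corpus.txt. With only 127 unique tokens, if a few of them carry most of the mass, a model can sit at 50–60% accuracy simply by favoring frequent tokens:

```python
# Quick diagnostic: how skewed is the token distribution?
from collections import Counter

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
print(f"{len(counts)} unique tokens, {total} total")

# If the head of the distribution is heavy, "guess a frequent token"
# already explains a 50-60% plateau under random masking.
top10 = counts.most_common(10)
for token, n in top10:
    print(f"{token!r}: {n} ({n / total:.1%})")
print(f"top-10 cumulative share: {sum(n for _, n in top10) / total:.1%}")
```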

Yes, I’ve put a lot of effort into enriching the dataset, deliberately including a wide variety of token sequences to capture the complexity of the data. That said, roughly 15% of the vocabulary is quite repetitive due to the specific nature of the domain.
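For what it’s worth, that repetitiveness can be quantified at the sequence level with a quick check like this (again a sketch, assuming one sequence per line in a file called corpus.txt):

```python
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# Share of line occurrences that belong to an exactly repeated sequence.
line_counts = Counter(lines)
dup_lines = sum(n for n in line_counts.values() if n > 1)
print(f"{len(lines)} lines, {len(line_counts)} unique; "
      f"{dup_lines / len(lines):.1%} of lines belong to a repeated sequence")

# Repeated trigrams give a finer-grained view of local repetitiveness.
trigrams = Counter()
for line in lines:
    toks = line.split()
    trigrams.update(zip(toks, toks[1:], toks[2:]))
total = sum(trigrams.values())
if total:
    repeats = sum(n for n in trigrams.values() if n > 1)
    print(f"{repeats / total:.1%} of trigram occurrences are repeats")
```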