Smaller RoBERTa model

zb1 · July 9, 2020, 9:25pm

Hello,
I am training a smaller RoBERTa model (6 layers) with a BPE tokenizer on a domain-specific corpus (not a large one) for MLM, and it seems to do a good job of filling in masked words at test time. Also, the loss in TensorBoard looks much better than that of my BERT model (12 layers) with a WordPiece tokenizer.

When I fine-tune this small RoBERTa model for QA (on SQuAD 2.0), the predictions are almost always empty: it returns a blank answer nearly every time. I followed https://huggingface.co/blog/how-to-train, but QA does not seem to work well. Any recommendations for getting the smaller RoBERTa model to fine-tune on SQuAD? Or should I go with bigger models (12-24 layers)? Is there a prune-while-training method available in Transformers? I would appreciate any ideas. Thank you!

valhalla · July 10, 2020, 5:22am

Hi @zb1, the official examples include RoBERTa distillation using the DistilBERT method. You can find it here:
https://github.com/huggingface/transformers/tree/master/examples/distillation
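
For anyone unfamiliar with distillation, here is a minimal sketch of the general soft-target idea: the small student model is trained to match the temperature-softened output distribution of a larger teacher, alongside the usual hard-label loss. This is not the code from the linked example (which works on tensors and combines further loss terms); the function names, the `alpha` weighting, and the default temperature are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by temperature**2 so gradient magnitudes stay comparable
    across temperatures, as is common in knowledge distillation.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2


def hard_target_loss(student_logits, target_index):
    """Standard cross-entropy against the true label (e.g. the masked token)."""
    return -log_softmax(student_logits)[target_index]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Weighted mix of soft-target and hard-target losses.

    alpha is a hypothetical mixing weight, not a value taken from the
    linked example.
    """
    soft = soft_target_loss(student_logits, teacher_logits, temperature)
    hard = hard_target_loss(student_logits, target_index)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # teacher logits over a toy 3-token vocabulary
    student = [2.0, 1.5, 0.2]   # student logits for the same position
    print(distillation_loss(student, teacher, target_index=0))
```

The soft-target term is zero when the student reproduces the teacher's distribution exactly, so a 6-layer student can benefit from a stronger 12-layer teacher even when the domain corpus is small.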
