I am training a small RoBERTa model (6 layers) with a BPE tokenizer on a domain-specific corpus (not a large one) for MLM, and it seems to do a good job of filling in masked words at test time. The loss in TensorBoard also looks much better than that of my BERT model (12 layers) with a WordPiece tokenizer.
When I fine-tune this small RoBERTa model for QA (on SQuAD 2.0), the predictions are almost always empty: it returns a blank answer for nearly every question. I followed https://huggingface.co/blog/how-to-train, but QA is not doing well. Any recommendations for getting the smaller RoBERTa model to fine-tune well on SQuAD? Or should I go with bigger models (12-24 layers)? Is there a prune-while-training method available in Transformers? I would appreciate any ideas… Thank you!
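
For context on the blank answers: SQuAD 2.0 post-processing usually returns an empty string when the no-answer (null) score beats the best span score by more than a threshold. Below is a minimal sketch of that decision rule in plain Python. The function name `pick_answer` and its parameters are hypothetical and chosen for illustration; this is not the actual Transformers code.

```python
def pick_answer(null_score, best_span_score, best_span_text, null_threshold=0.0):
    """Return "" when the null score exceeds the best span score by more than the threshold.

    All names here are illustrative assumptions, not a library API.
    """
    # Positive difference means the model prefers "no answer" over its best span.
    score_diff = null_score - best_span_score
    if score_diff > null_threshold:
        return ""
    return best_span_text


# The null score wins, so the prediction is blank.
print(repr(pick_answer(5.0, 1.0, "Paris")))
# The span score wins, so the span text is returned.
print(repr(pick_answer(1.0, 5.0, "Paris")))
```

Under this rule, getting a blank for nearly every question means the null score is winning almost everywhere, so comparing the two scores on a few examples may help narrow down the cause.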