When we use Trainer
to build a language model with MLM, depending on which model we use (say DistilBERT), do we use the pre-trained weights in Trainer,
or are the weights supposed to be trained from scratch?
You can do either – it depends on how you create your model. Trainer just handles the training aspect, not the model initialization.
from transformers import AutoConfig, AutoModelForMaskedLM, Trainer

# Model randomly initialized (starting from scratch)
config = AutoConfig.for_model("distilbert")
# Update the config if you'd like
# config.update({"param": value})
model = AutoModelForMaskedLM.from_config(config)

# Model from a pre-trained checkpoint
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

# Put the model in Trainer
trainer = Trainer(model=model)
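To make the distinction concrete, here's a quick sketch (assuming transformers and torch are installed) showing that from_config starts from fresh random weights, since two models built from the same config end up with different parameters:

```python
import torch
from transformers import AutoConfig, AutoModelForMaskedLM

# Build two models from the same config: from_config draws fresh
# random weights each time, so their parameters should differ.
config = AutoConfig.for_model("distilbert")
m1 = AutoModelForMaskedLM.from_config(config)
m2 = AutoModelForMaskedLM.from_config(config)

p1 = next(m1.parameters())
p2 = next(m2.parameters())
print(torch.equal(p1, p2))  # two independent random draws
```

With from_pretrained, by contrast, two loads of the same checkpoint would give identical weights.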
Unless you have a huge amount of data that is very different from what the pre-trained models were trained on, I wouldn’t recommend starting from scratch.
Start from scratch when you are creating a model for a niche domain like a low-resource language.
Start from a pre-trained model if your text is in a high-resource language (like English) but the jargon might be very specific (like scientific texts). There are enough fundamental similarities that you’ll save compute and time by starting from a pre-trained model.
True, I was going to do sentiment analysis over some text data, but whatever model I tested overfitted, and I did not get good results on the validation data. So I decided to train a DistilBERT model on my own data, but I do not know whether the model starts training with pre-trained weights or with random weights from scratch.
Thanks.