Fine-Tuning Pre-trained Models: Issues and Gotchas

I’m trying to improve the embeddings of an existing model (in my case the pre-trained checkpoint is “microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext”) for my specific domain (pathology diagnostics) by fine-tuning the pre-trained model on my custom language corpus, thereby transferring knowledge from the broad language to the domain-specific one. The process seems conceptually straightforward, but I found a lot of devil in the details. Perhaps somebody from the community can point me in the right direction or to resources that address these issues?

To transfer the learned word embeddings from the broad language to the specific one, I need to continue training the model the same way it was trained before, but on a different dataset. That means I have to figure out how the model was originally trained and how to swap in my own data.
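
For concreteness, this is roughly what I think “continue the same training on new data” would look like as continued masked-language-model pretraining. The corpus file name, output directory, and hyperparameters below are placeholders I made up, not values recovered from the checkpoint:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Placeholder corpus file: one document (or sentence) per line of plain text.
dataset = load_dataset("text", data_files={"train": "pathology_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator applies BERT-style random masking on the fly (15% by default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="pubmedbert-pathology",  # placeholder output path
    per_device_train_batch_size=16,     # illustrative values, not the original settings
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```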

How to find out how the model was trained before and how to continue with the same settings:

  1. How can I find out which task the model was trained on? Do I need to explicitly set up the same training task, or is there a way to continue the same training cycle from the checkpoint data? Can I retrieve task information from the checkpoint?

  2. How do I find out the training parameters used to train that model, and how can I use them to continue training the same way? Can I retrieve the Trainer configuration and trainer.args from the checkpoint? (The sketch after this list shows what I have been able to inspect so far.)
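
Is something along these lines the intended way to inspect a checkpoint? The local checkpoint path below is just a placeholder, and I am not sure this recovers more than the architecture:

```python
from transformers import AutoConfig

# The config stored with the checkpoint describes the architecture, which at
# least hints at the pretraining task (e.g. a masked-LM head).
config = AutoConfig.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
print(config.architectures)  # e.g. ['BertForMaskedLM']
print(config.model_type, config.vocab_size, config.max_position_embeddings)

# Checkpoints written locally by the Trainer also contain training_args.bin with
# the TrainingArguments, but the published hub checkpoint does not seem to:
# import torch
# args = torch.load("path/to/local-checkpoint/training_args.bin")  # placeholder path
```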

How to change the dataset for the pre-trained model:

  1. Since I need to load my own corpus for training, how do I find out how to prepare the CSV file for loading into a dataset so that the tokenizer and the model Trainer will accept it (column names, one sentence or one document per line, any cleanup, any pre-tokenization processing expected by the Trainer)? Can I retrieve the dataset structure or a reference to it from the checkpoint? (A sketch of the layout I have in mind follows this list.)

  2. Since my language is different, my vocabulary is different too. I can add words to the pre-trained tokenizer’s vocabulary with tokenizer.add_tokens(add_vocab), but do I need to do anything to the model so it accepts the updated vocabulary? (Also shown in the sketch below.)
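
To make both points concrete, this is what I am currently trying. The CSV file name and the added terms are placeholders, and I am not sure whether resizing the embeddings is the right (or only) extra step:

```python
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# A CSV with a header row and a single "text" column, one document per row.
dataset = load_dataset("csv", data_files={"train": "pathology_reports.csv"})  # placeholder file

def tokenize(batch):
    # The tokenizer handles lowercasing and WordPiece splitting; no manual
    # pre-tokenization seems to be required beyond basic text cleanup.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Extending the vocabulary: new tokens get new ids, so the model's embedding
# matrix has to grow to match; the new rows start out randomly initialized.
add_vocab = ["lymphovascular", "immunohistochemical"]  # placeholder domain terms
num_added = tokenizer.add_tokens(add_vocab)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```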

I could not find answers in any fine-tuning references; perhaps someone from the community has more information?

Many Thanks!

Hello! Thank you for your question. Regarding the first part, I suggest reading this paper, as suggested here; you will find many details about how the model was trained (from parameters to task), or references to other papers where that information is given. I would also use the checkpoint-loading functionality of the framework they used for implementation to see whether they saved the hyperparameters with the checkpoint. For any specific details about training and data pre-processing that are missing from the paper and cannot be inferred from its references, it is always worth contacting the authors!

You will have to explore what learning parameters make sense for your task and data; you don’t need to train with the parameters they trained with. So, in a way, their training configuration does not matter much for what you do. You just need to find a set of parameters that works well for your task.

Depending on your task, you might want to use the Trainer and the existing data loaders, or implement your own, so I can’t give a definitive answer there. As for the vocabulary, I would expect the tokenizer to break unseen domain words into subword units the model already learned during pretraining, so I don’t think there is any additional step you need to take in that regard.
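
For example, you can quickly check how the tokenizer handles a domain term that is not a whole word in its vocabulary (the term below is just an illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
# An out-of-vocabulary word is typically split into known WordPiece units
# rather than mapped to [UNK].
print(tokenizer.tokenize("lymphovascular invasion"))  # placeholder domain phrase
```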

Thank you for the clarification. I guess I was hoping that pre-trained models would come with more readily available information about how they were trained (Trainer config, params, dataset structure/size, epochs, etc.), and that we could access this information for fine-tuning purposes instead of digging through papers.
