I’m trying to improve the embeddings of an existing model (in my case the pre-trained checkpoint is “microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext”) for my specific domain (pathology diagnostics) by fine-tuning the pre-trained model on my custom language corpus, thereby transferring knowledge from the broad language to the specific one. The process seems conceptually straightforward, but I found a lot of devil in the details. Perhaps somebody from the community can point me in the right direction or to resources that address these issues?
To transfer the learned word embeddings from broad language to specific language, I need to continue training the model the same way it was trained before, but on a different dataset. So I have to figure out two things: how the model was trained originally, and how to swap in my own data.
How to find out how the model was trained before, and how to continue with the same settings:
How can I find out which task the model was trained on? Do I need to explicitly set up the same training task, or is there some way to continue the same training cycle from the checkpoint data? Can I retrieve the task information from the checkpoint?
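From what I understand, the `config.json` that ships with every Hub checkpoint records the model class in its `architectures` field, which tells you which head (and hence which pre-training objective) the weights were saved with. A stdlib-only sketch, using an illustrative config written to a temp directory (the field values below are examples, not read from the actual PubMedBERT repo; the model card and paper remain the authoritative source for training details):

```python
import json
import pathlib
import tempfile

# Illustrative config.json contents -- a real checkpoint directory
# (e.g. the local cache created by AutoModel.from_pretrained) contains
# a file like this. Values here are sample values, not PubMedBERT's.
sample_config = {
    "architectures": ["BertForMaskedLM"],
    "model_type": "bert",
    "vocab_size": 30522,
}

ckpt = pathlib.Path(tempfile.mkdtemp())
(ckpt / "config.json").write_text(json.dumps(sample_config))

# Reading the config back tells you which head the checkpoint was saved
# with: "BertForMaskedLM" implies the masked-language-modelling objective,
# so continued pre-training would use AutoModelForMaskedLM.
config = json.loads((ckpt / "config.json").read_text())
architectures = config["architectures"]
assert architectures == ["BertForMaskedLM"]
```

In practice `transformers.AutoConfig.from_pretrained(model_name)` does this parsing for you; the checkpoint stores the architecture and weights, but not the training loop itself, so the task has to be set up explicitly.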
How do I find out the training parameters used to train that model, and how can I use them to continue training the same way? Can I retrieve the Trainer configuration and trainer.args from the checkpoint?
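As far as I can tell, checkpoints saved by `transformers.Trainer` (the `checkpoint-NNN/` directories) do contain `training_args.bin` (a torch-pickled `TrainingArguments`) and a `trainer_state.json`, but Hub model repos like PubMedBERT's usually do not ship these, so the original pre-training hyperparameters have to come from the model card or paper. A stdlib-only sketch of reading a local `trainer_state.json`, with illustrative contents (the values below are made up):

```python
import json
import pathlib
import tempfile

# Illustrative trainer_state.json as saved by transformers.Trainer in a
# local checkpoint-NNN/ directory. Values are example values only.
state = {
    "epoch": 1.0,
    "global_step": 500,
    "log_history": [{"loss": 2.31, "learning_rate": 5e-5, "step": 500}],
}

ckpt = pathlib.Path(tempfile.mkdtemp())
(ckpt / "trainer_state.json").write_text(json.dumps(state))

# From a real local checkpoint you could also do
#   torch.load(ckpt / "training_args.bin")
# to recover the full TrainingArguments object.
loaded = json.loads((ckpt / "trainer_state.json").read_text())
last_lr = loaded["log_history"][-1]["learning_rate"]
assert last_lr == 5e-5
```

So resuming your own interrupted run is possible (`trainer.train(resume_from_checkpoint=...)`), but "continuing" the upstream pre-training means reconstructing the hyperparameters yourself from the published description.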
How to change the dataset for the pre-trained model:
Since I need to load my own corpus for training, how do I find out how to prepare a CSV file for loading into the dataset, so that the tokenizer and the Trainer will accept it (column names, one sentence per line or one document per line, any cleanup, any pre-tokenization processing expected by the Trainer)? Can I retrieve the dataset structure, or a reference to it, from the checkpoint?
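The checkpoint does not record the dataset layout, but in my reading of the `datasets` docs the simplest CSV shape is a single `text` column, one passage per row. A stdlib-only sketch of writing such a file (the pathology sentences and the `corpus.csv` filename are hypothetical):

```python
import csv

# Hypothetical pathology passages; one text per row, a single "text"
# column is the simplest layout for datasets.load_dataset("csv", ...).
docs = [
    "The biopsy shows invasive ductal carcinoma, grade 2.",
    "Margins are free of tumor; no lymphovascular invasion identified.",
]

with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerows({"text": d} for d in docs)
```

After that, `datasets.load_dataset("csv", data_files="corpus.csv")` should give a dataset whose `text` column you `map` through the tokenizer; for MLM-style continued pre-training, `DataCollatorForLanguageModeling(tokenizer, mlm=True)` applies the masking at batch time, so no pre-masking or special cleanup is needed in the file itself.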
Since my language is different, my vocabulary is different too. I can add words to the vocabulary of the pre-trained model's tokenizer [tokenizer.add_tokens(add_vocab)], but do I also need to do something to the model so it accepts the updated vocabulary?
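From the transformers docs, yes: after `add_tokens` the model's embedding matrix must be grown to match via `model.resize_token_embeddings(len(tokenizer))`, otherwise the new token ids fall outside the embedding table. A minimal sketch using a toy randomly-initialized BERT so it runs without downloading the real checkpoint (the config sizes and `num_added = 7` are toy values; in real use you would load the PubMedBERT checkpoint and use `len(tokenizer)` after adding tokens):

```python
from transformers import BertConfig, BertForMaskedLM

# Toy config standing in for the real checkpoint, which would be loaded
# with AutoModelForMaskedLM.from_pretrained(...). Sizes are illustrative.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForMaskedLM(config)

num_added = 7  # e.g. num_added = tokenizer.add_tokens(add_vocab) in real use
model.resize_token_embeddings(config.vocab_size + num_added)

# The input embedding table now has rows for the new tokens; the new rows
# are randomly initialized and get trained during fine-tuning.
assert model.get_input_embeddings().num_embeddings == 107
```

Note that the added rows start from random initialization, so they only become meaningful after the continued-pre-training step on the domain corpus.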
I could not find answers to these questions in any fine-tuning reference; perhaps someone from the community has more information?