I am hoping to confirm my understanding of some definitions in the context of BERT.
(1) Pre-training means running a corpus through the BERT architecture, where masked language modeling and next sentence prediction are used to derive the weights. You can do this (a) from scratch, with your own vocabulary and randomly initialized weights, or (b) using the pre-trained BERT vocab/weights (so you are in effect “pre-training a pre-trained model”). I’ve sketched what I mean by (b) below, after item (2).
(2) Fine-tuning means adding a layer on top of the BERT architecture and training it for some downstream task, such as classification (also sketched below).
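For (1b), this is roughly what I have in mind: a minimal sketch of continuing masked-language-model pre-training from the released weights on my own corpus. I’m assuming the Hugging Face `transformers` and `datasets` libraries here, `my_corpus.txt` is just a placeholder file, and this sketch only does MLM (no next sentence prediction), so it’s not the full original pre-training objective.

```python
# Continued MLM pre-training starting from the released BERT checkpoint.
# Assumes Hugging Face `transformers` + `datasets`; "my_corpus.txt" is a placeholder.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # start from pre-trained weights

# Load and tokenize my own corpus (one text example per line).
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# The collator randomly masks 15% of tokens, i.e. the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-continued-pretraining",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```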
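And for (2), this is what I picture by “adding a layer”: a classification head on top of BERT, trained on labeled examples. Again a minimal sketch assuming Hugging Face `transformers`; the two example sentences and labels are made up, and in practice this would sit inside a normal training loop.

```python
# Fine-tuning sketch: a classification layer on top of pre-trained BERT.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled batch (placeholders).
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # classification head on top of BERT's output
outputs.loss.backward()                  # one gradient step of the usual training loop
```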
Questions
(A) Is there anything incorrect in my understanding above?
(B) Suppose my goal is only to get better embeddings (e.g., for computing cosine similarity between sentences). Would I just want to pre-train the model on my corpus? Is fine-tuning also used to get better embeddings? For example, if I fine-tune the pre-trained BERT model for some classification task, could I use the second-to-last hidden layer to derive sentence embeddings that could later be used to compute cosine similarity between sentences? I currently use the second-to-last hidden layer of downloaded pre-trained BERT models for my sentence embeddings, roughly as in the snippet below.
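For reference, this is approximately how I extract those embeddings today: mean-pooling the second-to-last hidden layer of a downloaded pre-trained BERT. This assumes Hugging Face `transformers`, and mean pooling is just my choice, not the only way to pool.

```python
# Sentence embeddings from the second-to-last hidden layer of pre-trained BERT.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["The cat sat on the mat.", "A cat was sitting on a rug."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**batch).hidden_states  # tuple: embedding layer + 12 encoder layers

second_to_last = hidden_states[-2]                        # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1).float()      # ignore padding tokens
emb = (second_to_last * mask).sum(1) / mask.sum(1)        # mean-pooled sentence vectors

cos = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(cos.item())
```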
I’m trying to understand: if you wanted to do semantic similarity in the future, would you rather derive embeddings from your pre-trained BERT or from your pre-trained AND fine-tuned BERT?