I have been training a BERT model on a large unsupervised dataset, and now I want to fine-tune it on a small labelled dataset. I can't quite grasp how to do this conceptually, and I'm hoping some of you can help me out.
When doing the unsupervised training/self-training, everything seems fine, and I think I understand it.
In this case, my network is a standard BERT with a linear layer on top that maps the standard 768 hidden features down to 30, which is my vocab_size. (I'm training on gene sequences, so one sample in my dataset looks like this:
ASDGDFASGDFSGSDASFASDAUYRES
where I do the standard thing of masking out some of the letters and trying to predict them.)
So for standard training, my setup looks like this:
predicted_sequence=bert(input_sequence,masked_input_sequence)
loss = crossentropy(predicted_sequence, input_sequence)  # target = the original tokens, scored at the masked positions
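To make that concrete, here is a minimal runnable sketch of what I mean. The tiny two-layer encoder, the MASK_ID value, and the 15% masking rate are just placeholders for my real setup, not part of any library:

```python
import torch
import torch.nn as nn

VOCAB = 30            # my vocab_size (gene alphabet + special tokens)
HIDDEN = 768          # standard BERT hidden size
MASK_ID = VOCAB - 1   # hypothetical id for the [MASK] token

class TinyBertMLM(nn.Module):
    """Stand-in for my setup: BERT-style encoder + linear head 768 -> 30."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(HIDDEN, VOCAB)  # the linear layer on top

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids)))  # (B, L, VOCAB) logits

torch.manual_seed(0)
B, L = 4, 27
input_sequence = torch.randint(0, 26, (B, L))   # original (unmasked) tokens
mask = torch.rand(B, L) < 0.15                  # mask ~15% of positions
mask[0, 0] = True                               # make sure at least one is masked
masked_input_sequence = input_sequence.clone()
masked_input_sequence[mask] = MASK_ID

logits = TinyBertMLM()(masked_input_sequence)
# targets are the ORIGINAL tokens; unmasked positions are ignored via -100
targets = input_sequence.masked_fill(~mask, -100)
loss = nn.functional.cross_entropy(
    logits.view(-1, VOCAB), targets.view(-1), ignore_index=-100
)
```

So the model only ever sees the masked sequence, and the loss compares its predictions at the masked positions against the original letters.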
However, when I now want to switch to fine-tuning, I'm not really sure what to do. In this case each sample in my dataset consists of both a gene sequence and a label sequence:
DFASDGFTHGFDDFSDASFDASF , 00000001111111100000000022222
How do I change my network so that I can fine-tune it to predict these new labels? Do I still use bert(input, masked_input_sequence)? Do I remove the linear layer on top of BERT? Or what is the conceptual idea here?
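For reference, here is my best guess at what the fine-tuning setup would look like: keep the pretrained encoder, swap the 768 -> 30 MLM head for a new linear layer over the label classes, and train on unmasked sequences with one label per letter. The TokenClassifier class, the dummy stand-in body, and the 3-class assumption are all mine, just to make the sketch run. Is this the right idea?

```python
import torch
import torch.nn as nn

HIDDEN = 768
NUM_LABELS = 3   # my labels are 0/1/2 in the example above

class TokenClassifier(nn.Module):
    """Pretrained BERT body + a NEW per-position classification head."""
    def __init__(self, pretrained_body):
        super().__init__()
        self.body = pretrained_body                      # reuse pretrained weights
        self.classifier = nn.Linear(HIDDEN, NUM_LABELS)  # replaces the 768 -> 30 head

    def forward(self, ids):
        hidden = self.body(ids)         # (B, L, 768); no masking during fine-tuning
        return self.classifier(hidden)  # (B, L, NUM_LABELS) per-letter label logits

# stand-in for the pretrained embedding + encoder so the sketch runs
pretrained_body = nn.Sequential(
    nn.Embedding(30, HIDDEN),
    nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True),
)

model = TokenClassifier(pretrained_body)
ids = torch.randint(0, 26, (2, 23))              # a batch of gene sequences
labels = torch.randint(0, NUM_LABELS, (2, 23))   # one label per letter
logits = model(ids)
loss = nn.functional.cross_entropy(logits.view(-1, NUM_LABELS), labels.view(-1))
```

In other words, my guess is that the input is no longer masked at all, and the cross-entropy now runs over the label classes at every position instead of over the vocabulary at masked positions. Does that match how fine-tuning is supposed to work?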