How to do sequence fine tuning?

I have been training a BERT model on a large unsupervised dataset, and now I wish to fine tune the model on a small labelled dataset, but I can’t quite grasp how to do this conceptually, and I’m hoping some of you can help me out.

When doing the unsupervised training/self-training, everything seems fine, and I think I understand it.
In this case, my network is a standard BERT with a linear layer on top that takes the standard 768 hidden features in BERT down to 30, which is my vocab_size. I’m training on gene sequences, so basically one sample in my dataset looks like this:


where I do the standard thing of masking out some of the letters and trying to predict them.
So for standard training, my setup looks like this:

    loss = crossentropy(predicted_sequence,masked_input_sequence)
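For reference, here is a minimal sketch of how the masked input and the loss targets are usually built in this kind of setup. The model sees the masked sequence, but the targets are the original letters at the masked positions, not the masked input itself. `MASK_ID`, `mask_tokens`, and the token ids are hypothetical names and values for illustration. The `-100` marker matches the default `ignore_index` of PyTorch's cross-entropy loss, and the masking rule is simplified compared to the original BERT recipe.

```python
import random

MASK_ID = 1          # hypothetical id of the [MASK] token
IGNORE_INDEX = -100  # positions with this target are skipped by the loss


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (masked_input, targets) for masked-language-model training."""
    rng = random.Random(seed)
    masked_input, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            masked_input.append(MASK_ID)   # the model sees [MASK] here
            targets.append(tok)            # and must recover the original token
        else:
            masked_input.append(tok)       # position left unchanged
            targets.append(IGNORE_INDEX)   # not scored by the loss
    return masked_input, targets


original = [5, 9, 4, 7, 5, 12, 9, 20, 8, 12, 9, 5]  # hypothetical token ids
masked_input, targets = mask_tokens(original, mask_prob=0.3)
```

The loss is then the cross-entropy between the model's predictions for `masked_input` and `targets`, so only the masked positions contribute.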

However, now that I want to switch to fine-tuning, I’m not really sure what to do. In this case each sample in my dataset consists of both a gene sequence and a label sequence:

DFASDGFTHGFDDFSDASFDASF , 00000001111111100000000022222

How do I change my network so that I can fine-tune it to predict these new labels? Do I still use bert(input,masked_input_sequence)? Do I remove the linear layer on top of BERT? Or what is the conceptual idea here?

I found this explanation by one of the original authors great to get a conceptual understanding.


If I understand correctly you need token classification. Example here:

You can assign a label to each token you have. I think you will need to consider each “letter” a token.
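To make the idea concrete, here is a rough PyTorch sketch, not a definitive implementation. It keeps the encoder body, drops the 768 -> 30 language-model layer, and adds a new linear layer that outputs one score per label for every token. `TinyEncoder` is a hypothetical stand-in for your pretrained BERT body, and `NUM_LABELS = 3` is an assumption based on the labels 0, 1, 2 in your example. The input is no longer masked during fine-tuning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 768      # BERT hidden size mentioned in the question
VOCAB_SIZE = 30   # vocabulary size mentioned in the question
NUM_LABELS = 3    # assumption: labels 0, 1, 2 as in the example


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained BERT body (no LM head)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=12, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, input_ids):
        return self.body(self.embed(input_ids))  # (batch, seq_len, HIDDEN)


class TokenClassifier(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder                     # pretrained weights would be reused here
        self.head = nn.Linear(HIDDEN, NUM_LABELS)  # new head instead of the 768 -> 30 layer

    def forward(self, input_ids):
        return self.head(self.encoder(input_ids))  # (batch, seq_len, NUM_LABELS)


model = TokenClassifier(TinyEncoder())
input_ids = torch.randint(0, VOCAB_SIZE, (2, 16))  # unmasked sequences
labels = torch.randint(0, NUM_LABELS, (2, 16))     # one label per letter
logits = model(input_ids)
# Flatten so every token position is one classification example.
loss = F.cross_entropy(logits.reshape(-1, NUM_LABELS), labels.reshape(-1))
loss.backward()
```

Note that this requires the label sequence to have exactly one label per token of the input sequence.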


Thank you for the link, that was the kind of conceptual talk I was looking for.


The examples there are a bit over my head for now, but I got some ideas for how to proceed now, and if I later get stuck again I might try to look at them more closely to see how they do it.

I also found this talk from Alex Graves to be complementary. He speaks about transformers more generally, including their advantages and disadvantages.

I found the Q&A at the end of these videos to be helpful.
