- Model: GPT2
- Pre-train data: general protein (amino acid) sequences
- Fine-tune data: a specific type of protein, with defining properties in categorical form
How to incorporate the defining properties into fine-tuning?
As summarized above, I currently have a GPT2 model that has been pre-trained from scratch on general protein sequences. I want to fine-tune it on a specific type of protein in order to generate novel sequences of that type. I also want to incorporate the proteins' defining properties into the fine-tuning process, so that I can generate sequences with user-defined properties.
Since the defining properties are in categorical form, is it possible to incorporate them into the data as special tokens for fine-tuning? As far as I know, special tokens are not affected by positional encoding, so does it matter whether I place them at the start or the end of each sequence?
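To make the special-token idea concrete, here is a minimal sketch in plain Python of how the fine-tuning examples could be laid out. Everything in it is an assumption for illustration: the property names and values, the `<name=value>` token format, the `<bos>`/`<eos>` markers, and the helper names are all hypothetical and not taken from any particular library. The sketch places the property tokens at the start, since a left-to-right model like GPT2 only conditions each token on the tokens before it.

```python
# Hypothetical property categories; the names and values are illustrative only.
PROPERTIES = {
    "localization": ["membrane", "cytosol"],
    "thermostability": ["low", "high"],
}

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
BOS, EOS = "<bos>", "<eos>"  # assumed sequence boundary markers


def property_token(name, value):
    """Render one categorical property value as a single special token."""
    return f"<{name}={value}>"


def property_tags(props):
    """Return the property tokens in a fixed order (the order of PROPERTIES)."""
    return [property_token(name, props[name]) for name in PROPERTIES]


def build_vocab():
    """Map every token to an integer id: markers, residues, then property tokens."""
    tokens = [BOS, EOS] + AMINO_ACIDS
    for name, values in PROPERTIES.items():
        tokens += [property_token(name, value) for value in values]
    return {token: index for index, token in enumerate(tokens)}


def encode_example(sequence, props, vocab):
    """Encode one fine-tuning example as <bos> + property tags + residues + <eos>."""
    tokens = [BOS] + property_tags(props) + list(sequence) + [EOS]
    return [vocab[token] for token in tokens]


def generation_prompt(props, vocab):
    """Encode the prefix to feed the model when sampling with chosen properties."""
    return [vocab[token] for token in [BOS] + property_tags(props)]


if __name__ == "__main__":
    vocab = build_vocab()
    wanted = {"localization": "membrane", "thermostability": "high"}
    print(encode_example("MKV", wanted, vocab))
    print(generation_prompt(wanted, vocab))
```

With this layout, the prompt used at generation time is exactly the prefix of every training example, so the model sees the same pattern in both settings. If the model lives in Hugging Face `transformers`, the new tokens would also need to be registered with the tokenizer (for example via `add_special_tokens`) and the embedding matrix resized with `resize_token_embeddings`; the new embeddings start untrained and are learned during fine-tuning.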
Thank you in advance to anyone with input!