Adding special tokens to GPT2 for fine-tuning

  • Model: GPT2
  • Pre-train data: general protein (amino acid) sequences
  • Fine-tune data: a specific type of protein, with defining properties given in categorical form

How to incorporate the defining properties into fine-tuning?

As summarized above, I currently have a GPT2 model that has been pre-trained from scratch on general protein sequences. I want to fine-tune the model on a specific type of protein in order to generate novel sequences of that type. I also want to incorporate the proteins' defining properties into the fine-tuning process, so that I can generate sequences with user-defined properties.

Since the defining properties are categorical, is it possible to incorporate them into the data as special tokens for fine-tuning? Also, as far as I know, special tokens are not affected by positional encoding, so does it matter whether I place them at the start or the end of each sequence?
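To make the idea concrete, here is a minimal sketch of how I imagine formatting the fine-tuning data. The property names and their special tokens below are hypothetical placeholders, not my actual categories:

```python
# Hypothetical mapping from categorical defining properties to special tokens
PROPERTY_TOKENS = {
    "membrane": "<membrane>",
    "soluble": "<soluble>",
    "thermostable": "<thermostable>",
}

def format_example(sequence: str, prop: str, place: str = "start") -> str:
    """Attach the property's special token to a protein sequence,
    either before ("start") or after ("end") the amino acids."""
    token = PROPERTY_TOKENS[prop]
    if place == "start":
        return f"{token}{sequence}"
    return f"{sequence}{token}"

# Example fine-tuning lines (sequences are made up for illustration):
print(format_example("MKTAYIAKQR", "membrane"))          # token at the start
print(format_example("MKTAYIAKQR", "soluble", "end"))    # token at the end
```

If it matters, I am using the HuggingFace `transformers` library, so I believe the new tokens would be registered with `tokenizer.add_special_tokens({"additional_special_tokens": [...]})` followed by `model.resize_token_embeddings(len(tokenizer))` before training, but please correct me if that is not the right approach.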

Thank you in advance to anyone with input!