I want to fine-tune the GPT model on my own data for causal language modeling. I am currently using the script provided in the examples directory of the repository:
My question is about preprocessing the data. I suppose I need to indicate somehow where one sequence ends and the next begins. I understand how this can be done for GPT-2, but GPT does not seem to have a ‘[SEP]’ token. Would it be sufficient to just add this token to the vocabulary, or did I miss something?
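
To make the question concrete, here is a minimal sketch of what I mean by "adding the token". It is only a toy: the word-level `vocab` dictionary and the helpers `add_token` and `encode_corpus` are hypothetical names made up for illustration, not the real GPT BPE tokenizer or any library API. The idea is to give the separator the next free id, append it after every sequence, and note that the model's embedding matrix would then need one extra row.

```python
# Toy illustration only: a hypothetical word-level vocabulary,
# not the real GPT tokenizer.
SEP = "[SEP]"


def add_token(vocab, token):
    """Add `token` to the vocabulary if it is missing and return its id."""
    if token not in vocab:
        vocab[token] = len(vocab)  # next free id
    return vocab[token]


def encode_corpus(sequences, vocab, sep_id):
    """Encode whitespace-split sequences into one id stream,
    appending `sep_id` after each sequence."""
    ids = []
    for seq in sequences:
        ids.extend(vocab[word] for word in seq.split())
        ids.append(sep_id)
    return ids


vocab = {"hello": 0, "world": 1, "good": 2, "morning": 3}
old_size = len(vocab)

sep_id = add_token(vocab, SEP)
stream = encode_corpus(["hello world", "good morning"], vocab, sep_id)
print(stream)  # [0, 1, 4, 2, 3, 4]

# Number of new embedding rows the model would need for the added token.
new_rows = len(vocab) - old_size
print(new_rows)  # 1
```

If the repository in question is Hugging Face Transformers (an assumption on my part), I believe the corresponding steps would be `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, with the new embedding row learned during fine-tuning.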