GPT-2 Data Preparation for Parsing Trees

Hi guys,

I am creating a dataset to train/fine-tune GPT-2 with the nanoGPT repository. My .txt file is formatted like this:

SENTENCE

TREE PARSING

<|endoftext|>

SENTENCE

TREE PARSING

<|endoftext|>

And so on. (The special token is added during tokenization: each \n\n between records is replaced by <|endoftext|>.)
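To make the replacement step concrete, here is a minimal sketch of how I picture it. The helper names and the two sample sentences/trees are made up for illustration, and I am assuming a single newline between a sentence and its tree so that a blank line only ever marks the end of a record:

```python
# Sketch only: helper names and sample data are hypothetical, not from nanoGPT.
EOT = "<|endoftext|>"


def format_records(pairs):
    """Join (sentence, tree) pairs into one string.

    A single newline separates a sentence from its tree, so a blank
    line ("\n\n") appears only after a complete record.
    """
    records = [f"{sentence.strip()}\n{tree.strip()}" for sentence, tree in pairs]
    return "\n\n".join(records) + "\n\n"


def mark_record_boundaries(text):
    """Replace every blank-line separator with the special token on its own line."""
    return text.replace("\n\n", f"\n{EOT}\n")


# Made-up examples, only to show the layout.
pairs = [
    ("The cat sleeps.", "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))"),
    ("Dogs bark.", "(S (NP (NNS Dogs)) (VP (VBP bark)) (. .))"),
]

corpus = mark_record_boundaries(format_records(pairs))
print(corpus)
```

One thing I am unsure about: if the marker is written into the text as a literal string, the tokenizer has to be told to treat it as a special token. As far as I know, tiktoken's `encode` only maps it to the end-of-text id when called with `allowed_special={"<|endoftext|>"}`, while `encode_ordinary` splits it into ordinary tokens.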

Is this correct? Or should I use other formatting for the model to learn how to parse sentences?

Thank you! :blush:


Some examples here