Hi guys,
I am creating a dataset to train/fine-tune GPT-2 with the nanoGPT repository. My .txt file is formatted like this:
SENTENCE
TREE PARSING
<|endoftext|>
SENTENCE
TREE PARSING
<|endoftext|>
And so on. The special token is added during tokenization: every blank line (\n\n) between examples is replaced by <|endoftext|>.
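The replacement step could be sketched roughly like this in Python. This is an illustration under assumptions, not the exact nanoGPT code: the function name `build_token_stream`, the byte-level `toy_encode` stand-in, and the sample sentence/parse pairs are all hypothetical. With the real GPT-2 tokenizer, the encoder would presumably be tiktoken's `enc.encode_ordinary` and the separator id `enc.eot_token` (50256 in the GPT-2 vocabulary).

```python
from typing import Callable, List

EOT_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary


def build_token_stream(
    raw_text: str,
    encode: Callable[[str], List[int]],
    eot_id: int = EOT_ID,
) -> List[int]:
    """Encode each blank-line-separated example and end it with eot_id."""
    ids: List[int] = []
    for record in raw_text.split("\n\n"):
        record = record.strip()
        if not record:
            continue  # skip empty chunks caused by extra blank lines
        ids.extend(encode(record))
        ids.append(eot_id)  # one <|endoftext|> after every example
    return ids


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder (UTF-8 bytes) so the sketch runs offline."""
    return list(text.encode("utf-8"))


# Made-up sample in the SENTENCE / TREE PARSING layout.
sample = (
    "The cat sleeps.\n(S (NP The cat) (VP sleeps))\n\n"
    "Dogs bark.\n(S (NP Dogs) (VP bark))\n"
)
stream = build_token_stream(sample, toy_encode)
```

Encoding each example separately and appending the id by hand keeps the separator a single token. An ordinary encode of the literal string `<|endoftext|>` would instead split it into several regular tokens.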
Is this correct? Or should I use other formatting for the model to learn how to parse sentences?
Thank you!
Some examples here