GPT-2 Data Preparation for Parsing Trees

IParraMartin · May 6, 2024, 11:14pm

Hi guys,

I am creating a dataset to train/fine-tune GPT-2 with the NanoGPT repository. I have this formatting in a .txt file:

SENTENCE

TREE PARSING

<|endoftext|>

SENTENCE

TREE PARSING

<|endoftext|>

And so on (the special token is added during tokenization, where \n\n is replaced by it).
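
To make the layout concrete, here is a minimal sketch of how a file like this could be built and read back. The helper names (`build_corpus`, `split_corpus`) and the two example sentences with their bracketed trees are hypothetical, made up only for illustration; they are not part of NanoGPT.

```python
EOT = "<|endoftext|>"  # GPT-2's end-of-text marker, written literally into the .txt file


def build_corpus(pairs):
    """Join (sentence, tree) pairs into one string: sentence, blank line, tree, blank line, EOT."""
    records = []
    for sentence, tree in pairs:
        records.append(f"{sentence.strip()}\n\n{tree.strip()}\n\n{EOT}")
    return "\n\n".join(records) + "\n"


def split_corpus(text):
    """Recover the (sentence, tree) pairs, assuming EOT never occurs inside a record."""
    pairs = []
    for record in text.split(EOT):
        record = record.strip()
        if not record:
            continue  # skip the empty tail after the last EOT
        sentence, tree = record.split("\n\n", 1)
        pairs.append((sentence.strip(), tree.strip()))
    return pairs


# Hypothetical examples, only to show the format.
pairs = [
    ("The cat sleeps.", "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))"),
    ("Dogs bark.", "(S (NP (NNS Dogs)) (VP (VBP bark)) (. .))"),
]
corpus = build_corpus(pairs)
```

One thing I am unsure about: as far as I understand, tiktoken's `encode_ordinary` treats a literal `<|endoftext|>` in the text as ordinary characters, so the marker would need to be allowed explicitly (`enc.encode(text, allowed_special={"<|endoftext|>"})`) or appended per record as `enc.eot_token` for it to become the single special token.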

Is this correct? Or should I use a different format for the model to learn how to parse sentences?

Thank you!

Some examples here
