S3 inputs for finetuning GPT2 on SageMaker

Hello,

I want to finetune GPT2-XL with my own dataset on SageMaker. There are about 50 million entries. Each entry is a short text comparable to a tweet or a Midjourney prompt (between 50 and 100 words, sentences and/or keywords). I have one big .txt file with one entry per line. The goal is to be able to generate new random entries.

When I create a training job on SageMaker, it asks for an S3 bucket containing my input data, but I have no idea what format to use. Is it OK to upload the raw text file with one entry per line? Should I convert it to JSON with one object per line? Should I put the beginning of the text and the full text in the same object? I'm a bit lost with this.

Is there any documentation about the format I should use for text generation?

Regards,

So, which format should I use?

  1. file1.txt
     I didn't today... Must mean I need to take another trip!
     Did you know that suicide is the second leading cause of death among teens 10-24
  2. file1.json
     {"inputs": "I didn't today... Must mean I need to take another trip!"}
     {"inputs": "Did you know that suicide is the second leading cause of death among teens 10-24"}
  3. file2.json
     {"part": "I didn't today...", "full": "I didn't today... Must mean I need to take another trip!"}
     {"part": "Did you know", "full": "Did you know that suicide is the second leading cause of death among teens 10-24"}

I think it really depends on your implementation, right?
There is an open-source GitHub repo called nanoGPT that uses the .txt format, so you can use your text file as it is. GitHub - karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.
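Roughly, nanoGPT's data prep just tokenizes the raw text into binary token files that its training script memory-maps. A sketch modeled on its prepare.py scripts (file names here are placeholders, not nanoGPT's actual paths):

```python
# Sketch of nanoGPT-style data prep: encode the raw .txt with the GPT-2
# tokenizer and dump the token ids to train.bin / val.bin.
import numpy as np
import tiktoken

with open("entries.txt", "r", encoding="utf-8") as f:
    data = f.read()

enc = tiktoken.get_encoding("gpt2")      # GPT-2 BPE tokenizer
ids = enc.encode_ordinary(data)          # encode without special tokens

split = int(0.9 * len(ids))              # simple 90/10 train/val split
np.array(ids[:split], dtype=np.uint16).tofile("train.bin")
np.array(ids[split:], dtype=np.uint16).tofile("val.bin")
```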
Also, which instance are you using on SageMaker to train this?
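For the S3 side of your question: SageMaker just copies whatever sits under the S3 prefix you pass to fit() into the training container (under /opt/ml/input/data/<channel>), so the format only needs to match what your own training script reads. A minimal sketch with the Hugging Face estimator, assuming a train.py of your own; the instance type, framework versions, bucket path and hyperparameters are just examples:

```python
# Minimal sketch of a SageMaker Hugging Face training job.
# role, bucket/prefix, versions and train.py are placeholders/assumptions.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

estimator = HuggingFace(
    entry_point="train.py",          # your fine-tuning script
    source_dir="scripts",            # directory containing train.py
    instance_type="ml.p3.2xlarge",   # pick a GPU instance that fits GPT2-XL
    instance_count=1,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 1, "model_name": "gpt2-xl"},
)

# The "train" channel is downloaded to /opt/ml/input/data/train inside the
# container (also exposed to your script as the SM_CHANNEL_TRAIN env var).
estimator.fit({"train": "s3://my-bucket/gpt2-data/"})
```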