Data format in

Hello everyone, I would like to ask for help with the following:

I want to fine-tune a language model for text generation. For this I will use

I would like to know the best way to train the model. Can it be trained on a plain CSV file, or does the file need some kind of pre-processing?

Or is it necessary to train it with an instance of the Dataset class?

Thanks for the help

Hi @kintaro, is now renamed to , you can find it here.

The data file can be in one of two formats:

  1. Line by line: each example is on its own line, separated by \n. Set the --line_by_line command line argument to use this format.
  2. A plain text file from which examples will be sampled.
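Since the original question mentions a CSV file, here is a minimal sketch of one way to turn a CSV column into the line-by-line format described above. The function name `csv_to_line_by_line` and the column name `text` are hypothetical and not part of any training script; adjust them to your own data. Newlines inside a cell are collapsed so that each example stays on a single line.

```python
import csv


def csv_to_line_by_line(csv_path, txt_path, text_column="text"):
    """Write one example per line from a CSV column; return the number of lines written.

    `text_column` is an assumed column name, change it to match your file.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse internal newlines and repeated spaces so the
            # example occupies exactly one line in the output file.
            text = " ".join((row.get(text_column) or "").split())
            if text:  # skip empty cells
                dst.write(text + "\n")
                written += 1
    return written
```

The resulting text file can then be passed as the training file together with the --line_by_line argument.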

This notebook also walks through how a language model can be trained.

@valhalla Thanks a lot for your answer. Let me ask:

1. With line_by_line, is there a limit on the maximum number of tokens per line? Maybe 512?

It seems like the easiest way is to use the plain text file.
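To get a feel for whether a per-line limit would matter for a given file, a rough check like the following can help. This is only a sketch: the 512 default is the guess from the question above, not a confirmed limit, and whitespace splitting is a stand-in for the model's real tokenizer, which usually produces more tokens than there are words. The `find_long_lines` helper is hypothetical; pass your tokenizer's own tokenize function as `tokenize` for realistic counts.

```python
def find_long_lines(lines, max_tokens=512, tokenize=str.split):
    """Return (line_number, token_count) for every line longer than max_tokens.

    `max_tokens=512` is an assumed limit; `tokenize` defaults to whitespace
    splitting, which only approximates a subword tokenizer.
    """
    too_long = []
    for number, line in enumerate(lines, start=1):
        count = len(tokenize(line))
        if count > max_tokens:
            too_long.append((number, count))
    return too_long


# Usage: check a small in-memory sample against a deliberately low limit.
sample = ["a short example", "this line has quite a few more words in it"]
print(find_long_lines(sample, max_tokens=5))
```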