Data format in run_lm_fine_tuning.py

kintaro · September 7, 2020, 1:57am

Hello everyone, I would like to ask for help with the following:

I want to fine-tune a language model for text generation, and for this I will use run_lm_fine_tuning.py.

I would like to know the best way to train the model. Can it be trained on a plain CSV file, or does the file need some kind of pre-processing?

Or is it necessary to train it with an instance of the Dataset class?

Thanks for the help

valhalla · September 7, 2020, 4:24am

Hi @kintaro,

run_lm_fine_tuning.py has been renamed to run_language_modeling.py; you can find it here.

The data file can be in one of two formats:

Line by line: each example is on its own line, separated by \n. Set the --line_by_line command line argument.
Or just a plain text file from which examples will be sampled.
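
To make the difference between the two formats concrete, here is a minimal sketch. It is not the script's actual code: whitespace splitting stands in for the real tokenizer, the function names are hypothetical, and both the truncation of long lines and the dropping of a trailing partial block are assumptions about how the data is cut up.

```python
def line_by_line_examples(text, block_size):
    """One example per non-empty line (assumed: long lines are truncated)."""
    examples = []
    for line in text.splitlines():
        tokens = line.split()  # stand-in for the real tokenizer
        if tokens:
            examples.append(tokens[:block_size])
    return examples


def block_examples(text, block_size):
    """Ignore line boundaries and cut the token stream into full blocks.

    Assumed: a trailing remainder shorter than block_size is dropped.
    """
    tokens = text.split()  # stand-in for the real tokenizer
    last_start = len(tokens) - block_size
    return [tokens[i:i + block_size] for i in range(0, last_start + 1, block_size)]


sample = "the cat sat\nthe dog barked loudly at night\n\nhello"
by_line = line_by_line_examples(sample, block_size=4)
by_block = block_examples(sample, block_size=4)
```

In the first mode an example never crosses a line boundary, so short lines give short examples. In the second mode line breaks carry no meaning and every example has the same length.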

This notebook also walks through how a language model can be trained.
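
If the data starts out as a CSV, one option is to convert a single column into a line-by-line text file first. The following is only a rough sketch using the standard library: the function name and the column name "text" are hypothetical, and collapsing newlines inside a cell is an assumption made so that each example stays on one line.

```python
import csv


def csv_column_to_lines(csv_path, txt_path, column="text"):
    """Write one example per line from a CSV column; return the count written."""
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse internal newlines and repeated whitespace so that
            # each example occupies exactly one line in the output file.
            text = " ".join((row.get(column) or "").split())
            if text:  # skip empty cells
                dst.write(text + "\n")
                written += 1
    return written


# Hypothetical usage:
# csv_column_to_lines("data.csv", "train.txt", column="text")
```

The resulting text file can then be passed as the training data file together with --line_by_line.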

kintaro · September 8, 2020, 6:31pm

@valhalla Thanks a lot for your answer. Let me ask:

1. With line_by_line, is there any limit on the maximum number of tokens per line? Maybe 512?

It seems like the easiest way is with the plain text file.
