How to use dataset with run_language_modeling?

kmann9 · April 23, 2021, 10:37pm

I have downloaded the s2orc dataset and saved it to disk.

Since it’s stored in the Arrow format, I cannot figure out how to pass it to run_language_modeling.py, which seems to require a plain text file.

It seems like it would be simple. Can anyone help?

kmann9 · April 24, 2021, 4:29pm

I modified run_language_modeling.py a little to make it work.

I would still be interested to hear if this is supported.
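
For anyone landing here with the same problem: one possible workaround that leaves the script untouched is to export the saved Arrow dataset to a plain text file first. The sketch below is not the modification described above, just an illustration under a few assumptions: the dataset was written with `save_to_disk`, the column holding the text is called `text`, and if a `DatasetDict` was saved, the split of interest is `train`. The function names `write_text_file` and `export_saved_dataset` are made up for this example.

```python
from typing import Iterable, Mapping


def write_text_file(records: Iterable[Mapping], out_path: str,
                    text_column: str = "text") -> int:
    """Write one record per line to out_path; return the number of lines written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            text = record.get(text_column)
            if not text:
                # Skip rows with a missing or empty text field.
                continue
            # Collapse internal whitespace and newlines so each record stays on one line.
            f.write(" ".join(str(text).split()) + "\n")
            count += 1
    return count


def export_saved_dataset(dataset_dir: str, out_path: str,
                         text_column: str = "text", split: str = "train") -> int:
    """Load a dataset saved with save_to_disk and dump one column to a text file."""
    # Imported here so write_text_file can be used without the library installed.
    from datasets import load_from_disk

    dataset = load_from_disk(dataset_dir)
    # load_from_disk returns a DatasetDict (a dict of splits) if one was saved.
    # The split name is an assumption; adjust it to match your data.
    if isinstance(dataset, dict):
        dataset = dataset[split]
    # Iterating a Dataset yields one dict per row.
    return write_text_file(dataset, out_path, text_column)
```

Calling something like `export_saved_dataset("path/to/saved_dataset", "train.txt")` (both paths are placeholders) should produce a file that can then be given to the script through its `--train_data_file` argument. Check the actual column names of your copy of the dataset before relying on `text`.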
