Sentence Order Prediction - Dataset Creation

schopra · October 8, 2021, 2:17am

Hi all,

I noticed that the LineByLineWithSOPTextDataset is slated for deprecation, with the recommendation to use the Datasets library. Scanning the documentation of Datasets, I couldn’t find a drop-in replacement.

Does such a replacement exist? If not, is there a timeline for the deprecation, and is a replacement on the horizon? Thanks so much!

Warm regards,
Sahil

lhoestq · October 21, 2021, 9:19am

Hi! This class is deprecated in favor of using the Datasets library to load the dataset and the Dataset.map function to apply the tokenization.

You can check out the run_mlm.py script to see how it loads a text dataset and then applies tokenization.
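For reference, the load-then-map pattern can be sketched roughly as follows. This is a minimal sketch, not a copy of run_mlm.py: the helper names tokenize_batch and load_and_tokenize, the file path argument, and the max_length default are assumptions made for illustration, and tokenizer is assumed to be a Hugging Face tokenizer (or any callable with the same call signature).

```python
def tokenize_batch(batch, tokenizer, max_length=128):
    """Tokenize one batch of lines in the dict-of-lists format used by batched map."""
    # Skip empty and whitespace-only lines so they do not become examples.
    lines = [line for line in batch["text"] if line and not line.isspace()]
    return tokenizer(lines, truncation=True, max_length=max_length)


def load_and_tokenize(path, tokenizer):
    """Load a plain-text file (one example per line) and tokenize it."""
    # Imported lazily so tokenize_batch can be tried without the library installed.
    from datasets import load_dataset

    # The "text" loader yields a single "text" column with one row per line.
    raw = load_dataset("text", data_files={"train": path})["train"]
    return raw.map(
        lambda batch: tokenize_batch(batch, tokenizer),
        batched=True,
        remove_columns=["text"],  # required: filtering changes the row count
    )
```

Dropping the original text column matters here because the function may return fewer rows than it receives, and batched map only allows that when the old columns are removed.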

It should be possible to use the LineByLineWithSOPTextDataset.create_examples_from_document in Dataset.map to get examples in the SOP format.
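To illustrate the idea without relying on the deprecated class, here is a simplified stand-in written as a batched map function. It is not the library's create_examples_from_document, which also packs segments up to a block size. The function name make_sop_pairs and the output column names sentence_a and sentence_b are hypothetical, and the sketch assumes that consecutive lines belong to the same document. The label convention (0 for original order, 1 for swapped) follows the sentence_order_label used by ALBERT pretraining.

```python
import random


def make_sop_pairs(batch, rng=random):
    """Turn consecutive lines into (segment A, segment B, order label) examples."""
    lines = [line.strip() for line in batch["text"] if line.strip()]
    sentence_a, sentence_b, labels = [], [], []
    # Walk over non-overlapping pairs of adjacent lines; a trailing odd line is dropped.
    for first, second in zip(lines[0::2], lines[1::2]):
        swapped = rng.random() < 0.5  # swap the order for roughly half of the pairs
        if swapped:
            first, second = second, first
        sentence_a.append(first)
        sentence_b.append(second)
        labels.append(1 if swapped else 0)  # 0 = original order, 1 = swapped
    return {
        "sentence_a": sentence_a,
        "sentence_b": sentence_b,
        "sentence_order_label": labels,
    }
```

It could then be applied with dataset.map(make_sop_pairs, batched=True, remove_columns=dataset.column_names), followed by a second map that calls the tokenizer on the sentence_a and sentence_b columns as a text pair. Note that pairs are only formed within a batch, so lines at batch boundaries are never paired with each other.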
