Sentence Order Prediction - Dataset Creation

Hi all,

I noticed that the LineByLineWithSOPTextDataset is slated for deprecation, with the recommendation to use the Datasets library. Scanning the documentation of Datasets, I couldn’t find a drop-in replacement.

Does such a replacement exist? If not, is there a timeline for the deprecation, and is a replacement planned? Thanks so much!

Warm regards,
Sahil

Hi! This class is deprecated in favor of using:

  • the 🤗 Datasets library to load the dataset
  • the Dataset.map function to apply the tokenization

You can check out the run_mlm.py script to see how it loads a text dataset and then applies tokenization.

It should be possible to use LineByLineWithSOPTextDataset.create_examples_from_document inside the function passed to Dataset.map to get examples in the SOP format.
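
To make the idea concrete, here is a rough sketch of a function in the batched Dataset.map shape (dict of lists in, dict of lists out). It is not the logic of create_examples_from_document: it is a simplified stand-in that pairs consecutive lines and swaps them half of the time. The function name make_sop_examples and the column names sentence_a and sentence_b are made up for illustration; the input column "text" and the convention that a blank line separates documents are assumptions about your data. The label follows the convention 0 = original order, 1 = swapped.

```python
import random


def make_sop_examples(batch, rng=random):
    """Build sentence-order pairs from a batch of text lines.

    Hypothetical helper: takes {"text": [...]} and returns a dict of
    equally long lists, which is the shape a batched map function returns.
    """
    first, second, labels = [], [], []
    lines = [line.strip() for line in batch["text"]]
    # Walk over consecutive line pairs within the batch.
    for a, b in zip(lines, lines[1:]):
        if not a or not b:
            # Assumption: a blank line marks a document boundary,
            # so no pair is built across it.
            continue
        swapped = rng.random() < 0.5
        if swapped:
            a, b = b, a
        first.append(a)
        second.append(b)
        labels.append(1 if swapped else 0)  # 0 = original order, 1 = swapped
    return {
        "sentence_a": first,
        "sentence_b": second,
        "sentence_order_label": labels,
    }


# Small in-memory batch standing in for what a text dataset would provide.
batch = {"text": ["A first.", "A second.", "", "B first.", "B second."]}
examples = make_sop_examples(batch, rng=random.Random(0))
```

With a dataset loaded through the Datasets library, something along the lines of dataset.map(make_sop_examples, batched=True, remove_columns=["text"]) should apply it; removing the original column matters because the function returns a different number of rows than it receives. Note that pairs spanning two batches are lost in this sketch, and tokenization would still be a separate map step.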