Guidance Needed on Choosing the Right Dataset Format for Transformer Model Training

transformers.LineByLineTextDataset is deprecated, and the deprecation message suggests looking at the run_mlm.py script (transformers/examples/pytorch/language-modeling/run_mlm.py on the main branch of huggingface/transformers on GitHub) for ways to preprocess the data.

So you can load the dataset with datasets.load_from_disk, apply the preprocessing from the linked script to it (the .map calls), and then pass the result to Trainer.
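A rough sketch of what that could look like for line-by-line masked language modeling, loosely modeled on the line-by-line branch of run_mlm.py. The "text" column name, the dataset path, the bert-base-uncased checkpoint, and the helper names tokenize_lines and build_mlm_inputs are all assumptions for illustration, so adjust them to your data:

```python
def tokenize_lines(examples, tokenizer, max_length=128):
    """Batched map function: drop blank lines, then tokenize what is left.

    Assumes the raw text lives in a column named "text" (hypothetical).
    """
    lines = [line for line in examples["text"] if line and not line.isspace()]
    return tokenizer(
        lines,
        truncation=True,
        max_length=max_length,
        # Lets the MLM collator avoid masking special tokens.
        return_special_tokens_mask=True,
    )


def build_mlm_inputs(dataset_path, model_name="bert-base-uncased"):
    """Load a dataset saved with save_to_disk and prepare it for Trainer.

    Both arguments are placeholders; imports are local so the helper above
    can be used without the libraries installed.
    """
    from datasets import load_from_disk
    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    dataset = load_from_disk(dataset_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # The function returns fewer rows than it receives (blank lines are
    # dropped), so the original column has to be removed in the same call.
    tokenized = dataset.map(
        tokenize_lines,
        batched=True,
        fn_kwargs={"tokenizer": tokenizer},
        remove_columns=["text"],
    )

    # Pads each batch and applies random masking at training time.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm_probability=0.15
    )
    return tokenized, collator
```

The returned pair would then go to Trainer as train_dataset and data_collator. If load_from_disk gives you a DatasetDict rather than a single Dataset, pass one split (for example tokenized["train"]) instead of the whole object.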
