Guidance Needed on Choosing the Right Dataset Format for Transformer Model Training

mariosasko · December 8, 2023, 3:32pm

transformers.LineByLineTextDataset is deprecated, and the deprecation message points to the run_mlm.py script (examples/pytorch/language-modeling/run_mlm.py on the main branch of huggingface/transformers on GitHub) for ways to preprocess the data.

So, you can use datasets.load_from_disk to load the dataset (it reads a dataset previously written with save_to_disk) and then apply the transforms from the linked script to it (the .map calls) before passing the result to Trainer.
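As a rough illustration of that workflow, here is a minimal sketch modeled loosely on the preprocessing in run_mlm.py, not a copy of it. It assumes the saved dataset is a single split (a Dataset, not a DatasetDict) with a "text" column; the helper name build_train_dataset, the block_size default, and the column name are assumptions for illustration only.

```python
from itertools import chain


def group_texts(batch, block_size=128):
    """Concatenate tokenized examples and cut them into fixed-size blocks.

    `batch` maps column names (e.g. "input_ids") to lists of token lists.
    The trailing remainder shorter than `block_size` is dropped.
    """
    concatenated = {k: list(chain.from_iterable(batch[k])) for k in batch}
    total_length = len(concatenated[next(iter(batch))])
    total_length = (total_length // block_size) * block_size
    return {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }


def build_train_dataset(dataset_path, tokenizer, block_size=128):
    """Hypothetical helper: load, tokenize, and chunk a saved text dataset."""
    # Imported here so the pure-Python helper above works without `datasets`.
    from datasets import load_from_disk

    raw = load_from_disk(dataset_path)  # assumed: a Dataset with a "text" column

    # Drop empty lines, similar in spirit to the old LineByLineTextDataset.
    raw = raw.filter(lambda example: len(example["text"].strip()) > 0)

    # Tokenize in batches and remove the original columns.
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], return_special_tokens_mask=True),
        batched=True,
        remove_columns=raw.column_names,
    )

    # Regroup the token streams into blocks of `block_size` tokens.
    return tokenized.map(
        group_texts, batched=True, fn_kwargs={"block_size": block_size}
    )
```

The returned dataset can then be passed as train_dataset to Trainer, typically together with a data collator such as DataCollatorForLanguageModeling for masked language modeling. If you want strict one-example-per-line behavior instead of concatenated blocks, you could skip the group_texts step and tokenize with truncation instead.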
