Guidance Needed on Choosing the Right Dataset Format for Transformer Model Training

Hi, Hugging Face community! (or dear @mariosasko again :smiling_face_with_three_hearts:)

I’m currently following this tutorial, where the dataset is created as follows:

```python
from transformers import LineByLineTextDataset

# (snippet from the tutorial, arguments abbreviated; values are illustrative)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="path/to/corpus.txt",  # plain-text file, one example per line
    block_size=128,
)
```
This method is straightforward for plain-text files, but I’m working with a dataset in the Hugging Face .arrow format, created using datasets.Dataset.save_to_disk. I noticed that transformers.TextDataset and transformers.LineByLineTextDataset don’t seem to support reading from a Hugging Face dataset folder. The source code is here.

Furthermore, when using the Trainer, it seems to require a transformers.Dataset:

```python
from transformers import Trainer

# (again abbreviated; the keyword arguments are illustrative)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
```

This is where my confusion lies. transformers.Dataset doesn’t allow reading from an HGF dataset folder and doesn’t seem to enable specifying a column (in my case, column='text'). On the other hand, datasets.Dataset doesn’t allow setting a block_size, which seems crucial for my task, and it’s unclear whether it’s compatible with Trainer.

I’m trying to understand which .Dataset class would be the most appropriate for my scenario :thinking:. Should I use transformers.Dataset, adapting it somehow to read from HGF data folders, or is there a way to use datasets.Dataset with the necessary block_size and ensure compatibility with Trainer?

Any guidance or suggestions on how to approach this would be greatly appreciated!

Thank you in advance!

transformers.LineByLineTextDataset is deprecated, and the deprecation message suggests taking a look at the transformers/examples/pytorch/language-modeling/ at main · huggingface/transformers · GitHub scripts for ways to preprocess the data.

So, you can use datasets.load_from_disk to load the dataset and then apply the transforms from the linked script (the .map calls) to it before passing it to Trainer.
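To make that concrete, here is a minimal sketch of the block_size chunking those example scripts apply via Dataset.map. The group_texts helper is adapted from run_clm.py; the tiny hand-written batch below is a stand-in for a batch of already-tokenized examples, so the snippet runs without a tokenizer or saved dataset:

```python
# Sketch of the chunking step from the language-modeling example scripts.
# In a real pipeline this function is passed to Dataset.map(batched=True)
# after a tokenization .map call; here a fake batch illustrates the logic.

block_size = 8  # illustrative; typically 128 or the model's max length

def group_texts(examples):
    """Concatenate tokenized examples, then split into block_size chunks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    # Drop the trailing remainder that doesn't fill a whole block
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()  # causal-LM labels
    return result

# Fake tokenized batch: two short "documents" worth of input_ids
batch = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]}
chunks = group_texts(batch)
print(chunks["input_ids"])  # [[1, 2, 3, 4, 5, 6, 7, 8]] (leftover 3 ids dropped)
```

In your case the flow would be roughly: `dataset = datasets.load_from_disk(...)`, a `.map` call that tokenizes the `text` column, then `.map(group_texts, batched=True)`. The resulting datasets.Dataset can be passed to Trainer directly; Trainer accepts anything that behaves like a torch.utils.data.Dataset, so no transformers-specific dataset class is needed.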
