Type of dataset in Trainer class

suyash21 · July 17, 2020, 8:15pm

Hi, I was going through the documentation and got confused by this snippet:

trainer = Trainer(
    model=model,                  # the instantiated Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=test_dataset     # evaluation dataset
)

I couldn’t understand what type train_dataset should be, or how the target for the loss calculation is selected.
The "Fine-tuning in native TensorFlow 2" section doesn't show a target value either. Am I missing something?
model.fit(train_dataset, epochs=2, steps_per_epoch=115)

Thank you

mikaelsouza · July 18, 2020, 4:50am

For more context, they are talking about this page: https://huggingface.co/transformers/training.html

I also got confused by this bit of the documentation, but I think this code expects datasets like the ones provided by Hugging Face’s NLP package.

I think they are all based on PyTorch’s Dataset class, but I could be mistaken.

Try to use one of the datasets provided by their NLP package and check if it works correctly.

Hope this helps!

valhalla · July 18, 2020, 9:23am

Hi @suyash21, this post has some explanation about the dataset expected by Trainer

sgugger · July 20, 2020, 1:48pm

Trainer is to be used with PyTorch, so in this case the train_dataset needs to be a PyTorch dataset. TFTrainer would expect a TF dataset. The doc page is a bit unclear (types are right in the signature but too short/wrong in the enumeration). I’ll send a fix to this later today.
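
To make the "where does the target come from" part concrete, here is a minimal sketch of a map-style dataset, i.e. an object implementing `__len__` and `__getitem__`, which is the protocol that `torch.utils.data.Dataset` subclasses follow. The class name `ToyClassificationDataset` and the token ids are made up for illustration, and in real code you would normally subclass `torch.utils.data.Dataset` and return tensors. The assumption here is that each item is a dict whose keys match the model's forward arguments and that the target sits under a `labels` key, which the library's sequence classification models use to compute the loss.

```python
class ToyClassificationDataset:
    """Map-style dataset sketch: implements __len__ and __getitem__.

    Hypothetical example class; in practice, subclass torch.utils.data.Dataset.
    """

    def __init__(self, encodings, labels):
        # encodings: dict of equal-length lists, one entry per example,
        # e.g. {"input_ids": [...], "attention_mask": [...]}
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One example: the model inputs plus the target under "labels"
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Made-up, already-tokenized toy data (assumption: ids are placeholders)
encodings = {
    "input_ids": [[101, 2023, 102], [101, 2008, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [0, 1]

train_dataset = ToyClassificationDataset(encodings, labels)
```

So there is no separate target argument: the label travels inside each dataset item alongside the inputs.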
