Dataset expected by Trainer

Isabella · July 9, 2020, 3:31pm

Hello everyone,
I am rewriting some old code to use the new tokenizer syntax and the Trainer class but I believe I am missing something.

This is how I am building the training dataset to be passed to the Trainer constructor:

    import torch
    from torch.utils.data import TensorDataset

    encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    labels = torch.tensor(labels)
    dataset = TensorDataset(encoded_texts['input_ids'], encoded_texts['attention_mask'], labels)

Can you please help me understand what I am doing wrong/missing? When I run trainer.train() I get the following error:
vars() argument must have __dict__ attribute

Thanks in advance!

prajjwal1 · July 9, 2020, 3:54pm

I’ll give you the full picture.

The workflow:

You create an instance of GlueDataset(data_args, tokenizer) and pass it to the Trainer(...) class. In Trainer, you also pass in default_data_collator. The reason is that GlueDataset returns InputFeatures objects, which are HF-specific and cannot be batched by the PyTorch dataloader directly. So default_data_collator takes a List[InputFeatures] and returns a dict. This dict is then used by the dataloader.

So if you use default_data_collator in Trainer with a TensorDataset, it won’t work directly (that’s why you’re getting the error). The error is raised when the dataloader passes the batch to default_data_collator, because each TensorDataset item is a plain tuple rather than an object with attributes. I’d suggest using the default PyTorch collate_fn with your TensorDataset; it should work just fine.
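To make the failure concrete, here is a minimal sketch of the mechanism. It assumes the collator calls vars() on every feature that is not already a dict, which is what the error message suggests. The Features class and the to_dicts helper are hypothetical stand-ins written for this illustration, not part of transformers, and the token ids are made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Features:
    """Hypothetical stand-in for an HF feature object (it has a __dict__)."""
    input_ids: List[int]
    attention_mask: List[int]
    label: int


def to_dicts(features: List[Any]) -> List[Dict[str, Any]]:
    """Mimic the assumed collator step: convert each feature to a dict via vars()."""
    return [f if isinstance(f, dict) else vars(f) for f in features]


# Objects with a __dict__ convert cleanly.
ok = to_dicts([Features([101, 7592, 102], [1, 1, 1], 0)])

# A TensorDataset item is a plain tuple, which has no __dict__,
# so vars() raises a TypeError.
try:
    to_dicts([([101, 7592, 102], [1, 1, 1], 0)])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

After running this, error_message holds the TypeError text raised for the tuple, while ok contains an ordinary dict for the attribute-based feature.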

One more thing:

Make sure the dataloader returns a dict whose keys match the argument names the model’s forward method expects.
Inside _training_step, the inputs are moved to the GPU and then the function calls:
    output = model(**inputs)
So the keyword arguments have to match. If they don’t, you can inherit from Trainer and override that method.
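The keyword-matching requirement can be illustrated without any model at all. In the sketch below, forward is a hypothetical stand-in for a model's forward signature (not a real transformers function), and the batches are made-up values; it only shows how ** unpacking behaves when a key does not match a parameter name.

```python
def forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward method."""
    return {"n_inputs": len(input_ids), "has_labels": labels is not None}


# Keys match the parameter names, so ** unpacking works.
good_batch = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]], "labels": [0]}
output = forward(**good_batch)

# "y" is not a parameter of forward, so the call raises a TypeError.
bad_batch = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]], "y": [0]}
try:
    forward(**bad_batch)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)
```

Renaming the offending key in the collator (here, "y" to "labels") is usually simpler than overriding the training step, assuming the model's argument names are known.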

I hope this answers your question.

Isabella · July 10, 2020, 4:28pm

Thank you very much for your help!

I tried different approaches; currently I am testing a solution that involves writing my own data_collator to pass to the Trainer instance:

    import torch

    def dummy_data_collector(features):
        # Each feature is an (input_ids, attention_mask, label) tuple from the TensorDataset
        batch = {}
        batch['input_ids'] = torch.stack([f[0] for f in features])
        batch['attention_mask'] = torch.stack([f[1] for f in features])
        batch['labels'] = torch.stack([f[2] for f in features])
        return batch

This seems to be working, but I will need to do some more testing.

prajjwal1 · July 11, 2020, 9:02am

I’m glad it worked out.

Buckeyes2019 · September 28, 2020, 8:56am

I am not sure I understand how the Trainer class identifies the target from the features. In the given example in the documentation, Trainer is passed “train_dataset” with no X and Y specified. Is this already assigned somewhere on the dataset object?

    from transformers import BartForSequenceClassification, Trainer, TrainingArguments

    model = BartForSequenceClassification.from_pretrained("facebook/bart-large-cnn")

    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total # of training epochs
        per_device_train_batch_size=2,   # batch size per device during training
        per_device_eval_batch_size=2,    # batch size for evaluation
        warmup_steps=100,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = Trainer(
        model=model,                     # the instantiated Transformers model to be trained
        args=training_args,              # training arguments, defined above
        train_dataset=train_df,          # training dataset
        eval_dataset=eval_df,            # evaluation dataset
    )

sgugger · September 28, 2020, 2:32pm

All Transformers models return the loss when given both the inputs and the labels (the argument is usually named labels). The Trainer therefore expects each element of the dataset you pass to be a dictionary containing all the inputs the model needs in order to return the loss (including those labels).
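As a rough sketch of what such a dataset could look like: the DictDataset class below is a hypothetical name, the token ids are made up, and plain Python lists are used so the example runs without extra dependencies. In real use one would typically subclass torch.utils.data.Dataset and return tensors.

```python
from typing import Dict, List


class DictDataset:
    """Hypothetical map-style dataset: each item is a dict of model inputs plus labels."""

    def __init__(self, encodings: Dict[str, List[List[int]]], labels: List[int]):
        self.encodings = encodings
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        # Pick the idx-th row of every encoded field, then attach the label
        # under the key the model is assumed to expect ("labels").
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Toy encodings in the shape a tokenizer returns when no tensors are requested.
encodings = {
    "input_ids": [[101, 7592, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
dataset = DictDataset(encodings, labels=[0, 1])
first = dataset[0]
```

Here first is a dict with the keys input_ids, attention_mask and labels, so there is no separate X and Y: the target travels inside each item under the labels key.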
