Dataset expected by Trainer

Hello everyone,
I am rewriting some old code to use the new tokenizer syntax and the Trainer class, but I believe I am missing something.

This is how I am building the training dataset to be passed to the Trainer constructor:

    import torch
    from torch.utils.data import TensorDataset

    encoded_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    labels = torch.tensor(labels)
    dataset = TensorDataset(encoded_texts['input_ids'], encoded_texts['attention_mask'], labels)

Can you please help me understand what I am doing wrong/missing? When I run trainer.train() I get the following error:
    vars() argument must have __dict__ attribute

Thanks in advance! :blush:


I’ll give you the full picture.

The workflow:

You create an instance of GlueDataset(data_args, tokenizer) and then pass it to the Trainer(...) class. To the Trainer you also pass default_data_collator. The reason is that GlueDataset returns InputExample objects, which are HF-specific and cannot be used by the PyTorch dataloader directly. So default_data_collator takes a List[InputExample] and returns a dict. This dict is then used by the dataloader.

So in the Trainer, if you pass default_data_collator with a TensorDataset, it won't work directly (that's why you're getting the error). The error is raised when the dataloader passes the batch to default_data_collator. I'd suggest using the default PyTorch collate_fn with your TensorDataset; it would work just fine.
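To see why, here is a minimal sketch (not the actual library source) of what effectively happens when default_data_collator receives items coming from a TensorDataset:

    # default_data_collator tries to turn each item into a dict via vars(),
    # but a TensorDataset yields plain tuples, which have no __dict__:
    item = dataset[0]   # (input_ids, attention_mask, label) as a tuple
    vars(item)          # TypeError: vars() argument must have __dict__ attribute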

One additional thing:

Make sure the dataloader returns a dict whose keys match the keyword arguments the model's forward method expects.
Inside _training_step, the inputs are passed to the function and moved to the GPU, and then the function does:

    output = model(**inputs)

The keyword arguments therefore have to match. In case they don't, you can inherit from Trainer and redefine the method yourself.

I hope this answers your question.


Thank you very much for your help!

I tried different approaches; currently I am testing a solution that involves writing my own data_collator to pass to the Trainer instance:

    import torch

    def dummy_data_collector(features):
        # each feature is the (input_ids, attention_mask, label) tuple
        # yielded by the TensorDataset
        batch = {}
        batch['input_ids'] = torch.stack([f[0] for f in features])
        batch['attention_mask'] = torch.stack([f[1] for f in features])
        batch['labels'] = torch.stack([f[2] for f in features])
        return batch
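
For completeness, this is roughly how I am passing it to the Trainer (dataset is the TensorDataset from my first post; model and training_args are defined as in the docs):

    trainer = Trainer(
        model=model,                          # sequence classification model
        args=training_args,                   # TrainingArguments defined earlier
        train_dataset=dataset,                # the TensorDataset built above
        data_collator=dummy_data_collector,   # the custom collator above
    )
    trainer.train()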

This seems to be working, but I will need to do some more testing :slight_smile:

I’m glad it worked out.

I am not sure I understand how the Trainer class identifies the target from the features. In the example given in the documentation, Trainer is passed "train_dataset" with no X and Y specified. Is this already assigned somewhere on the dataset object?

    from transformers import BartForSequenceClassification, Trainer, TrainingArguments

    model = BartForSequenceClassification.from_pretrained("facebook/bart-large-cnn")

    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total # of training epochs
        per_device_train_batch_size=2,   # batch size per device during training
        per_device_eval_batch_size=2,    # batch size for evaluation
        warmup_steps=100,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = Trainer(
        model=model,                     # the instantiated :hugs: Transformers model to be trained
        args=training_args,              # training arguments, defined above
        train_dataset=train_df,          # training dataset
        eval_dataset=eval_df,            # evaluation dataset
    )

All :hugs: Transformers models will return the loss when fed the inputs and labels (usually named labels). The Trainer thus expects each element of the dataset you pass to be a dictionary containing all the inputs the model needs in order to return the loss (including those labels).
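
In other words, a single element should look something like this (a minimal sketch with made-up values):

    import torch

    # One element of train_dataset: a dict of model inputs, including labels.
    # Keys must match the model's forward signature; the values here are invented.
    example = {
        'input_ids': torch.tensor([0, 713, 16, 10, 1296, 2]),
        'attention_mask': torch.tensor([1, 1, 1, 1, 1, 1]),
        'labels': torch.tensor(1),
    }

Because labels is present, the model computes and returns the loss itself when the Trainer calls model(**batch) on a collated batch of such elements.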