Can you please help me understand what I am doing wrong/missing? When I run trainer.train() I get the following error: vars() argument must have __dict__ attribute
You create an instance of GlueDataset(data_args, tokenizer) and then pass it to the Trainer(...) class, together with default_data_collator. The reason is that GlueDataset returns InputFeatures objects, which are HF-specific and cannot be used by the PyTorch dataloader directly. So default_data_collator takes a List[InputFeatures] and returns a dict, and that dict is what the dataloader yields as a batch.
So if you pass default_data_collator to Trainer together with a TensorDataset, it won’t work directly (that’s why you’re getting the error): the error is raised when the dataloader hands the batch of plain tuples to default_data_collator. I’d suggest using the default PyTorch collate_fn with your TensorDataset; it would work just fine.
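For illustration, a rough sketch of why the error appears with a TensorDataset (toy shapes; default_data_collator converts elements that are not already dicts with vars(), which fails on plain tuples):

import torch
from torch.utils.data import TensorDataset

# each element of a TensorDataset is a plain tuple of tensors, not an object with attributes
dataset = TensorDataset(torch.zeros(4, 8, dtype=torch.long),   # input_ids
                        torch.ones(4, 8, dtype=torch.long),    # attention_mask
                        torch.zeros(4, dtype=torch.long))      # labels
sample = dataset[0]
# default_data_collator roughly does vars(f) for each non-dict element, and
# vars(sample) raises: TypeError: vars() argument must have __dict__ attribute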
One more thing:
Make sure the dataloader returns a dict whose keys match the keyword arguments the model’s forward method expects.
Inside _training_step, the inputs are passed to the function and moved to the GPU, and then the function calls: outputs = model(**inputs)
So the keyword arguments have to match. If they don’t, you can inherit from Trainer and redefine the method yourself.
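For example, a minimal sketch of that approach (the key names here are hypothetical; in recent transformers versions the hook to override is compute_loss, while older versions used _training_step):

from transformers import Trainer

class MyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Hypothetical example: the dataset yields "label" but the model's
        # forward expects "labels", so rename the key before model(**inputs).
        if "label" in inputs:
            inputs["labels"] = inputs.pop("label")
        return super().compute_loss(model, inputs, return_outputs=return_outputs)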
I tried different approaches; currently I am testing a solution that involves writing my own data_collator to pass to the Trainer instance:
import torch

def dummy_data_collector(features):
    # features is a list of (input_ids, attention_mask, labels) tuples from the TensorDataset;
    # stack each field across the batch and return the dict the model's forward expects
    batch = {}
    batch['input_ids'] = torch.stack([f[0] for f in features])
    batch['attention_mask'] = torch.stack([f[1] for f in features])
    batch['labels'] = torch.stack([f[2] for f in features])
    return batch
This seems to be working, but I will need to do some more testing.
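For reference, this is roughly how the collator would plug into the Trainer (a sketch only; train_dataset and eval_dataset here stand for the TensorDatasets built earlier and are assumed names):

trainer = Trainer(
    model=model,                          # the model to fine-tune
    args=training_args,                   # TrainingArguments as usual
    train_dataset=train_dataset,          # TensorDataset of (input_ids, attention_mask, labels)
    eval_dataset=eval_dataset,
    data_collator=dummy_data_collector,   # replaces default_data_collator
)
trainer.train()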
I am not sure I understand how the Trainer class identifies the target from the features. In the given example in the documentation, Trainer is passed “train_dataset” with no X and Y specified. Is this already assigned somewhere on the dataset object?
from transformers import BartForSequenceClassification, Trainer, TrainingArguments

model = BartForSequenceClassification.from_pretrained("facebook/bart-large-cnn")

training_args = TrainingArguments(
    output_dir='./results',            # output directory
    num_train_epochs=3,                # total # of training epochs
    per_device_train_batch_size=2,     # batch size per device during training
    per_device_eval_batch_size=2,      # batch size for evaluation
    warmup_steps=100,                  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                 # strength of weight decay
    logging_dir='./logs',              # directory for storing logs
)

trainer = Trainer(
    model=model,                       # the instantiated Transformers model to be trained
    args=training_args,                # training arguments, defined above
    train_dataset=train_df,            # training dataset
    eval_dataset=eval_df,              # evaluation dataset
)
All Transformers models return the loss when fed the inputs and labels (the labels argument is usually named labels). The Trainer therefore expects each element of the dataset you pass to be a dictionary containing all the inputs the model needs to return the loss, including those labels.
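For instance, a minimal sketch (illustrative texts and labels; the tokenizer choice is an assumption) of a dataset whose elements are such dictionaries:

import torch
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __getitem__(self, idx):
        # each element is a dict with everything forward() needs,
        # including the 'labels' key the model uses to compute the loss
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ClassificationDataset(["first example", "second example"], [0, 1])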