Issues with Trainer class on custom dataset

I am following along with the Sequence Classification on IMDB Dataset example here:
https://huggingface.co/transformers/master/custom_datasets.html

However, I am using a custom dataset. Rather than iterating through the files with read_imdb_split, I load my data from a CSV, extract the sequences and labels, and convert them to lists to pass to the tokenizer() method. From there I wrap the encodings in the subclassed dataset object shown in the documentation.
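Roughly, my loading code looks like the sketch below (the file name, column names, checkpoint, and class name are placeholders, not my exact code):

```python
import pandas as pd
import torch
from transformers import DistilBertTokenizerFast

# Illustrative only: file and column names are placeholders
df = pd.read_csv("emails.csv")
train_texts = df["text"].values.tolist()
train_labels = df["label"].values.tolist()

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)

# Subclassed dataset, copied from the custom-datasets tutorial
class EmailDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = EmailDataset(train_encodings, train_labels)
```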

Next, I create the TrainingArguments and Trainer as shown, and my issue arises when I call trainer.train(). I receive the following error stack trace:


```
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 trainer.train()

~\anaconda3\lib\site-packages\transformers\trainer.py in train(self, model_path)
    512                 self._past = None
    513
--> 514             for step, inputs in enumerate(epoch_iterator):
    515
    516                 # Skip past any already trained steps if resuming training

~\anaconda3\lib\site-packages\tqdm\notebook.py in __iter__(self, *args, **kwargs)
    215     def __iter__(self, *args, **kwargs):
    216         try:
--> 217             for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
    218                 # return super(tqdm...) will not catch exception
    219                 yield obj

~\anaconda3\lib\site-packages\tqdm\std.py in __iter__(self)
   1127
   1128         try:
-> 1129             for obj in iterable:
   1130                 yield obj
   1131                 # Update and possibly print the progressbar.

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

~\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

~\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

<ipython-input> in __getitem__(self, idx)
      5
      6     def __getitem__(self, idx):
----> 7         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
      8         item['labels'] = torch.tensor(self.labels[idx])
      9         return item

AttributeError: 'list' object has no attribute 'items'
```

I figure this is a result of my loading the data differently and converting it to lists, but I pass the datasets, not the lists, to the Trainer, so I am unclear what is causing the error.

The problem is in the indexing into your custom dataset, not in Trainer. If you want help with that, you’ll have to share the code for how it’s built. To debug your problem, try to see if you can do train_dataset[0] (or any other index).
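Something like this (assuming your training split is in a variable called train_dataset):

```python
# Sanity check: indexing into the dataset should return a dict of tensors,
# e.g. {'input_ids': ..., 'attention_mask': ..., 'labels': ...}
sample = train_dataset[0]
print(type(sample))
print(sample)
```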


Thank you for the quick response. I spent the past few minutes typing up a long response with my code, explaining how I built the custom dataset and passed it along to the Trainer, when I noticed a typo in my code. It turns out I was passing just the list of emails, not the encodings, when I called the subclassed dataset class to create my dataset. trainer.train() appears to be functioning correctly now, and I will update this if I run into any further issues during this fine-tuning loop or with my results.
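For anyone who runs into the same thing, the bug was essentially this (using the same illustrative names as in my sketch above):

```python
# What I had by mistake: passing the raw list of emails, so inside
# __getitem__ self.encodings was a list and .items() failed
train_dataset = EmailDataset(train_texts, train_labels)

# What I meant to write: pass the tokenizer output (a dict-like BatchEncoding)
train_dataset = EmailDataset(train_encodings, train_labels)
```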


Hi,
I am also facing the same problem. It would be very helpful if you could help me, because I am still a beginner with LLMs and Hugging Face. I am pasting my code below.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
import datasets
import torch as t
import evaluate
import numpy as np

dataset = datasets.load_dataset("tweet_eval", "emotion")
x_train = dataset["train"]["text"]
y_train = dataset["train"]["label"]
x_test = dataset["test"]["text"]
y_test = dataset["test"]["label"]

def load_LLM(llm, device):
    num_labels = 4
    id2label = {0: "Anger", 1: "Joy", 2: "Optimism", 3: "Sadness"}
    label2id = {"Anger": 0, "Joy": 1, "Optimism": 2, "Sadness": 3}
    model = AutoModelForSequenceClassification.from_pretrained(
        llm, num_labels=num_labels, id2label=id2label, label2id=label2id
    )
    model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(llm)
    return model, tokenizer

llm = "EleutherAI/gpt-neo-2.7B"
device = t.device("cuda" if t.cuda.is_available() else "cpu")
model, tokenizer = load_LLM(llm, device)

train_inputs = tokenizer(x_train, truncation=True)
test_inputs = tokenizer(x_test, truncation=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_inputs,
    eval_dataset=test_inputs,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```

I am getting the exact same error and I don’t know how to resolve it.