Sending a Dataset or DatasetDict to a GPU

Hi, relatively new user of Hugging Face here, trying to do multi-label classification, and basing my code on this example.

I have put my own data into a DatasetDict format as follows:

from datasets import Dataset, DatasetDict

df2 = df[['text_column', 'answer1', 'answer2']].head(1000)
df2['text_column'] = df2['text_column'].astype(str)
dataset = Dataset.from_pandas(df2)

# train/test/validation split
train_testvalid = dataset.train_test_split(test_size=0.1)
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)

# put into DatasetDict
datasets = DatasetDict({
    "train": train_testvalid["train"],
    "test": test_valid["test"],
    "valid": test_valid["train"]})

Later, I load the model using

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id).to(device)

and then check that the model is on the GPU using next(model.parameters()).is_cuda; if I comment out .to(device), the model is not sent to the GPU.
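
For reference, the check looks like this (device is defined earlier in my code, roughly as below):

import torch

# assumed typical definition of device, set earlier in the script
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(next(model.parameters()).is_cuda)  # True after .to(device), False otherwise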

My problem comes when it’s time to train the model as follows:

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"].to(device),
    eval_dataset=encoded_dataset["valid"].to(device),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

which causes the error:

AttributeError: 'Dataset' object has no attribute 'to'

But if I don’t try to send the train and eval datasets to the GPU, I get the error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

So my question is: how can I send these datasets to the GPU - where the model already is - in order to efficiently train and validate the model using them?

Thank you!

By default, the Trainer will use the GPU if one is available. It will automatically put the model on the GPU, as well as each batch as soon as that’s necessary. So just remove all the .to() calls that you made manually.
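
A minimal sketch of the intended flow (train_ds and eval_ds standing in for your datasets):

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")  # no .to(device)
trainer = Trainer(model, args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()  # the Trainer moves the model and each batch to the GPU itself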

Hi! As @BramVanroy pointed out, our Trainer class uses GPUs by default (if they are available from PyTorch), so you don’t need to manually send the model to the GPU. And to fix the issue with the datasets, set their format to torch with .with_format("torch") so that they return PyTorch tensors when indexed.
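
For example:

encoded_dataset = encoded_dataset.with_format("torch")
# indexing now returns PyTorch tensors, which the Trainer
# moves to the right device batch by batch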

Thank you both for responding so quickly.

I can confirm that the GPU is available using torch.cuda.is_available(), and I have also called .set_format("torch") on the Datasets. I have removed all explicit .to() calls.

However, if I remove the explicit .to() call on the model, then the model is no longer on the GPU: next(model.parameters()).is_cuda returns False.

More importantly, I also still get the RuntimeError:
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I was trying to post a minimal example above, but I now suspect that perhaps some of the encoding is done on the GPU? So here is a larger section of code - apologies in advance if this is excessive, but I’m not sure which part is causing the error.

from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  text = examples["answer_no_tags"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

encoded_dataset = datasets.map(preprocess_data, batched=True,
                               remove_columns=dataset.column_names)

Loading the model:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

# batch_size and metric_name are defined earlier in the notebook
args = TrainingArguments(
    "bert-finetuned-sem_eval-english",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics
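
(As a quick sanity check on made-up data - the values here are invented - the function returns perfect scores when the thresholded predictions match the labels:)

dummy_logits = np.array([[2.0, -1.0], [-0.5, 1.5]])
dummy_labels = np.array([[1, 0], [0, 1]])
print(multi_label_metrics(dummy_logits, dummy_labels))
# {'f1': 1.0, 'roc_auc': 1.0, 'accuracy': 1.0}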

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result

Finally, training:

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["valid"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

Which returns:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

Is this the entire code? I can’t find the part where you change the dataset format to torch.

The model will be moved to the GPU after you initialize the trainer - not before that.

You can verify that the trainer will make use of the GPU by checking trainer.args.device. If that is a GPU, then everything the trainer does will correctly use the GPU.
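
For example, after creating the trainer:

print(trainer.args.device)              # e.g. device(type='cuda', index=0)
print(next(model.parameters()).device)  # cuda:0 once the trainer has been initialized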

What I suspect instead is that there is a discrepancy between devices in your custom multi_label_metrics function, which the trainer of course does not control. Check whether predictions and labels are on the same device.

Oh, I think this is a Transformers bug (see When running the Trainer cell, it found two devices (cuda:0 and CPU) · Issue #31 · nlp-with-transformers/notebooks · GitHub). Updating Transformers to the newest version with pip install -U transformers should fix the issue.
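
If in doubt, you can check the installed version first:

import transformers
print(transformers.__version__)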


Thanks for your help @BramVanroy and @mariosasko - much appreciated. I updated Transformers and that fixed the error, so I have marked this as the solution. Cheers.