Training model: RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Dear huggingface community,

I’m trying to fine-tune a phi-2 model following this tutorial: Phinetuning 2.0. Finetune Microsoft’s Phi-2 with QLoRA… | by Geronimo | Medium

I’m using a 4-bit quantized version of phi-2. However, when I start training with trainer.train(), I get the following error (full traceback below):
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
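
For context, the model is loaded roughly like this, following the tutorial (the exact BitsAndBytesConfig values below are my reconstruction and may differ slightly from my notebook):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",                      # place the quantized weights on the GPU
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")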

Specifically, this is how I set up the TrainingArguments and the Trainer:

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch//2,      # eval twice per epoch
    save_steps=steps_per_epoch,         # save once per epoch
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",      # val_loss will go NaN with paged_adamw_8bit
    learning_rate=lr,
    group_by_length=False,
    bf16=True,
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)
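
Training is then started the usual way, and this is the call that fails:

trainer.train()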

Torch is set to cuda, so that by itself isn’t the problem.
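
To be concrete, by “set to cuda” I mean checks like the following pass in my environment (a minimal sanity check, not the exact cell from my notebook):

import torch

print(torch.cuda.is_available())   # True
print(torch.zeros(1).device)       # cuda:0, since the default device is cuda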

I’ve tried running this both locally and on Colab. On Colab, I was using PyTorch 2.2.1 and CUDA 12.1.

Suggestions I found online, like the one below, do not work in my case. I also find it hard to pin down which package is responsible for the error.

torch.utils.data.DataLoader(
    ...,
    generator=torch.Generator(device='cuda'),
)
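
Since Trainer builds its DataLoaders internally, as far as I can tell the only place to inject such a generator would be an override along these lines (untested sketch with a hypothetical subclass name; it also ignores the samplers, workers, and drop_last handling the stock Trainer sets up):

import torch
from torch.utils.data import DataLoader
from transformers import Trainer

class CudaGeneratorTrainer(Trainer):
    def get_train_dataloader(self) -> DataLoader:
        # Build the train dataloader by hand so a CUDA generator can be passed in.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            shuffle=True,
            collate_fn=self.data_collator,
            generator=torch.Generator(device="cuda"),
        )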

Thanks so much for any help!


Full traceback:

RuntimeError                              Traceback (most recent call last)
<ipython-input-52-8d9b9c974db4> in <cell line: 42>()
     40 )
     41 
---> 42 trainer.train()
     43 
     44 

9 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py in __torch_function__(self, func, types, args, kwargs)
     75         if func in _device_constructors() and kwargs.get('device') is None:
     76             kwargs['device'] = self.device
---> 77         return func(*args, **kwargs)
     78 
     79 # NB: This is directly called from C++ in torch/csrc/Device.cpp

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I have the same issue and would be interested in a response as well. I additionally made sure that both the model and the dataset I want to fine-tune on are on ‘cuda’, but that did not help either.
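
For completeness, this is roughly how I checked the placement (simplified; variable names are from my own script):

import torch

print(next(model.parameters()).device)                     # cuda:0
dataset = dataset.with_format("torch", device="cuda")      # keep dataset tensors on the GPU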