Dear Hugging Face community,
I'm trying to fine-tune a Phi-2 model following this tutorial: "Phinetuning 2.0. Finetune Microsoft's Phi-2 with QLoRA…" by Geronimo on Medium.
I'm using a 4-bit quantized version of Phi-2. However, when I call trainer.train(), I get the following error (the full traceback is at the end of this post):
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
This is exactly how I set up the Trainer:
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch // 2,  # eval twice per epoch
    save_steps=steps_per_epoch,  # save once per epoch
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",  # val_loss will go NaN with paged_adamw_8bit
    learning_rate=lr,
    group_by_length=False,
    bf16=True,
    ddp_find_unused_parameters=False,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)
PyTorch is set to use CUDA, so that shouldn't be the problem.
I've tried running this both locally and on Colab. On Colab, I was using PyTorch 2.2.1 and CUDA 12.1.
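In case it helps with reproducing this, here is a minimal sketch for collecting the installed versions of the packages that might be involved. The package list and the helper name `report_versions` are my own assumptions, not something the tutorial prescribes.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed list of packages involved in a QLoRA fine-tuning setup.
PACKAGES = ("torch", "transformers", "accelerate", "peft", "bitsandbytes")


def report_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in the same notebook or environment as the training code should show whether the local and Colab setups differ in any of these packages.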
Fixes suggested online, such as the following, don't work in my case. I also find it hard to work out which package is responsible for the error.
torch.utils.data.DataLoader(
    ...,
    generator=torch.Generator(device='cuda'),
)
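To narrow down which package raises the error, one option is to list every frame of the traceback instead of relying on the collapsed view that Colab shows. The following is only a sketch: `summarize_frames` is a hypothetical helper name, and `failing_call` is a stand-in that raises the same message so the snippet runs without a GPU. In the real notebook, the `try` block would wrap `trainer.train()` instead.

```python
import traceback


def summarize_frames(exc):
    """Return (filename, lineno, function) for every frame of exc's traceback,
    outermost call first."""
    return [
        (frame.filename, frame.lineno, frame.name)
        for frame in traceback.extract_tb(exc.__traceback__)
    ]


def failing_call():
    # Hypothetical stand-in for trainer.train(); it only raises the same message.
    raise RuntimeError("Expected a 'cuda' device type for generator but found 'cpu'")


if __name__ == "__main__":
    try:
        failing_call()
    except RuntimeError as exc:
        for filename, lineno, function in summarize_frames(exc):
            print(f"{filename}:{lineno} in {function}")
```

The file paths in the output show which installed package each frame belongs to, which should make it easier to see where the CPU generator is created.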
Thanks so much for any help!
Full error traceback:
RuntimeError                              Traceback (most recent call last)
<ipython-input-52-8d9b9c974db4> in <cell line: 42>()
     40 )
     41
---> 42 trainer.train()
     43
     44

9 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py in __torch_function__(self, func, types, args, kwargs)
     75         if func in _device_constructors() and kwargs.get('device') is None:
     76             kwargs['device'] = self.device
---> 77         return func(*args, **kwargs)
     78
     79 # NB: This is directly called from C++ in torch/csrc/Device.cpp

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'