Runtime Error: Trainer API Dataloader Using CPU but Expecting CUDA

I’m fine-tuning a model like this:

    import os

    import datasets
    from datasets import DatasetDict
    from transformers import (
        DataCollatorWithPadding,
        EarlyStoppingCallback,
        Trainer,
        TrainingArguments,
    )

    # df_train, df_test, tokenizer, model, and preprocess_function (which
    # tokenizes the 'text' column) are defined elsewhere in the module.

    # Build stratified train/val splits plus a held-out test set
    ds = datasets.Dataset.from_pandas(df_train[['text', 'label']])
    ds = ds.class_encode_column('label')
    ds = ds.train_test_split(test_size=0.2, stratify_by_column='label')
    ds1 = datasets.Dataset.from_pandas(df_test[['text', 'label']]).class_encode_column('label')
    ds = DatasetDict({
        'train': ds['train'],
        'val': ds['test'],
        'test': ds1,
    })

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    tokenized_ds = ds.map(preprocess_function, batched=True)

    early_stop = EarlyStoppingCallback(early_stopping_patience=2)

    output_dir = bucket_dir + f'/llm/condition_models/{__version__}/{self.fingerprint}'
    os.makedirs(output_dir, exist_ok=True)

    # Train and store model
    training_args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        per_device_train_batch_size=64,
        per_device_eval_batch_size=64,
        num_train_epochs=50,
        weight_decay=0.01,
        load_best_model_at_end=True,
        # dataloader_pin_memory=False,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_ds['train'],
        eval_dataset=tokenized_ds['val'],
        tokenizer=tokenizer,
        data_collator=data_collator,
        callbacks=[early_stop],
    )
    trainer.train()

which produces the following error:

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Indeed, when I check the device attribute in question:

trainer.get_train_dataloader().batch_sampler.sampler.generator.device

it shows “cpu”, even though CUDA is available and I call torch.set_default_tensor_type('torch.cuda.FloatTensor') at the top of my module.

I tried overwriting the device on the generator, and I tried replacing the sampler entirely, but neither is allowed.
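
For reference, the attempts looked roughly like this (an illustrative sketch; the exact error messages may differ by version):

    import torch
    from torch.utils.data import RandomSampler

    dl = trainer.get_train_dataloader()

    # Attempt 1: point the existing generator at CUDA
    dl.batch_sampler.sampler.generator.device = torch.device('cuda')
    # -> AttributeError: the device attribute of a torch.Generator is not writable

    # Attempt 2: swap in a sampler built around a CUDA generator
    dl.sampler = RandomSampler(tokenized_ds['train'],
                               generator=torch.Generator(device='cuda'))
    # -> ValueError: DataLoader forbids setting `sampler` after it is initialized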

I am using transformers==4.28.1 and torch==2.0.0.

I’m not sure where to go from here. Advice much appreciated.

Hi. I am hitting the same error with the latest versions of the transformers and trl libraries.
I checked trainer.get_train_dataloader().device and it shows “cuda”, but trainer.get_train_dataloader().generator is None.
Not sure how to proceed.
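
Concretely, the checks I ran (attribute names as above; output as observed):

    dl = trainer.get_train_dataloader()
    print(dl.device)     # cuda
    print(dl.generator)  # None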

RuntimeError                              Traceback (most recent call last)
<ipython-input-77-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

[... 10 intermediate frames omitted ...]
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py in __torch_function__(self, func, types, args, kwargs)
     75         if func in _device_constructors() and kwargs.get('device') is None:
     76             kwargs['device'] = self.device
---> 77         return func(*args, **kwargs)
     78 
     79 # NB: This is directly called from C++ in torch/csrc/Device.cpp

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

OK, finally figured it out. For anyone else who stumbles on this error in the future: it usually happens if you have set

    torch.set_default_device('cuda')

somewhere in your code (torch.set_default_tensor_type('torch.cuda.FloatTensor'), as in the original post, triggers it the same way). The Trainer's random sampler builds a plain CPU torch.Generator and passes it to torch.randperm; with the default device forced to CUDA, randperm tries to allocate its output on CUDA and rejects the CPU generator, which is exactly the frame shown in the traceback above.
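
To make the failure concrete, here is a minimal reproduction on a CUDA machine; it mirrors what happens inside the sampler and does not involve the Trainer at all:

    import torch

    torch.set_default_device('cuda')

    g = torch.Generator()            # CPU generator, like the one the sampler creates
    torch.randperm(10, generator=g)  # RuntimeError: Expected a 'cuda' device type
                                     # for generator but found 'cpu'

And a minimal sketch of the fix, assuming a standard single-GPU setup: drop the global default-device call and place the model explicitly instead; the Trainer moves each batch to the model's device on its own. (model, training_args, and the datasets here are the ones from the snippet in the original post.)

    import torch
    from transformers import Trainer

    # torch.set_default_device('cuda')   # <-- remove this line

    # Optional: Trainer moves the model to the available device itself,
    # but placing it explicitly does no harm.
    model.to('cuda' if torch.cuda.is_available() else 'cpu')

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_ds['train'],
        eval_dataset=tokenized_ds['val'],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    trainer.train()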