Runtime Error: Trainer API Dataloader Using CPU but Expecting CUDA

tobyrm · May 5, 2023, 4:26am

I’m fine-tuning a model like this:

            ds = datasets.Dataset.from_pandas(df_train[['text', 'label']])
            ds = ds.class_encode_column('label')
            ds = ds.train_test_split(test_size=0.2, stratify_by_column='label')
            ds1 = datasets.Dataset.from_pandas(df_test[['text', 'label']]).class_encode_column('label')
            ds = DatasetDict({
                'train': ds['train'],
                'val': ds['test'],
                'test': ds1,
            })

            data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
            tokenized_ds = ds.map(preprocess_function, batched=True)

            early_stop = EarlyStoppingCallback(early_stopping_patience=2)

            output_dir = bucket_dir + f'/llm/condition_models/{__version__}/{self.fingerprint}'
            os.makedirs(output_dir, exist_ok=True)

            # Train and store model
            training_args = TrainingArguments(
                output_dir=output_dir,
                learning_rate=2e-5,
                evaluation_strategy='epoch',
                save_strategy='epoch',
                per_device_train_batch_size=64,
                per_device_eval_batch_size=64,
                num_train_epochs=50,
                weight_decay=0.01,
                load_best_model_at_end=True,
                # dataloader_pin_memory=False,
            )
            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=tokenized_ds["train"],
                eval_dataset=tokenized_ds["val"],
                tokenizer=tokenizer,
                data_collator=data_collator,
                callbacks=[early_stop],
            )
            trainer.train()

which produces the following error:

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Indeed, when I check the device attribute in question:

trainer.get_train_dataloader().batch_sampler.sampler.generator.device

it shows “cpu”, even though CUDA is available and I call torch.set_default_tensor_type('torch.cuda.FloatTensor') at the top of my module.

I tried overwriting the device on the generator and tried overwriting the sampler, but neither is allowed.

I am using transformers==4.28.1 and torch==2.0.0.

I’m not sure where to go from here. Advice much appreciated.

inkognito1982 · December 12, 2023, 12:10pm

Hi. I am getting the same error with the latest versions of the transformers and trl libraries.
I checked trainer.get_train_dataloader().device and it shows “cuda”, but trainer.get_train_dataloader().generator is None.
I’m not sure how to proceed.

RuntimeError                              Traceback (most recent call last)
<ipython-input-77-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

10 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py in __torch_function__(self, func, types, args, kwargs)
     75         if func in _device_constructors() and kwargs.get('device') is None:
     76             kwargs['device'] = self.device
---> 77         return func(*args, **kwargs)
     78 
     79 # NB: This is directly called from C++ in torch/csrc/Device.cpp

RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

inkognito1982 · December 22, 2023, 9:42pm

OK, I finally figured it out. For anyone who stumbles upon this error in the future: it usually happens when the default device has been set globally with:

torch.set_default_device('cuda')
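
A minimal sketch of one way around this, assuming the fix is simply to drop the global default-device call and place the model on the GPU explicitly. The torch.set_default_tensor_type('torch.cuda.FloatTensor') line in the first post probably has a similar effect, though that is an assumption rather than something confirmed in this thread. The toy Linear model and TensorDataset below are hypothetical stand-ins for the real model and tokenized dataset.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Choose the device explicitly instead of calling
# torch.set_default_device('cuda') at the top of the module.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the real model and tokenized dataset.
model = torch.nn.Linear(4, 2).to(device)
dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))

# The sampler's generator is left on the CPU; it is only used to
# shuffle indices, so it does not need to live on the GPU.
generator = torch.Generator()
generator.manual_seed(0)
sampler = RandomSampler(dataset, generator=generator)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

seen = 0
for features, labels in loader:
    # Move each batch to the model's device before the forward pass.
    logits = model(features.to(device))
    seen += logits.shape[0]
```

With Trainer itself, the explicit .to(device) calls are normally unnecessary, since it moves the model and each batch to the available GPU on its own; the point of the sketch is only that nothing requires a global default device.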
