I’m fine-tuning a model like this:
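import os
import datasets
from datasets import DatasetDict
from transformers import DataCollatorWithPadding, EarlyStoppingCallback, Trainer, TrainingArguments

# Build a stratified train/validation split from the training frame
ds = datasets.Dataset.from_pandas(df_train[['text', 'label']])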
ds = ds.class_encode_column('label')
ds = ds.train_test_split(test_size=0.2, stratify_by_column='label')
ds1 = datasets.Dataset.from_pandas(df_test[['text', 'label']]).class_encode_column('label')
ds = DatasetDict({
    'train': ds['train'],
    'val': ds['test'],
    'test': ds1,
})
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
tokenized_ds = ds.map(preprocess_function, batched=True)
early_stop = EarlyStoppingCallback(early_stopping_patience=2)
output_dir = bucket_dir + f'/llm/condition_models/{__version__}/{self.fingerprint}'
os.makedirs(output_dir, exist_ok=True)
# Train and store model
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=50,
    weight_decay=0.01,
    load_best_model_at_end=True,
    # dataloader_pin_memory=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["val"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[early_stop],
)
trainer.train()
which produces the following error:
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Indeed, when I check the device attribute in question:
trainer.get_train_dataloader().batch_sampler.sampler.generator.device
it shows “cpu”, even though CUDA is available and I call torch.set_default_tensor_type('torch.cuda.FloatTensor') at the top of my module.
I tried overwriting the device on the generator and tried overwriting the sampler, but neither is allowed.
I am using transformers==4.28.1 and torch==2.0.0.
I’m not sure where to go from here. Any advice would be much appreciated.