@valhalla
I encountered this error while trying to utilize multiple gpu for training:
Traceback (most recent call last):
File "run_finetune_gpt2.py", line 158, in <module>
main()
File "run_finetune_gpt2.py", line 145, in main
trainer.train()
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
tr_loss += self._training_step(model, inputs, optimizer)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
outputs = model(**inputs)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
return self.gather(outputs, self.output_device)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
return gather(outputs, output_device, dim=self.dim)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: You can sync this run to the cloud by running:
wandb: wandb sync wandb/dryrun-20200914_134757-1sih3p0q
Any thoughts about how to correct it?