Training GPT2 on CPUs?

I want to make a pseudo-GPT-3 (GPT-2 with more layers), but it’s too large for my 2×V100. When I try to train on CPU instead, it gives me this error:

Traceback (most recent call last):

  File "examples/language-modeling/", line 336, in <module>
  File "examples/language-modeling/", line 300, in main
  File "/home/ksjae/.local/lib/python3.7/site-packages/transformers/", line 741, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/ksjae/.local/lib/python3.7/site-packages/transformers/", line 1055, in training_step
  File "/home/ksjae/.local/lib/python3.7/site-packages/torch/cuda/amp/grad_scale", line 156, in scale
    assert outputs.is_cuda

Seems like you tried to use AMP, which requires CUDA. You should not use that flag when running in CPU mode.
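For context: `torch.cuda.amp`’s `GradScaler.scale()` asserts `outputs.is_cuda`, which is exactly the assert in the traceback above, so mixed precision can only be enabled when a CUDA device is present. A minimal sketch of gating the flag (the helper name `use_amp` is mine, not part of transformers):

```python
def use_amp(fp16_requested: bool, cuda_available: bool) -> bool:
    # AMP (torch.cuda.amp) only operates on CUDA tensors; on a
    # CPU-only machine we must fall back to full fp32 training.
    return fp16_requested and cuda_available

# On a CPU-only box, requesting fp16 should effectively be ignored:
print(use_amp(True, cuda_available=False))   # False -> train in fp32
```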

What’s AMP?

Oh, and using a block_size of 2048 causes this error:

    result = self.forward(*input, **kwargs)
  File "/home/ksjae/.local/lib/python3.7/site-packages/transformers/", line 594, in forward
    position_embeds = self.wpe(position_ids)
  File "/home/ksjae/.local/lib/python3.7/site-packages/torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ksjae/.local/lib/python3.7/site-packages/torch/nn/modules/", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ksjae/.local/lib/python3.7/site-packages/torch/nn/", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

You are probably using --fp16, which turns on AMP (automatic mixed precision), and AMP is not supported on CPU.
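The IndexError with block_size 2048 is a separate issue from AMP, though: the traceback fails in `self.wpe(position_ids)`, GPT-2’s learned position-embedding table, which has `n_positions` rows (1024 in the stock config). Any position id at or beyond that indexes past the table, just like `torch.embedding` does. A rough sketch of the constraint (the helper `check_block_size` is hypothetical, not a transformers API):

```python
def check_block_size(block_size: int, n_positions: int = 1024) -> int:
    # GPT-2's wpe embedding has n_positions rows, so sequence
    # positions >= n_positions raise IndexError, matching the
    # "index out of range in self" traceback above.
    if block_size > n_positions:
        raise IndexError(
            f"block_size {block_size} exceeds n_positions {n_positions}: "
            f"position ids {n_positions}..{block_size - 1} are out of range"
        )
    return block_size

check_block_size(1024)    # fine: positions 0..1023 all exist
# check_block_size(2048)  # raises IndexError, like the traceback
```

To actually use longer blocks you would need a model configured (and trained) with a larger `n_positions`, not just a bigger `--block_size` flag.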
