TransformerXL grad can be implicitly created only for scalar outputs

I am trying to run the language-modeling script with TransformerXL (transfo-xl-wt103), but I get the following error:

0% 0/10170 [00:00<?, ?it/s]/usr/local/lib/python3.6/dist-packages/transformers/ UserWarning: This overload of nonzero is deprecated:
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
indices_i = mask_i.nonzero().squeeze()
Traceback (most recent call last):
File "language-modeling/", line 352, in
File "language-modeling/", line 321, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/usr/local/lib/python3.6/dist-packages/transformers/", line 775, in train
tr_loss += self.training_step(model, inputs)
File "/usr/local/lib/python3.6/dist-packages/transformers/", line 1126, in training_step
File "/usr/local/lib/python3.6/dist-packages/torch/", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/", line 126, in backward
grad_tensors_ = _make_grads(tensors, grad_tensors)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/", line 50, in _make_grads
raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
0% 0/10170 [00:00<?, ?it/s]

I haven't changed the original run_clm code. I am using:

!python language-modeling/ \
  --output_dir='/content/drive/My Drive/XL-result' \
  --train_file='/content/drive/My Drive/train.txt' \
  --validation_file='/content/drive/My Drive/test.txt' \
  --learning_rate 5e-5 \
  --seed 42 \
  --block_size 125

Any help would be much appreciated!

Have you read this?

If you haven't changed the run_clm code, something else must be different. What versions of Python, PyTorch, and transformers are you using?

Are you using CPU, GPU, or TPU? Are you using DataParallel?
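I ask about DataParallel because this error usually means `.backward()` was called on a tensor that isn't a scalar, and nn.DataParallel is a common way that happens: each replica returns its own loss, so the gathered loss is a vector. A minimal sketch reproducing the error and one common fix (this is not the actual Trainer code, just an illustration):

```python
import torch

# Calling .backward() on a non-scalar tensor without an explicit
# gradient argument raises the same RuntimeError as in the traceback.
x = torch.randn(4, requires_grad=True)
loss = x * 2  # shape (4,): a vector, not a scalar

try:
    loss.backward()
except RuntimeError as e:
    print(e)  # grad can be implicitly created only for scalar outputs

# Reducing the loss to a scalar first (e.g. with .mean()) works. This
# is the usual remedy when DataParallel gathers per-replica losses
# into a vector.
loss = (x * 2).mean()
loss.backward()
print(x.grad)  # d(mean(2x))/dx_i = 2/4 = 0.5 for each element
```

If you are on multiple GPUs, checking whether the loss is reduced to a scalar before `backward()` is a good first step.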

@lcrivell were you able to solve this problem? I am also getting this error while fine-tuning bert-base-cased on the MNLI dataset.

@rgwatwormhill how can this be used with the Trainer?