Summarization task fails with ProphetNet

I tried to generate summaries for CNN/DM or XSUM with ProphetNet by running the following commands (based on the scripts at https://github.com/huggingface/transformers/tree/master/examples/seq2seq):

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export OUTPUT_DIR=output/$DATA-prophetnet

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name microsoft/prophetnet-large-uncased-cnndm  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 32 \
    --task summarization_cnndm

Then I received the following error messages:

Index < srcSelectDimSize` failed.
  0%|                                                                                              | 0/180 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "run_distributed_eval.py", line 281, in <module>
    run_generate()
  File "run_distributed_eval.py", line 213, in run_generate
    **generate_kwargs,
  File "run_distributed_eval.py", line 123, in eval_data_dir
    **generate_kwargs,
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/transformers/generation_utils.py", line 483, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/transformers/generation_utils.py", line 85, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/transformers/models/prophetnet/modeling_prophetnet.py", line 1225, in forward
    hidden_states, attn_probs = encoder_layer(hidden_states, attention_mask=extended_attention_mask)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/transformers/models/prophetnet/modeling_prophetnet.py", line 1051, in forward
    attention_mask=attention_mask,
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/transformers/models/prophetnet/modeling_prophetnet.py", line 652, in forward
    query_states = self.query_proj(hidden_states) / (self.head_dim ** 0.5)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/rachelzheng/acl/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
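
The `srcSelectDimSize` assertion at the top of the log suggests an out-of-range index on the device (for example, a token or position id larger than an embedding table), and the later CUBLAS/NCCL errors may just be fallout from that earlier assert, since CUDA kernels launch asynchronously. One way I could try to get a traceback that points at the real failing op is to rerun with synchronous launches (a sketch, reusing the same command; `CUDA_LAUNCH_BLOCKING` is a standard PyTorch/CUDA debugging variable and makes everything much slower):

```shell
# Force synchronous kernel launches so the Python traceback lands on the
# operation that actually raised the device-side assert (debugging only).
export CUDA_LAUNCH_BLOCKING=1
python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py \
    --model_name microsoft/prophetnet-large-uncased-cnndm \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 32 \
    --task summarization_cnndm
```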

Since this generation setup (python -m torch.distributed.launch --nproc_per_node=2 run_distributed_eval.py) works for other models such as BART and PEGASUS, I am not sure why it fails with ProphetNet.
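
To separate a ProphetNet-specific generation bug from a multi-GPU/NCCL problem, a simpler run I could try first is a single process with a small batch, using only the flags from the command above (a sketch; whether `--nproc_per_node=1` fully bypasses the NCCL path depends on the script):

```shell
# Single process, smaller batch: if this still crashes, the failure is in
# ProphetNet generation itself rather than in the distributed setup.
python -m torch.distributed.launch --nproc_per_node=1 run_distributed_eval.py \
    --model_name microsoft/prophetnet-large-uncased-cnndm \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 4 \
    --task summarization_cnndm
```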