Masked language model for BART (Not BERT)

Hi, I’m trying to train a BART model using masking (MLM).
The model type is BartForConditionalGeneration. The task I have is text generation (key phrases) from an input text.

Before trying it on a custom dataset, I wanted to try it on the official Hugging Face example here, which is in fact similar to the Hugging Face GitHub example.

To save space and not paste the entire code as is, I changed the model to one suited for my task that I found on Hugging Face. [Everything else is the same, plus enabling this variable to help with CUDA stack debugging: os.environ['CUDA_LAUNCH_BLOCKING'] = "1"]

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
to

model_checkpoint = "memray/bart-wikikp"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
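
For completeness, here is the debug flag mentioned above as a small sketch; the comment reflects my understanding, so treat it as an assumption rather than the example's own code:

import os

# Set before any CUDA work so kernel launches run synchronously and the
# failing op shows up at the right place in the stack trace.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"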

Based on the provided documentation, this unsupervised approach is viable if one wants to fine-tune the model for a specific domain: before fine-tuning, masked language modelling helps acquaint the model with the new corpus first. Also, the documentation does not mention that this is restricted to specific tasks, e.g. only applicable to QA (question answering) or text classification.

Please note that in the first link above the IMDB dataset is used.
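
The masking itself comes from the data collator in that example. A minimal sketch of that setup, assuming the same DataCollatorForLanguageModeling call as the course notebook, just pointed at the BART checkpoint above:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Assumption: same collator as the linked example, with the BART checkpoint swapped in.
tokenizer = AutoTokenizer.from_pretrained("memray/bart-wikikp")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)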

The errors I get are CUDA related (using a GPU) and appear when training:

trainer.train()

Error:
***** Running training *****
Num examples = 10000
Num Epochs = 3
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 939
0%| | 0/939 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [42,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [42,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/conda/conda-bld/pytorch_1646755953518/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [42,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.

Traceback (most recent call last):
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
trainer.train()
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/trainer.py", line 1413, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/trainer.py", line 2345, in training_step
loss = self.compute_loss(model, inputs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/trainer.py", line 2377, in compute_loss
outputs = model(**inputs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 1368, in forward
return_dict=return_dict,
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 1229, in forward
return_dict=return_dict,
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 850, in forward
output_attentions=output_attentions,
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 327, in forward
output_attentions=output_attentions,
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/transformers/models/bart/modeling_bart.py", line 191, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/haddad/.conda/envs/hugg/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

If I use the Accelerate version instead, I get:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Everything I read says that this happens when there is a mismatch between the model head, the input tensor, or other tensors. It seems BART needs something extra to be adapted for MLM?!

[UPDATE]
Commenting out the data_collator argument gets trainer.train() to work; otherwise CUDA runs into lots of issues, with or without the Accelerate version.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    # data_collator=data_collator,
)

One thing I suspect is that the model's input embedding is vocab size (50265) x 1024, whereas the IMDB data is chunked into sequences of 128.
How can I change the model to adapt to that input dimension? 1024 → 128

Hi @ahadda5,

Is there a config setting that is different? For example, the config's max length or hidden layer dimension.

HF bart config docs
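
A quick way to check is to load the config and print the relevant fields; a rough sketch (attribute names as in BartConfig):

from transformers import AutoConfig

# Compare these against the tokenizer size and the 128-token chunks.
config = AutoConfig.from_pretrained("memray/bart-wikikp")
print(config.vocab_size)               # number of embedding rows
print(config.d_model)                  # hidden size (likely the 1024 seen above)
print(config.max_position_embeddings)  # longest sequence the model accepts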

Also, if you want to build BART for masked LM, add a final layer on top that projects the hidden states to your output dimension (the vocab).

For example, in BertForMaskedLM, the class BertLMPredictionHead(nn.Module) maps the hidden dimension to the output (vocab) dimension:

self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

You can take a hint from BertForMaskedLM's layer structure to build a BART masked LM.
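
A rough sketch of that kind of head (not the actual BertLMPredictionHead, just the projection idea, with hypothetical names):

import torch.nn as nn

class SimpleLMHead(nn.Module):
    """Project final hidden states onto the vocabulary for MLM scoring."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states):
        # (batch, seq_len, hidden_size) -> (batch, seq_len, vocab_size)
        return self.decoder(hidden_states)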

Hope this helps.

Thanks for your reply. I will investigate further.
But from debugging the BART beast:
the encoder layer dims are consistent with the decoder layers,
and the labels are consistent with the input ids for masking.

I'll take a look at BERT.

Okay, so the issue was that the model used has a vocab size of 50264, while the tokenizer has a size of 50265!

So I had to resize_token_embeddings on the model to match the tokenizer's size!
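Roughly, the fix looks like this (a sketch, assuming model and tokenizer are the ones loaded above):

# Grow the embedding matrix to match the tokenizer's vocabulary size.
model.resize_token_embeddings(len(tokenizer))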
Thanks @sgugger, @cog for the guidance. Happy coding :slight_smile: !
