I’m having all sorts of issues training transformers on my 3090 - this card requires CUDA 11.1, which in turn requires torch 1.7.1.
Is this supported? I’m using Python 3.7 (I tried 3.9, but there’s no wheel for one of the dependencies of `datasets` and it wouldn’t build, so I rolled back).
```
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```
`RobertaForMaskedLM` frequently crashes with CUDA exceptions. If this should work, I’ll create specific issues.
OK, so I tracked down the crash. The problem was the position embedding. I had
`max_seq_length == max_position_embeddings`, which results in a position index > `max_position_embeddings` for any sequence that is truncated (i.e. contains no padding).
This is because the code from `modeling_roberta.py` below adds `padding_idx` to the cumsum - if none of the `input_ids` are padding, the largest index will be > `max_seq_length`:
```python
mask = input_ids.ne(padding_idx).int()
incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
return incremental_indices.long() + padding_idx
```
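To make the failure concrete, here is a torch-free, pure-Python sketch of that logic (the list-based helper `create_position_ids` is my own reimplementation for illustration; `padding_idx = 1` is assumed, matching RoBERTa):

```python
# Pure-Python sketch of the cumsum-and-offset position-id computation
# above (assumption: padding_idx = 1, as in RoBERTa). Reimplemented
# with lists so it runs without torch.
def create_position_ids(input_ids, padding_idx=1):
    mask = [0 if tok == padding_idx else 1 for tok in input_ids]
    position_ids, running = [], 0
    for m in mask:
        running += m  # cumulative count of non-padding tokens
        # padded slots collapse to padding_idx; real tokens get offset indices
        position_ids.append(running * m + padding_idx)
    return position_ids

# A sequence truncated to max_seq_length = 512 contains no padding, so
# the largest position id is 512 + padding_idx = 513 - out of range for
# an embedding table sized max_position_embeddings = 512.
full = create_position_ids([100] * 512)
print(max(full))  # 513

# A padded sequence stays smaller: padded slots map to padding_idx.
padded = create_position_ids([100, 100, 1, 1])
print(padded)  # [2, 3, 1, 1]
```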
It’s hard to know exactly what’s going on without seeing your code, but here is what I can share about RoBERTa. You should not use `max_position_embeddings` as a maximum sequence length. Because the position IDs of RoBERTa go up to `maximum_sequence_length + padding_index`, `max_position_embeddings` is purposely set to 514 (2, the padding index, + 512, the maximum sequence length). You should use `tokenizer.model_max_length` instead (which should be 512).
Yes, that’s what I worked out. I’m not sure why the position IDs go up to `maximum_sequence_length + padding_index` though (I actually think they start at `padding_index + 1`)? You could set the `padding_idx` of the `position_ids` embedding to 0 (i.e. different to the padding index of the `input_ids`) and, as far as I can see, it would still work fine, since the actual indices are always > 0 and there would be no chance of an index out of range (which doesn’t produce a meaningful exception from CUDA).
Also, did the original BERT / RoBERTa not use sinusoidal position embeddings? Or was that a later addition?
This is all to mimic the original implementation of RoBERTa. So no, RoBERTa does not use sinusoidal position embeddings. That’s also why we can’t change the `padding_index` for the `position_ids` - it would break compatibility with the pretrained models.
Thanks, that makes sense - although I used the value of 512 for the position embedding size based on `PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES` in the source - this is also the default for `BertConfig` from what I can see.
Plus it still crashes when `max_position_embeddings = model_max_length + padding_index`; I think it should be `max_position_embeddings = model_max_length + padding_index + 1` (since the indices start from 1, not zero).
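Spelled out with concrete numbers (assuming `padding_idx = 1` and `model_max_length = 512`, as for a roberta-base-style setup), the sizing works out like this:

```python
# Sizing arithmetic for the RoBERTa position embedding table
# (assumed values: model_max_length = 512, padding_idx = 1).
model_max_length = 512  # tokenizer.model_max_length
padding_idx = 1         # RoBERTa's pad token id

# Position ids run from padding_idx + 1 up to model_max_length + padding_idx.
largest_position_id = model_max_length + padding_idx  # 513

# Embedding indices are zero-based, so the table needs one more row
# than the largest index it must hold.
max_position_embeddings = largest_position_id + 1
print(max_position_embeddings)  # 514
```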
One thing to note is that `tokenizer.model_max_length` doesn’t work - it’s always set to `int(1e30)`. The reason is that tokenization_utils_base.py#L1857 only sets `model_max_length` for the hard-coded list of pre-trained Hugging Face tokenizers. If you train your own, I don’t see any way of setting it from config when using `AutoTokenizer.from_pretrained()`.
When I was using the run_mlm.py script to train a RoBERTa model, I also had the issue of not being able to set `max_seq_length = 512`. Thanks to this information on
`max_position_embeddings`, I had to set the `--config_overrides` argument to `"max_position_embeddings=514"` when executing run_mlm.py with `max_seq_length = 512`. The error has not shown up a few training steps in; my model training is still in progress.
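For reference, a run_mlm.py invocation along these lines would look roughly as follows - the tokenizer path, data file, and output directory are placeholders, not from the original post:

```shell
# Sketch of a run_mlm.py launch with the position-embedding override
# discussed above; paths are placeholders.
python run_mlm.py \
    --model_type roberta \
    --tokenizer_name ./my-tokenizer \
    --train_file ./data/train.txt \
    --max_seq_length 512 \
    --config_overrides "max_position_embeddings=514" \
    --do_train \
    --output_dir ./roberta-mlm-out
```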