PyTorch version

I’m having all sorts of issues training transformers on my 3090. This card requires CUDA 11.1, which in turn requires torch 1.7.1.

Is this supported? I’m using python 3.7 (I tried 3.9 but there’s no wheel for one of the dependencies for datasets and it wouldn’t build so I rolled back).

I installed it with pip install torch==1.7.1+cu110 -f

Training of RobertaForMaskedLM frequently crashes with CUDA exceptions. If this should work I’ll create specific issues.

OK, so I tracked down the crash. The problem was the position embedding. I had max_seq_length == max_position_embeddings, which results in a position index > max_position_embeddings for any sequence that is truncated.

This is because create_position_ids_from_input_ids, shown below, adds padding_idx to the cumsum. If there are no masked input_ids, the largest index will be > max_seq_length.

mask =
incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
return incremental_indices.long() + padding_idx
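
To make the overflow concrete, here is a minimal plain-Python sketch of the same cumsum logic for a single sequence. It is not the library code: the function name position_ids_from_input_ids, the token id 5, and the assumption that padding_idx is 1 are all chosen for illustration only.

```python
from itertools import accumulate

def position_ids_from_input_ids(input_ids, padding_idx):
    """Plain-Python mirror of the cumsum logic for one sequence (illustrative only)."""
    # 1 for real tokens, 0 for padding tokens
    mask = [int(tok != padding_idx) for tok in input_ids]
    # Running count of real tokens seen so far
    cumsum = list(accumulate(mask))
    # Padding positions collapse to padding_idx; real tokens start at padding_idx + 1
    return [c * m + padding_idx for c, m in zip(cumsum, mask)]

PAD = 1  # assumption: padding index of 1

# A truncated sequence: 512 real tokens, no padding at all
full_ids = position_ids_from_input_ids([5] * 512, PAD)
max_pos = max(full_ids)
min_pos = min(full_ids)

# A short sequence followed by two padding tokens
padded_ids = position_ids_from_input_ids([5, 6, 7, PAD, PAD], PAD)

print(min_pos, max_pos)
print(padded_ids)
```

With these assumptions the largest index for a full 512-token sequence is 513, so an embedding table with only 512 rows cannot hold it.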

Hi there!
It’s hard to know exactly what’s going on without seeing your code but here is what I can share about RoBERTa. You should not use max_position_embeddings as a maximum sequence length. Because the position IDs of RoBERTa go from padding_index to maximum_sequence_length + padding_index, this max_position_embeddings is purposely set to 514 (2, the padding index + 512, the maximum sequence length). You should use tokenizer.model_max_length instead (which should be 512).


Yes, that’s what I worked out. I’m not sure why the position IDs go from padding_index to maximum_sequence_length + padding_index, though (I actually think they start at padding_index + 1). You could set the padding_index of the position_ids embedding to 0 (i.e. different from the padding index of the input_ids). As far as I can see it still works fine, since the actual indices are always > 0 and there’s no chance of an index out of range (which doesn’t produce a meaningful exception from CUDA).

Also did the original Bert / RoBERTa not use sinusoidal position embeddings? Or was that a later addition?

This is all to mimic the original implementation of RoBERTa. So no, RoBERTa does not use sinusoidal position embeddings. That’s also why we can’t change the padding_index for the position_ids, as it would break compatibility with the pretrained models.

Thanks, that makes sense. However, I used the value of 512 for the position embedding size based on PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES from source. This is also the default for BertConfig from what I can see.

Plus it still crashes when max_position_embeddings = model_max_length + padding_index. I think it should be max_position_embeddings = model_max_length + padding_index + 1 (since the indices start from 1, not zero).
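
The sizing rule proposed above can be written as a small pre-flight check, which fails with a readable message instead of an opaque CUDA error. This is only a sketch: required_position_embeddings and check_config are hypothetical helpers, not part of any library, and the padding index of 1 in the usage lines is an assumption.

```python
def required_position_embeddings(max_seq_length, padding_idx):
    # Real tokens get position ids padding_idx + 1 .. padding_idx + max_seq_length.
    # An embedding table with N rows accepts indices 0 .. N - 1, hence the extra + 1.
    return max_seq_length + padding_idx + 1

def check_config(max_seq_length, max_position_embeddings, padding_idx):
    """Raise early if the position embedding table is too small (hypothetical helper)."""
    needed = required_position_embeddings(max_seq_length, padding_idx)
    if max_position_embeddings < needed:
        raise ValueError(
            f"max_position_embeddings={max_position_embeddings} is too small for "
            f"max_seq_length={max_seq_length}; need at least {needed}"
        )
    return needed

needed = check_config(max_seq_length=512, max_position_embeddings=514, padding_idx=1)
print(needed)
```

Under these assumptions a sequence length of 512 needs 514 rows, which matches the value used for the pretrained RoBERTa models mentioned earlier in the thread.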

One thing to note is that tokenizer.model_max_length doesn’t work here: it’s always set to int(1e30). The reason is that model_max_length is only set for the hard-coded list of pre-trained huggingface tokenizers. If you train your own, I don’t see any way of setting it from config when using AutoTokenizer.from_pretrained()
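
One way to work around the int(1e30) value is to derive the usable sequence length from the model config instead of trusting the tokenizer. The sketch below is plain Python; resolve_max_seq_length is a hypothetical helper, and the sentinel value is taken from the observation above rather than from the library source.

```python
VERY_LARGE_INTEGER = int(1e30)  # sentinel value reported above for custom tokenizers

def resolve_max_seq_length(tokenizer_max_length, max_position_embeddings, padding_idx):
    """Pick a safe sequence length (hypothetical helper, not a library function)."""
    # Longest sequence the position embedding table can index:
    # ids run from padding_idx + 1 up to max_position_embeddings - 1
    model_limit = max_position_embeddings - padding_idx - 1
    if tokenizer_max_length >= VERY_LARGE_INTEGER:
        # The tokenizer has no real limit configured, so fall back to the model limit
        return model_limit
    return min(tokenizer_max_length, model_limit)

unset = resolve_max_seq_length(int(1e30), max_position_embeddings=514, padding_idx=1)
short = resolve_max_seq_length(128, max_position_embeddings=514, padding_idx=1)
small_table = resolve_max_seq_length(512, max_position_embeddings=512, padding_idx=1)
print(unset, short, small_table)
```

As far as I know, from_pretrained also accepts a model_max_length keyword argument that sets the limit on the tokenizer directly, but I have not checked this against the version used in this thread.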

When I was using a script to train a RoBERTa model, I also had the issue of not being able to set max_seq_length = 512. Thanks to this information on max_position_embeddings, I set the --config_overrides argument to “max_position_embeddings=514” when running with max_seq_length = 512. The error has not shown up a few training steps in. My model training is still in progress.