I’m having all sorts of issues training transformers on my 3090 - this card requires CUDA 11.1, which in turn requires torch 1.7.1.
Is this supported? I’m using Python 3.7 (I tried 3.9, but there’s no wheel for one of the dependencies of `datasets` and it wouldn’t build, so I rolled back).
```
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
```
`RobertaForMaskedLM` frequently crashes with CUDA exceptions. If this should work, I’ll create specific issues.
OK, so I tracked down the crash. The problem was the position embedding. I had
`max_seq_length == max_position_embeddings`, which results in a position index > `max_position_embeddings` for any sequence that is truncated (i.e. contains no padding).
This is because the code from `modeling_roberta.py` below adds `padding_idx` to the cumsum - if none of the `input_ids` are padding, the largest index will be > `max_seq_length`:
```python
mask = input_ids.ne(padding_idx).int()
incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
return incremental_indices.long() + padding_idx
```
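To make the failure concrete, here is a torch-free, pure-Python sketch of that logic (the list-based helper `create_position_ids` is my own reimplementation for illustration; `padding_idx = 1` is assumed, matching RoBERTa):

```python
# Pure-Python sketch of the cumsum-and-offset position-id computation
# above (assumption: padding_idx = 1, as in RoBERTa). Reimplemented
# with lists so it runs without torch.
def create_position_ids(input_ids, padding_idx=1):
    mask = [0 if tok == padding_idx else 1 for tok in input_ids]
    position_ids, running = [], 0
    for m in mask:
        running += m  # cumulative count of non-padding tokens
        # padded slots collapse to padding_idx; real tokens get offset indices
        position_ids.append(running * m + padding_idx)
    return position_ids

# A sequence truncated to max_seq_length = 512 contains no padding, so
# the largest position id is 512 + padding_idx = 513 - out of range for
# an embedding table sized max_position_embeddings = 512.
full = create_position_ids([100] * 512)
print(max(full))  # 513

# A padded sequence stays smaller: padded slots map to padding_idx.
padded = create_position_ids([100, 100, 1, 1])
print(padded)  # [2, 3, 1, 1]
```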
It’s hard to know exactly what’s going on without seeing your code, but here is what I can share about RoBERTa. You should not use `max_position_embeddings` as a maximum sequence length. Because the position IDs of RoBERTa go up to `maximum_sequence_length + padding_index`, `max_position_embeddings` is purposely set to 514 (2, the padding index, + 512, the maximum sequence length). You should use `tokenizer.model_max_length` instead (which should be 512).
Yes, that’s what I worked out. I’m not sure why the position IDs go up to `maximum_sequence_length + padding_index` though (I actually think they start at `padding_index + 1`)? You could set the `padding_idx` of the `position_ids` embedding to 0 (i.e. different to the padding index of the `input_ids`) and, as far as I can see, it would still work fine, since the actual indices are always > 0 and there would be no chance of an index out of range (which doesn’t produce a meaningful exception from CUDA).
Also, did the original BERT / RoBERTa not use sinusoidal position embeddings? Or was that a later addition?
This is all to mimic the original implementation of RoBERTa. So no, RoBERTa does not use sinusoidal position embeddings. That’s also why we can’t change the `padding_index` for the `position_ids` - it would break compatibility with the pretrained models.
Thanks, that makes sense - although I used the value of 512 for the position embedding size based on `PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES` in the source - this is also the default for `BertConfig` from what I can see.
Plus it still crashes when `max_position_embeddings = model_max_length + padding_index`; I think it should be `max_position_embeddings = model_max_length + padding_index + 1` (since the indices start from 1, not zero).
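Spelled out with concrete numbers (assuming `padding_idx = 1` and `model_max_length = 512`, as for a roberta-base-style setup), the sizing works out like this:

```python
# Sizing arithmetic for the RoBERTa position embedding table
# (assumed values: model_max_length = 512, padding_idx = 1).
model_max_length = 512  # tokenizer.model_max_length
padding_idx = 1         # RoBERTa's pad token id

# Position ids run from padding_idx + 1 up to model_max_length + padding_idx.
largest_position_id = model_max_length + padding_idx  # 513

# Embedding indices are zero-based, so the table needs one more row
# than the largest index it must hold.
max_position_embeddings = largest_position_id + 1
print(max_position_embeddings)  # 514
```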
One thing to note is that `tokenizer.model_max_length` doesn’t work - it’s always set to `int(1e30)`. The reason is that tokenization_utils_base.py#L1857 only sets `model_max_length` for the hard-coded list of pre-trained Hugging Face tokenizers. If you train your own, I don’t see any way of setting it from config when using `AutoTokenizer.from_pretrained()`.
When I was using the run_mlm.py script to train a RoBERTa model, I also had the issue of not being able to set `max_seq_length = 512`. Thanks to this information on
`max_position_embeddings`, I had to set the `--config_overrides` argument to `"max_position_embeddings=514"` when executing run_mlm.py with `max_seq_length = 512`. The error has not shown up a few training steps in; my model training is still in progress.
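For reference, a run_mlm.py invocation along these lines would look roughly as follows - the tokenizer path, data file, and output directory are placeholders, not from the original post:

```shell
# Sketch of a run_mlm.py launch with the position-embedding override
# discussed above; paths are placeholders.
python run_mlm.py \
    --model_type roberta \
    --tokenizer_name ./my-tokenizer \
    --train_file ./data/train.txt \
    --max_seq_length 512 \
    --config_overrides "max_position_embeddings=514" \
    --do_train \
    --output_dir ./roberta-mlm-out
```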