PyTorch version

Yes, that’s what I worked out. I’m not sure why the position IDs go from padding_index to maximum_sequence_length + padding_index, though (I actually think they start at padding_index + 1). You could set the padding_index of the position_ids embedding to 0 (i.e. different from the padding index of the input_ids) and, as far as I can see, it would still work fine, since the actual indices are always > 0 and there’s no chance of an index-out-of-range error (which doesn’t produce a meaningful exception from CUDA).
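For reference, here’s a rough sketch of how I understand the position IDs get built from the input_ids (the function name here is my own, not the library’s): non-pad tokens get positions starting at padding_index + 1, and pad tokens keep padding_index so they hit the zeroed padding row of the position embedding.

```python
import torch

def create_position_ids(input_ids: torch.Tensor, padding_idx: int) -> torch.Tensor:
    # Non-pad tokens get positions padding_idx + 1, padding_idx + 2, ...
    # Pad tokens stay at padding_idx, so the position embedding's
    # padding row (all zeros) is looked up for them.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1) * mask
    return incremental_indices.long() + padding_idx

# Example with padding_idx = 1 (RoBERTa's usual pad token id);
# the last two tokens are padding.
input_ids = torch.tensor([[0, 42, 17, 2, 1, 1]])
print(create_position_ids(input_ids, padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])  -> positions start at padding_idx + 1
```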

Also, did the original BERT / RoBERTa not use sinusoidal position embeddings? Or was that a later addition?