I am interested in taking a RoBERTa model and turning it into a BigBird model that can handle long sequences. In the BigBird paper, they mention doing a warm start from the RoBERTa checkpoint, which I interpret to mean they loaded all the pre-trained weights into the new model before doing more training on long sequences. It looks like all of the model components are identical, apart from the full attention being replaced with block sparse attention and the positional embeddings being different sizes.
The main problem I have right now is dealing with the positional embeddings. BigBird's embedding table will be much larger than RoBERTa's, because the positional embeddings have shape (max_position_embeddings, hidden_size) and the two models use different values:

- RoBERTa: max_position_embeddings = 514
- BigBird: max_position_embeddings = 4096
Any ideas on how I should handle this?
My only ideas right now are:
- Reinit all positional embeddings
- Use the RoBERTa positional embeddings for the first 514 positions and reinit the remaining ones.
- Use the pretrained BigBird positional embeddings (but this feels like cheating and is definitely not what the original authors did).
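For what it's worth, the second option is straightforward to implement as plain array surgery on the embedding tables. Here is a minimal NumPy sketch of it, plus a tiling variant (the Longformer authors reported that repeatedly copying the pretrained table into the new slots worked better than random init, so it may be worth trying here too). The function names are mine, and the random arrays stand in for the actual checkpoint weights:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 768
roberta_max_pos = 514   # RoBERTa: 512 tokens + 2 offset positions
bigbird_max_pos = 4096

# Stand-in for the pretrained RoBERTa positional embedding table.
roberta_pos = rng.normal(0.0, 0.02, size=(roberta_max_pos, hidden_size))

def warm_start_positions(pretrained: np.ndarray, new_max: int) -> np.ndarray:
    """Option 2: copy the pretrained rows into the first slots of a larger
    table and randomly initialize the remainder."""
    old_max, hidden = pretrained.shape
    new_table = rng.normal(0.0, 0.02, size=(new_max, hidden))
    new_table[:old_max] = pretrained
    return new_table

def tile_positions(pretrained: np.ndarray, new_max: int) -> np.ndarray:
    """Variant: tile the pretrained table until the new table is full,
    then truncate to the new length."""
    old_max, _ = pretrained.shape
    reps = -(-new_max // old_max)  # ceil division
    return np.tile(pretrained, (reps, 1))[:new_max]

bigbird_pos = warm_start_positions(roberta_pos, bigbird_max_pos)
bigbird_pos_tiled = tile_positions(roberta_pos, bigbird_max_pos)
```

Either result can then be assigned back into the new model's position embedding weight before continuing pretraining on long sequences.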