Using a LongT5 model with a T5 checkpoint

The LongT5 paper states that LongT5 is compatible with T5 checkpoints:

> We experiment with two attention mechanism variations for LongT5 […]: (1) Local Attention and (2) Transient Global Attention (TGlobal). Both variations preserve several properties of T5: […] compatibility with T5 checkpoints.
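
(As I understand it, the released LongT5 checkpoints correspond to these two variants, e.g. google/long-t5-local-base and google/long-t5-tglobal-base, and the variant is selected through `encoder_attention_type` in the config. Rough sketch, the config values below are only illustrative:)

```python
from transformers import LongT5Config

# Sketch: the encoder attention variant is chosen via the config.
# "local" -> Local Attention, "transient-global" -> TGlobal.
local_config = LongT5Config(encoder_attention_type="local")
tglobal_config = LongT5Config(encoder_attention_type="transient-global")
```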

How do I accomplish this with `transformers`?

```python
from transformers import T5Tokenizer, LongT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = LongT5ForConditionalGeneration.from_pretrained("t5-small")
```

The console output tells me that the parameter names differ between the two models:

- `encoder.block.0.layer.0.SelfAttention.o.weight` for T5
- `encoder.block.0.layer.0.LocalSelfAttention.o.weight` for LongT5

Is there a way to map the weight names when loading a checkpoint? Or should I download the checkpoint and modify the files before loading it?
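
To make the second option concrete, this is roughly the mapping I have in mind: a minimal sketch assuming the only encoder-side difference is the `SelfAttention` → `LocalSelfAttention` renaming (the decoder appears to keep T5's names), and the copied config fields below are my guess at what has to match:

```python
from transformers import (
    LongT5Config,
    LongT5ForConditionalGeneration,
    T5ForConditionalGeneration,
)

# 1) Load the T5 checkpoint normally and grab its weights.
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
t5_state = t5.state_dict()

# 2) Build a LongT5 config that mirrors t5-small's architecture
#    (assumption: these are the fields that need to agree).
config = LongT5Config(
    vocab_size=t5.config.vocab_size,
    d_model=t5.config.d_model,
    d_kv=t5.config.d_kv,
    d_ff=t5.config.d_ff,
    num_layers=t5.config.num_layers,
    num_decoder_layers=t5.config.num_decoder_layers,
    num_heads=t5.config.num_heads,
    relative_attention_num_buckets=t5.config.relative_attention_num_buckets,
    feed_forward_proj=t5.config.feed_forward_proj,
    tie_word_embeddings=t5.config.tie_word_embeddings,
    encoder_attention_type="local",  # Local Attention variant
)
model = LongT5ForConditionalGeneration(config)

# 3) Rename the encoder self-attention keys to LongT5's local-attention
#    names; decoder keys keep T5's "SelfAttention"/"EncDecAttention" names.
mapped_state = {}
for name, tensor in t5_state.items():
    if name.startswith("encoder.") and ".SelfAttention." in name:
        name = name.replace(".SelfAttention.", ".LocalSelfAttention.")
    mapped_state[name] = tensor

# 4) strict=False reports anything that still does not line up
#    instead of raising, so the leftover keys can be inspected.
missing, unexpected = model.load_state_dict(mapped_state, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```

For the TGlobal variant I would expect the extra global-attention parameters to have no T5 counterpart, so they should stay randomly initialized and show up among the missing keys.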

Background:

For my use case, I have long input sequences and frequently run into memory problems. LongT5 looks like a good candidate, but the smallest officially released checkpoint is `google/long-t5-<type>-base`. That’s why I would like to try the `t5-small` checkpoint instead.
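
If the mapping above works, I would then run the model on my long inputs roughly like this (the input text and the 4096-token limit are only placeholders):

```python
# Hypothetical usage of the converted model on a long input;
# the text and max_length are illustrative, not real data.
long_text = "summarize: " + " ".join(["lorem ipsum"] * 3000)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```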