The LongT5 paper claims that LongT5 is compatible with existing T5 checkpoints:
> We experiment with two attention mechanism variations for LongT5 […]: (1) Local Attention and (2) Transient Global Attention (TGlobal).
> Both variations preserve several properties of T5: […] compatibility with T5 checkpoints.
How do I accomplish this with transformers?
from transformers import T5Tokenizer, LongT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = LongT5ForConditionalGeneration.from_pretrained("t5-small")
The console output tells me there is a difference in naming:
- encoder.block.0.layer.0.SelfAttention.o.weight for T5
- encoder.block.0.layer.0.LocalSelfAttention.o.weight for LongT5
Is there a way to map the weight names when loading a checkpoint? Or should I download the checkpoint and modify the files before loading it?
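One direction I've been considering is a manual remap (a rough sketch, not verified: it assumes the only encoder-side naming difference is SelfAttention → LocalSelfAttention, that the decoder keys already match, and that anything left over can be handled by strict=False):

```python
from transformers import (
    LongT5Config,
    LongT5ForConditionalGeneration,
    T5ForConditionalGeneration,
)

# Load the original t5-small weights.
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

# Build a LongT5 (local attention) config mirroring t5-small's hyperparameters.
# local_radius / global_block_size are left at their defaults.
config = LongT5Config(
    vocab_size=t5.config.vocab_size,
    d_model=t5.config.d_model,
    d_kv=t5.config.d_kv,
    d_ff=t5.config.d_ff,
    num_layers=t5.config.num_layers,
    num_decoder_layers=t5.config.num_decoder_layers,
    num_heads=t5.config.num_heads,
    feed_forward_proj=t5.config.feed_forward_proj,
    encoder_attention_type="local",
)
long_t5 = LongT5ForConditionalGeneration(config)

# Rename encoder self-attention keys: SelfAttention -> LocalSelfAttention.
# Assumption: all other keys are already named identically.
state_dict = {
    (k.replace("SelfAttention", "LocalSelfAttention") if k.startswith("encoder.") else k): v
    for k, v in t5.state_dict().items()
}

# strict=False so any LongT5-specific parameters stay randomly initialized;
# the returned lists show which keys did not transfer.
missing, unexpected = long_t5.load_state_dict(state_dict, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)
```

If that is a reasonable direction, I'd still like to know whether transformers offers a built-in way to do this key mapping at load time instead.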
Background:
For my use case, I have long input sequences and frequently run into memory problems. LongT5 looks like a good candidate, but the smallest officially released checkpoint is google/long-t5-<type>-base. That's why I would like to try the t5-small checkpoint instead.