@patrickvonplaten I was just wondering if you could share any benchmarking or information on the tiny reformer/longformer models you trained. Which models are they distillations of? Have you benchmarked their performance at all?
I am looking to do something similar but was hoping to get the details of these models before progressing.
I’m also wondering if you have any insight into why bert-base is so often used as the teacher model for the DistilBERT/TinyBERT models. I saw one paper on RoBERTa suggesting that distilling from a larger teacher would make more sense.
AFAIK, the tiny reformer and longformer models are not distilled; they are randomly initialized smaller models created for testing purposes, not meant to be used for training.
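For illustration, here's a minimal sketch of how such a tiny, randomly initialized model can be created just by shrinking the config (the hyperparameter values below are my own placeholders, not the ones used for the published tiny checkpoints):

```python
from transformers import LongformerConfig, LongformerModel

# Shrink the architecture via the config; no teacher, no distillation.
# These sizes are illustrative only.
config = LongformerConfig(
    vocab_size=30522,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=512,
    attention_window=64,  # local attention window applied to every layer
)

# Instantiating from a config gives randomly initialized weights,
# which is all you need for shape/integration tests.
model = LongformerModel(config)
print(f"{model.num_parameters():,} parameters")
```

Such models are useful for fast unit tests and debugging pipelines, since they exercise the full forward pass without the cost of real pretrained weights.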