@patrickvonplaten I was just wondering if you could share any benchmarking or information on the tiny reformer/longformer models you trained. Which models are they distillations of? Have you benchmarked their performance at all?
I am looking to do something similar but was hoping to get the details of these models before progressing.
I’m also wondering if you have any insight into why bert-base is so often used as the teacher model for DistilBERT/TinyBERT-style models. I saw one paper on RoBERTa that suggested distilling from a larger teacher would make more sense, I believe.
AFAIK, the tiny reformer and longformer models are not distilled; they are randomly initialized smaller models created for testing purposes, not meant to be used for training.
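For context, such a tiny test model is typically just a standard architecture instantiated from a shrunken config, with random weights and no pretraining. A minimal sketch with `transformers` (the dimensions below are illustrative assumptions, not the actual values used for the hub models):

```python
from transformers import LongformerConfig, LongformerModel

# Hypothetical tiny config for testing purposes only.
# All sizes here are made-up small values, not the real
# hyperparameters of any published tiny-longformer checkpoint.
config = LongformerConfig(
    vocab_size=1000,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=512,
    attention_window=[32, 32],  # one local-attention window per layer
)

# Randomly initialized weights -- no distillation, no pretraining.
model = LongformerModel(config)
print(sum(p.numel() for p in model.parameters()))  # tiny parameter count
```

A model like this is useful for exercising the forward pass and tokenizer plumbing in unit tests, but its outputs are meaningless until it is trained.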
Hm, the "tiny" designation usually implies distillation. How did you learn this?