For which task?
I don’t have any particular task in mind yet. Just exploring for now.
There is no feasibility issue.
I see … thanks for clarifying.
I just got excited about BART/Pegasus since they performed the best in my summarization experiments.
Are you suggesting that you got better results with BART, compared with T5?
Re: distilling on TPUs: I guess one limitation here is that T5 (11B, the teacher) would not fit on many common GPUs, right? I wonder if it is possible to pre-extract the teacher logits (say, on a TPU) and just load them in the distiller code. Do you have any thoughts on this issue, @sshleifer?
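For reference, here is a rough sketch of what I have in mind, assuming a plain PyTorch/transformers setup. The function names, the `batches` iterable, and the cache path are all hypothetical, just for illustration; this is not from the existing distillation scripts:

```python
import torch
from transformers import T5ForConditionalGeneration

TEACHER_NAME = "t5-11b"  # hypothetical: the large teacher checkpoint

@torch.no_grad()
def precompute_teacher_logits(batches, out_path="teacher_logits.pt"):
    """Run the teacher once (e.g. on a TPU or a big-memory host)
    and cache its logits to disk."""
    teacher = T5ForConditionalGeneration.from_pretrained(TEACHER_NAME)
    teacher.eval()
    cached = []
    for batch in batches:
        out = teacher(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        # Store on CPU in half precision to keep the cache size down.
        cached.append(out.logits.half().cpu())
    torch.save(cached, out_path)

def load_teacher_logits(path="teacher_logits.pt"):
    """On the student side, load the cached logits instead of
    running the teacher forward pass."""
    return torch.load(path)
```

One caveat I can already see: full logits are vocab-sized per token, so the cache would grow quickly; keeping only the top-k logits per position might be needed to make this practical.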