Questions on distilling [from] T5

For which task?

I don’t have any particular task in mind, yet. Just exploring for now.

There is no feasibility issue.

I see … thanks for clarifying.

I just got excited about BART/Pegasus since they performed best in my summarization experiments.

Are you suggesting that you got better results with BART than with T5?

Re: distilling on TPUs: I guess one limitation here is that T5 (11B, the teacher) would not fit on most common GPUs, right? I wonder if it is possible to pre-extract the teacher logits (say, on a TPU) and just load them in the distiller code. Do you have any thoughts on this, @sshleifer?
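
Something like the sketch below is what I have in mind. It's not any existing distiller code; t5-small stands in for both the 11B teacher (which would run once on the TPU host) and the student, and the cache file name, temperature, and loss weighting are made up for illustration:

```python
import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
src = tok(["summarize: The quick brown fox jumps over the lazy dog."], return_tensors="pt")
tgt = tok(["A fox jumps over a dog."], return_tensors="pt").input_ids

# Phase 1 (run once on hardware that fits the teacher, e.g. a TPU host):
# forward the teacher over the training data and dump its logits to disk.
teacher = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
with torch.no_grad():
    t_logits = teacher(
        input_ids=src.input_ids, attention_mask=src.attention_mask, labels=tgt
    ).logits
torch.save({"logits": t_logits, "labels": tgt}, "teacher_batch0.pt")
del teacher  # the teacher never needs to be in memory again

# Phase 2 (on a commodity GPU): train the student against the cached logits
# instead of a live teacher forward pass.
student = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder student
cache = torch.load("teacher_batch0.pt")
out = student(
    input_ids=src.input_ids, attention_mask=src.attention_mask, labels=cache["labels"]
)
T = 2.0  # softmax temperature, a common distillation hyperparameter
kl = F.kl_div(
    F.log_softmax(out.logits / T, dim=-1),
    F.softmax(cache["logits"] / T, dim=-1),
    reduction="batchmean",
) * T ** 2
loss = out.loss + kl  # cross-entropy on the labels plus the distillation term
loss.backward()
```

One caveat I can already see: the full logits are seq_len × ~32k vocab floats per example, so the cache could get huge for a real dataset; I imagine storing only the top-k teacher logits per position would keep it manageable.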