T5 evaluation via Trainer `predict_with_generate` extremely slow on TPU?

Here is a Colab notebook demonstrating the issue: Google Colab

After the initial XLA compilation period, training proceeds quickly. But when evaluation runs at the end of an epoch, it’s extremely slow. I assumed there was just another initial period of slowness, but after 25 minutes the estimated evaluation time was 6 hours. For reference, a single evaluation pass on a P100 took me ~9 minutes.

I found an old notebook by @valhalla (Google Colab) where he says:

Second, for some reason which I couldn’t figure out, the .generate method is not working on TPU so will need to do prediction on CPU.

It’s unclear to me whether “not working” means it didn’t run at all or whether he ran into the same slowness.
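For concreteness, here is roughly what I understand "do prediction on CPU" to mean. This is only a minimal sketch, not code from his notebook: the tiny randomly initialised T5 config and the random input ids are placeholders I made up so the snippet runs without downloading anything, and in practice the model would be the fine-tuned one coming out of the Trainer.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Hypothetical tiny T5 with random weights, used only so this sketch is
# self-contained; substitute the real fine-tuned model here.
config = T5Config(
    vocab_size=128,
    d_model=32,
    d_kv=8,
    d_ff=64,
    num_layers=1,
    num_heads=4,
    pad_token_id=0,
    eos_token_id=1,
    decoder_start_token_id=0,
)
model = T5ForConditionalGeneration(config)

# Move the model off the accelerator and switch to inference mode,
# so that .generate runs on CPU instead of the TPU.
model = model.to("cpu").eval()

# Placeholder batch: 2 sequences of 10 random token ids (avoiding pad/eos).
input_ids = torch.randint(2, config.vocab_size, (2, 10))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    generated = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=16,    # cap on the generated sequence length
        num_beams=1,      # greedy decoding
        do_sample=False,
    )

print(generated.shape)
```

This avoids running `.generate` on the TPU entirely, at the cost of doing the whole evaluation on CPU, so I’d rather not go that route if there is a way to make `predict_with_generate` fast on the TPU itself.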

Anyone run into this issue?