Steps to train T5 on collections of tags

Hiya! I’m working on my own implementation of Imagen. Instead of sentence prompts, my dataset pairs each image with a series of tags that describe it. An example would be (without quotes): “sunny_day park dog parked_motorcycle female_walking”. There can be anywhere from a few tags to 30+ tags per image.

Because Imagen uses T5 to generate text embeddings, I’d need to train a T5 model from scratch on these collections of tags instead of using transfer learning, correct? Would the tags need to be presented as an array of strings, or as one large string? What else would I need to do?

And if it’s possible to answer: if my dataset has about 250K examples, how long would it take to train T5-Large on it, on either a P100 or a latest-generation TPU? Thanks for the help!
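To make the formatting question concrete, here is a minimal sketch of the two input shapes I have in mind, using the example tags from above. The variable names are just placeholders for illustration, and this only shows the data layout, not anything T5-specific.

```python
# Example tags for a single image (same tags as in the post above).
tags = ["sunny_day", "park", "dog", "parked_motorcycle", "female_walking"]

# Option 1: keep the tags as an array of strings, one entry per tag.
as_list = list(tags)

# Option 2: join the tags into one large space-separated string.
as_string = " ".join(tags)

print(as_list)
print(as_string)
```

Either form can be converted to the other with a join or a split on whitespace, so the question is really which one the tokenizer and training pipeline expect.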