@lewtun @valhalla @nielsr @patrickvonplaten I am planning to pretrain a multilingual T5 small and/or medium model from scratch. I came across this post and the Hugging Face implementation of T5. My question is: can I use the same pretraining script from T5, just replacing T5Config with MT5Config? Would this work?
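To make the question concrete, this is roughly what I had in mind: keep the existing T5 pretraining script and only swap the config/model classes, starting from randomly initialized weights. The tokenizer path below is just a placeholder for a SentencePiece tokenizer trained on my own corpus (mT5's own tokenizer could also be reused).

```python
from transformers import AutoTokenizer, MT5Config, MT5ForConditionalGeneration

# Placeholder: a tokenizer trained on the multilingual corpus and saved locally
tokenizer = AutoTokenizer.from_pretrained("./my-mt5-tokenizer")

# Reuse the mt5-small architecture hyperparameters, but adjust the vocab size
# to the custom tokenizer; no pretrained weights are loaded this way.
config = MT5Config.from_pretrained("google/mt5-small", vocab_size=len(tokenizer))
model = MT5ForConditionalGeneration(config)  # randomly initialized, not from_pretrained

print(f"Parameters: {model.num_parameters():,}")
```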
Also, how should the dataset be arranged for multilingual pretraining? Should the languages be arranged sequentially, with all sequences of one language followed by the next (e.g. [French, German, Italian]), or should all the languages be randomly shuffled/interleaved? (See the sketch at the end of this post for what I mean by interleaving.)
For the record, I am planning to pretrain mT5 on Indian languages using the OSCAR corpus plus some additionally sourced text corpora.
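Regarding the data arrangement question, here is the kind of mixing I was thinking of. This is only a rough sketch of my assumptions, not a confirmed recipe: the language codes are examples, and the temperature-based sampling with alpha = 0.3 follows what the mT5 paper describes, implemented here via `datasets.interleave_datasets`.

```python
from datasets import load_dataset, interleave_datasets

# Example OSCAR subsets; config names follow "unshuffled_deduplicated_<lang>"
langs = ["hi", "ta", "te"]
raw = [load_dataset("oscar", f"unshuffled_deduplicated_{l}", split="train") for l in langs]

# Temperature-based sampling (alpha = 0.3, as in the mT5 paper) so that
# low-resource languages are up-sampled rather than drowned out.
sizes = [ds.num_rows for ds in raw]
alpha = 0.3
weights = [n ** alpha for n in sizes]
probs = [w / sum(weights) for w in weights]

# Draw examples randomly across languages according to these probabilities
mixed = interleave_datasets(raw, probabilities=probs, seed=42)
```

With `probabilities` set, examples are drawn randomly across languages according to the weights, which corresponds to the shuffled option rather than the strictly sequential one.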