Let’s combine ByT5, mT5, GPT-Neo and DistilBERT!
For initial pre-training, we will use a randomly initialized ByT5 model, which we will distil once pre-training is complete.
Training scripts will be created as part of the project.
A distilled model that combines the strengths of T5 and GPT-Neo while removing the need for a learned subword tokenizer, since ByT5 operates directly on UTF-8 bytes.
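To make the "no tokenization" point concrete, here is a minimal sketch of byte-level encoding in the style ByT5 uses. It assumes the ByT5 convention that ids 0, 1 and 2 are reserved for padding, end-of-sequence and unknown, so each UTF-8 byte `b` maps to id `b + 3`. The helper names `encode_bytes` and `decode_bytes` are made up for illustration and are not part of any library.

```python
from typing import List

# Assumption: ByT5-style id layout (0 = pad, 1 = end-of-sequence, 2 = unknown),
# so raw byte values are shifted up by 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> List[int]:
    """Map text to ids: one id per UTF-8 byte, followed by end-of-sequence."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping anything outside the 256 byte ids."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


print(encode_bytes("hi"))                    # [107, 108, 1]
print(decode_bytes(encode_bytes("héllo")))   # héllo
```

The vocabulary is fixed by the byte range, so there is nothing to train or ship besides the offset. The trade-off is longer sequences: "é" alone takes two byte ids.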
UPDATE: We could just go for it and implement this with rotary embeddings as well.
We could kick this up a notch further with rotary embeddings…DistilmRoByT5Neo?
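For anyone unfamiliar with rotary embeddings, here is a rough, dependency-free sketch of the idea. It assumes the interleaved pairing from the original formulation, where elements `(0, 1)`, `(2, 3)`, ... are each rotated by a position-dependent angle, and a base of 10000. Real implementations differ in how they lay out the pairs, and the function names here are only illustrative.

```python
import math
from typing import List


def rotary(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of vec by angles that grow with position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("rotary embeddings need an even dimension")
    out = []
    for i in range(0, dim, 2):
        # i is the element index of the pair, so -i / dim equals -2 * pair_index / dim.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset (2 in both cases).
score_a = dot(rotary(q, 5), rotary(k, 3))
score_b = dot(rotary(q, 12), rotary(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

The property checked at the end is the appeal for this project: query-key scores depend only on relative position, with no learned position table.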
Finalizing, since another team member will join. @wolosonovich, it would be great if you could post the hub name of your team member here once you have it.
I will do that, thank you very much, @patrickvonplaten!
@patrickvonplaten can you add @vmazelis to the spreadsheet? He is part of our team for this project as well.
@patrickvonplaten can you update the spreadsheet and replace Brett with @bneb10 when you have a chance? Thanks so much!