Pre-train DistilmByT5Neo

Let’s combine ByT5, mT5, GPT-Neo, and DistilBERT!

Model
For initial pre-training, we start from a randomly initialized ByT5 model, which we distil once pre-training is complete.
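
As a rough sketch of the starting point (not the final architecture), the initial model could be a randomly initialized byte-level T5 built with the Flax classes in transformers; all config values below are placeholders:

```python
# Sketch: randomly initialize a ByT5-style (byte-level) encoder-decoder in Flax.
# The config values are placeholders, not the final architecture.
from transformers import T5Config, FlaxT5ForConditionalGeneration, ByT5Tokenizer

config = T5Config(
    vocab_size=384,           # 256 bytes + 3 special tokens + 125 sentinel ids, as in ByT5
    d_model=512,
    d_ff=1024,
    num_layers=6,
    num_decoder_layers=2,     # ByT5 uses a heavier encoder than decoder
    num_heads=8,
    feed_forward_proj="gated-gelu",
    tie_word_embeddings=False,
)

model = FlaxT5ForConditionalGeneration(config, seed=0)  # random weights, no pre-trained checkpoint
tokenizer = ByT5Tokenizer()  # tokenizer-free: maps UTF-8 bytes directly to ids
```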

Datasets
mC4 (multilingual C4)
The Pile
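
One way to feed both corpora is to stream and interleave them with the datasets library; the Hub identifiers, language subset, column names, and mixing ratio below are assumptions, not final choices:

```python
# Sketch: stream and mix mC4 and The Pile without downloading them to disk.
from datasets import load_dataset, interleave_datasets

mc4 = load_dataset("mc4", "en", split="train", streaming=True)    # one language subset shown
pile = load_dataset("the_pile", split="train", streaming=True)

# Reduce both streams to a shared "text" column before interleaving.
mc4 = mc4.map(lambda ex: {"text": ex["text"]}, remove_columns=["timestamp", "url"])
pile = pile.map(lambda ex: {"text": ex["text"]}, remove_columns=["meta"])

# 50/50 mix of the two sources; the ratio is a placeholder.
mixed = interleave_datasets([mc4, pile], probabilities=[0.5, 0.5], seed=42)

for example in mixed.take(2):
    print(example["text"][:200])
```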

Training Scripts
Training scripts will be created as part of the project.

Expected Result
A distilled model that combines the strengths of T5 and GPT-Neo while removing the need for tokenization.
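
For the DistilBERT-style distillation step, the loss could look roughly like the JAX sketch below: a KL term between temperature-scaled teacher and student logits plus the usual cross-entropy on the hard targets. The temperature and weighting are placeholders:

```python
# Sketch of a DistilBERT-style distillation loss in JAX.
import jax
import jax.numpy as jnp
import optax

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets from the (frozen) teacher.
    teacher_probs = jax.nn.softmax(teacher_logits / temperature, axis=-1)
    student_log_probs = jax.nn.log_softmax(student_logits / temperature, axis=-1)
    kl = jnp.sum(teacher_probs * (jnp.log(teacher_probs + 1e-9) - student_log_probs), axis=-1)
    kl = jnp.mean(kl) * temperature ** 2  # standard T^2 scaling

    # Hard-label cross-entropy on the byte-level targets.
    ce = optax.softmax_cross_entropy_with_integer_labels(student_logits, labels).mean()

    return alpha * kl + (1.0 - alpha) * ce
```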

UPDATE: We could just go for it and implement this with rotary embeddings as well.

We could kick this up another notch with rotary embeddings… DistilmRoByT5Neo?
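
For anyone curious what that would involve, here is a minimal JAX sketch of rotary position embeddings (RoPE): queries and keys are rotated pairwise by position-dependent angles before the attention dot product. Shapes and names are illustrative only:

```python
# Sketch of rotary position embeddings (RoPE), as in the RoFormer paper.
import jax.numpy as jnp

def rotary_angles(seq_len, head_dim, base=10000.0):
    # One frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))
    positions = jnp.arange(seq_len)
    return jnp.einsum("i,j->ij", positions, inv_freq)  # (seq_len, head_dim // 2)

def apply_rotary(x, angles):
    # x: (seq_len, num_heads, head_dim); rotate each even/odd pair of dimensions.
    cos = jnp.cos(angles)[:, None, :]  # broadcast over heads
    sin = jnp.sin(angles)[:, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    return jnp.stack([rot_even, rot_odd], axis=-1).reshape(x.shape)

# Apply to queries and keys before computing attention scores.
```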

Finalizing this since another team member will join :) @wolosonovich, it would be great if you could post your team member's Hub name here once you have it.

I will do that, thank you very much @patrickvonplaten!

@patrickvonplaten, can you add @vmazelis to the spreadsheet? He is part of our team for this project as well.


Should be done :)

@patrickvonplaten, can you update the spreadsheet and replace Brett with @bneb10 when you have a chance? Thanks so much!

Link to our Discord server: Flax-HuggingFace-Community-Week