Potentially try out a number of model architectures (T5/RoBERTa/GPT-2/BigBird…) using datasets such as OSCAR, mC4, GDELT, … and test with a fine-tuned task
I really like this idea, especially since models on Swahili are quite sparse! Do you think we could settle on one model? It might make the project much easier
I guess I wanted to keep it open in case anybody else joined and was keen on a particular Swahili downstream task, but yes, we could settle on a single transformer architecture.
Alright, let me join you in this project - we have far too few models in Swahili, so I'm happy to try to help you here
Would be awesome if we manage to find other people to join this project - otherwise it'll be just us two
I think we should first decide on a model architecture. I would suggest either BERT or GPT-2. If we stick with BERT, we should also try to find some good downstream data to fine-tune the model on
And it would be great to find some good datasets in Swahili as well
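Once we pick a dataset, a common first preprocessing step for MLM pretraining is concatenating raw text and splitting it into fixed-length blocks. Here is a minimal sketch of that grouping step - whitespace splitting stands in for a real subword tokenizer (e.g. one trained on the Swahili corpus), and the toy sentences are just placeholders, so treat this as an illustration rather than the actual pipeline:

```python
# Sketch: group raw sentences into fixed-length blocks for MLM pretraining.
# Whitespace "tokenization" is a stand-in for a trained subword tokenizer.

def group_texts(sentences, block_size=8):
    """Concatenate tokenized sentences, then split into equal-size blocks."""
    tokens = [tok for s in sentences for tok in s.split()]
    # Drop the tail remainder so every block has exactly block_size tokens,
    # as the common language-modeling preprocessing recipes do.
    total = (len(tokens) // block_size) * block_size
    return [tokens[i:i + block_size] for i in range(0, total, block_size)]

# Toy Swahili lines purely for illustration
corpus = [
    "Habari ya leo rafiki yangu",
    "Ninapenda kusoma vitabu vya historia",
    "Lugha ya Kiswahili inazungumzwa Afrika Mashariki",
]
blocks = group_texts(corpus, block_size=8)
print(len(blocks), [len(b) for b in blocks])  # two blocks of 8 tokens
```

In a real run the same grouping would be applied over the full corpus (e.g. via a map over a `datasets` dataset) before random masking is applied at training time.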
Feel free to also open a discord, we can chat there for more details
Will continue to add them here - Flax Swahili Pretraining - Google Sheets
I will also join this, please add me to the project.
I was hoping to train RoBERTa or other MLMs on another East African language, Tigrinya, which is spoken in Eritrea and Northern Ethiopia and is far less represented than Swahili, but it will be great to join forces here and learn with you all. Or maybe we can consider both, if time allows.
Awesome, added you!