RobIt: Pre-train RoBERTa-base from scratch in Italian

1. RobIt

The idea is to pre-train a RoBERTa-base model on the Italian portion of mC4 (about 590 GB) and the Italian OSCAR dataset (about 137 GB).

2. Language

The model will be trained in Italian.

3. Model used

RoBERTa-base

4. Datasets

OSCAR-Italian and the Italian portion of the multilingual C4 (mC4) dataset

5. Training Scripts:

There are already Flax scripts to pre-train RoBERTa that we can easily use:

Google Colaboratory

6. Challenges:

  • The combined corpus is too large to train on in the time available, so we will reduce it by random sampling.
  • Achieving state-of-the-art (SOTA) results.
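
The post only says "random sampling" and does not name a method, so the following is just one possible approach, not the project's actual pipeline. It is a minimal sketch of reservoir sampling (Algorithm R), which draws a fixed-size uniform sample from a stream without loading the stream into memory. The function name `reservoir_sample`, the toy document strings, and the sample size are all hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper: only k items are held in memory at any time,
    so the stream can be far larger than RAM.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy stand-in for a stream of documents (assumption: one string per document).
docs = (f"documento {n}" for n in range(10_000))
subset = reservoir_sample(docs, k=100, seed=0)
```

A fixed seed makes the subset reproducible across runs, which helps if several people need to train on the same sample. If the stream has fewer than `k` items, everything is returned unchanged.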

7. Desired Possible Outcome:

  • A well-performing monolingual Italian model. We could optionally validate our results at a later stage by fine-tuning the model on some downstream tasks.

8. Reads


Count me in !!!


Count me in :grinning_face_with_smiling_eyes:


Would love to be a part of this, count me in too!


Great! Finalizing this project!
