1. RobIt
The idea is to use the Italian portion of mC4 (about 590 GB) and the Italian portion of the OSCAR dataset (about 137 GB) to pre-train a RoBERTa-base model.
2. Language
The model will be trained on Italian text.
3. Model used
RoBERTa-base
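As a concrete reference for the architecture, here is a minimal sketch of instantiating the model from scratch in Flax; the hyperparameter values are the standard RoBERTa-base defaults, not project-specific choices:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Standard RoBERTa-base hyperparameters: 12 layers, hidden size 768,
# 12 attention heads (roughly 125M parameters).
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Instantiate with randomly initialised weights for pre-training from scratch.
model = FlaxRobertaForMaskedLM(config)
```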
4. Datasets
OSCAR (Italian portion), Italian portion of the multilingual C4 (mC4) dataset
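A minimal sketch of loading both corpora with the `datasets` library in streaming mode, so the full ~730 GB never has to sit on disk at once. The dataset and config names follow the Hugging Face Hub listings but should be verified, and interleaving streamed datasets needs a recent `datasets` version:

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora instead of downloading them in full.
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset(
    "oscar", "unshuffled_deduplicated_it", split="train", streaming=True
)

# Mix the two text streams into a single pre-training corpus.
corpus = interleave_datasets([mc4_it, oscar_it])

for example in corpus.take(3):
    print(example["text"][:80])
```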
5. Training Scripts:
There are already Flax scripts to pre-train RoBERTa in the Transformers repository (the masked-language-modeling example) that we can use with little modification.
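For orientation, here is a condensed sketch of the masked-LM training step those scripts implement; it is an illustration under our own assumptions (batch layout, optimizer settings), not the scripts' actual code:

```python
import jax
import jax.numpy as jnp
import optax
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

model = FlaxRobertaForMaskedLM(RobertaConfig())
optimizer = optax.adamw(learning_rate=1e-4)  # assumed settings
opt_state = optimizer.init(model.params)

def mlm_loss(params, batch):
    # batch["labels"] holds the original token ids at masked positions
    # and -100 elsewhere, as produced by the usual MLM data collators.
    # Dropout is omitted for brevity (train=False).
    logits = model(batch["input_ids"], params=params, train=False).logits
    labels = batch["labels"]
    mask = labels != -100
    log_probs = jax.nn.log_softmax(logits)
    picked = jnp.take_along_axis(
        log_probs, jnp.where(mask, labels, 0)[..., None], axis=-1
    ).squeeze(-1)
    # Average cross-entropy over the masked positions only.
    return -(picked * mask).sum() / mask.sum()

@jax.jit
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(mlm_loss)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss
```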
6. Challenges:
- The combined corpora come to roughly 730 GB, which is too much data for the available time. We will reduce the amount by random sampling (see the sketch after this list).
- Achieving state-of-the-art (SOTA) results.
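A sketch of one way to do the random sampling on a streamed dataset, keeping each document independently with a fixed probability; the 10% rate is an assumption to be tuned to the time budget, and filtering streamed datasets needs a recent `datasets` version:

```python
import random
from datasets import load_dataset

SAMPLE_RATE = 0.10  # assumed fraction, tune to the compute budget
rng = random.Random(42)

stream = load_dataset("mc4", "it", split="train", streaming=True)

# Bernoulli sampling: keep each document with probability SAMPLE_RATE.
sampled = stream.filter(lambda _example: rng.random() < SAMPLE_RATE)
```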
7. Desired Outcome:
- An Italian monolingual, well-performing model. We could optionally validate our results at a later stage by fine-tuning the model on downstream tasks (see the sketch below).
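A hypothetical sketch of that validation step, loading the released checkpoint for a downstream classification task; the checkpoint name and label count are placeholders:

```python
from transformers import AutoTokenizer, FlaxRobertaForSequenceClassification

# Placeholder repo name for wherever the pre-trained model gets published.
checkpoint = "robit/roberta-base-italian"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = FlaxRobertaForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # e.g. binary sentiment; task-dependent
)

inputs = tokenizer("Che bel film!", return_tensors="jax")
logits = model(**inputs).logits
```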
8. Reads
- RoBERTa paper: https://arxiv.org/pdf/1907.11692.pdf
- We found an existing Italian model based on RoBERTa, but we plan to use a larger dataset and a different tokenisation (see the tokenizer sketch below).
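For the different tokenisation, a sketch of training a byte-level BPE tokenizer (the scheme RoBERTa uses) from scratch with the `tokenizers` library; the input file and output directory are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Train on plain-text files dumped from the sampled Italian corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["italian_corpus.txt"],  # placeholder path
    vocab_size=50_265,             # matches RoBERTa-base's vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./robit-tokenizer")  # placeholder output directory
```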