Pre-train RoBERTa from Scratch for Georgian Language

Currently, there are no open-source language models for Georgian. I have a relatively small dataset that I want to use to pre-train RoBERTa for Georgian from scratch.

2. Language

The model will be trained in Georgian.

3. Model

RoBERTa
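Since there is no existing Georgian tokenizer or checkpoint, both the tokenizer and the model config have to be created from scratch. Below is a rough sketch of how this could look with the tokenizers and transformers libraries; the corpus file path, vocabulary size, and model size are assumptions, not final choices.

```python
# Sketch: train a byte-level BPE tokenizer on a Georgian text corpus and
# create a RoBERTa config from scratch. Paths and hyperparameters are assumed.
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

# Train the tokenizer on plain-text files of the corpus (hypothetical path)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["georgian_corpus.txt"],          # assumed corpus file
    vocab_size=50_265,                      # RoBERTa-base default vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./georgian-roberta")  # writes vocab.json and merges.txt

# RoBERTa-base sized config; a smaller model may fit a small dataset better
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
)
config.save_pretrained("./georgian-roberta")
```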

4. Datasets

Wikipedia dump
Common Crawl dump
random web scrapes (a loading sketch follows below)
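One possible way to pull these corpora together is the datasets library. The sketch below is only illustrative: the dataset and config names (the Wikipedia dump date and the OSCAR Georgian config) are assumptions that need to be checked on the Hub, and some Wikipedia configs require extra preprocessing.

```python
# Sketch: load the candidate corpora and export one plain-text file.
# Dataset/config names are assumptions; verify them on the Hugging Face Hub.
from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20220301.ka", split="train")              # Georgian Wikipedia (assumed config)
oscar = load_dataset("oscar", "unshuffled_deduplicated_ka", split="train")  # Common Crawl via OSCAR (assumed config)

# Keep only the raw text column so both corpora share the same schema
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
oscar = oscar.remove_columns([c for c in oscar.column_names if c != "text"])

corpus = concatenate_datasets([wiki, oscar])

# Write one document per line for the tokenizer and the training script
with open("georgian_corpus.txt", "w", encoding="utf-8") as f:
    for example in corpus:
        f.write(example["text"].replace("\n", " ") + "\n")
```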

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

transformers/examples/flax/language-modeling (https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
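An illustrative invocation of the MLM pre-training script (run_mlm_flax.py) could look like the following; all paths and hyperparameters are placeholders, and the exact arguments should be checked against the script's --help for the version in use.

```bash
# Assumed paths: the tokenizer/config from the sketch above and the exported corpus file
python run_mlm_flax.py \
    --output_dir="./georgian-roberta" \
    --model_type="roberta" \
    --config_name="./georgian-roberta" \
    --tokenizer_name="./georgian-roberta" \
    --train_file="georgian_corpus.txt" \
    --max_seq_length="128" \
    --per_device_train_batch_size="64" \
    --learning_rate="3e-4" \
    --warmup_steps="1000" \
    --num_train_epochs="10" \
    --overwrite_output_dir
```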

Cool, well defined!