GPT2 for Indonesia
Currently, there are only a few GPT2 models trained from scratch for Indonesian on the Hub, and only in the small size: Hugging Face – The AI community building the future. For this project, the goal is to create a strong language generation model for Indonesian.
Model
A randomly initialized GPT2 model
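For illustration, such a model can be created by building a GPT2Config and instantiating the Flax model class directly; the hyperparameters below are the standard small-size values and are meant as a sketch, not a fixed choice:

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Small-size GPT2 hyperparameters; vocab_size should match the tokenizer
# trained on the Indonesian corpus.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Flax models in transformers initialize their weights at construction time,
# so this yields a randomly initialized (untrained) model.
model = FlaxGPT2LMHeadModel(config, seed=42)
```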
Datasets
One can make use of OSCAR; the dataset is also available through the datasets library here: oscar · Datasets at Hugging Face (a loading sketch follows this list).
A portion of mC4 [MC4] (The C4 Multilingual Dataset · Discussion #5265 · allenai/allennlp · GitHub) and
CC100 [CC-100: Monolingual Datasets from Web Crawl Data] can also be used for this task.
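As a minimal sketch, the Indonesian portion of OSCAR can be loaded through the datasets library; unshuffled_deduplicated_id is the deduplicated Indonesian config listed on the OSCAR dataset page:

```python
from datasets import load_dataset

# Deduplicated Indonesian subset of OSCAR.
oscar_id = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

print(oscar_id[0]["text"][:200])  # peek at the first document
```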
Available training scripts
A causal language modeling script for Flax is available here. It can be used pretty much without any code changes.
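The script expects an existing tokenizer, so one preparatory step is typically training one on the Indonesian corpus. Below is a hedged sketch using the tokenizers library; the vocabulary size, batch size, and output path are illustrative assumptions:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

def batch_iterator(batch_size=1000):
    # Yield the raw text column in batches to avoid holding it all in memory.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Byte-level BPE, as used by the original GPT2.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

tokenizer.save("./gpt2-indonesian/tokenizer.json")  # hypothetical output path
```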
(Optional) Desired project outcome
The desired project output is a GPT2 model that is able to generate fluent Indonesian text.
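As a sketch of what that outcome could look like in use (the model id flax-community/gpt2-indonesian below is hypothetical, purely for illustration):

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_id = "flax-community/gpt2-indonesian"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxGPT2LMHeadModel.from_pretrained(model_id)

# "Jakarta adalah" means "Jakarta is"; sample a short continuation.
inputs = tokenizer("Jakarta adalah", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```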
(Optional) Challenges
The mC4 and CC100 datasets are quite large → around 350 GB. One might want to try out dataset streaming to solve this problem (a streaming sketch is shown after this list). Dataset streaming will be merged into datasets
in a couple of days. See the PR here and the docs here.
Since mC4 is not available in datasets yet, some adjustments could be required to stream it.
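For OSCAR, which is already in datasets, a streaming setup could look roughly like the sketch below once the feature is merged; it assumes the streaming=True flag described in the linked docs:

```python
from datasets import load_dataset

# With streaming=True the corpus is read on the fly
# instead of being downloaded to disk first.
streamed = load_dataset(
    "oscar", "unshuffled_deduplicated_id", split="train", streaming=True
)

# A streamed dataset is iterated rather than indexed.
for i, example in enumerate(streamed):
    print(example["text"][:100])
    if i == 2:
        break
```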
(Optional) Links to read upon
The most important read would be the following Colab: