Pretrain GPT2 from scratch in Korean


There are already published Korean GPT2 models, but license restrictions limit their use in industry. Recently, many large public Korean natural language datasets, such as the KLUE dataset, the AI-Hub datasets, and the Modu Corpus, have been released. My goal is to train a Korean GPT2 model that anyone can use for any purpose, using only large publicly accessible datasets, including these recent releases.

2. Language

The model will be trained in Korean.

3. Model

GPT2, trained from scratch.

4. Datasets

KLUE Dataset
KorQuAD Dataset
Modu Corpus (모두의 말뭉치)
Korean text datasets from AI-Hub
+ Maybe extra publicly accessible Korean datasets…

5. Training scripts

A causal language modeling script for Flax is available here.
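As a rough sketch of how that script might be run, here is an illustrative invocation assuming the `run_clm_flax.py` causal LM example from the transformers repository; the output directory, training file, and all hyperparameter values below are placeholders, not final project choices.

```shell
# Illustrative only: train a GPT2 causal LM with the Flax example script.
# Paths and hyperparameters are placeholder assumptions.
python run_clm_flax.py \
    --output_dir="./gpt2-ko" \
    --model_type="gpt2" \
    --config_name="./gpt2-ko" \
    --tokenizer_name="./gpt2-ko" \
    --train_file="./data/train.txt" \
    --do_train \
    --block_size="512" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --num_train_epochs="1"
```

A tokenizer and model config would need to be prepared in the output directory first, since the model is trained from scratch rather than from a pretrained checkpoint.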

6. Challenges

It will take time and effort to preprocess each of the various public Korean datasets into the form most useful for pre-training.
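To give a flavor of that preprocessing, here is a minimal, hypothetical sketch (not the project's actual pipeline): heterogeneous corpora are reduced to plain text lines by stripping control characters, collapsing whitespace, and dropping fragments too short to be useful.

```python
import re

def normalize_line(line: str) -> str:
    """Collapse whitespace and strip control characters from a raw corpus line."""
    line = re.sub(r"[\u0000-\u001f\u007f]", " ", line)  # drop control characters
    line = re.sub(r"\s+", " ", line).strip()            # collapse runs of whitespace
    return line

def clean_corpus(lines, min_chars=10):
    """Yield normalized lines that are long enough to keep for pre-training."""
    for raw in lines:
        line = normalize_line(raw)
        if len(line) >= min_chars:
            yield line

# Tiny usage example with placeholder Korean text
sample = ["  안녕하세요,   반갑습니다.  ", "짧음", "한국어 말뭉치를 \t 전처리하는 예시입니다."]
print(list(clean_corpus(sample)))
```

Each public dataset (KLUE, KorQuAD, Modu Corpus, AI-Hub) ships in its own schema, so the real work is a per-dataset extraction step that feeds text into a shared normalizer like this one.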


I think this will be an amazing challenge! I hope I can join this project!


Awesome! Good luck finalizing this project :slight_smile:

Important!

Among the datasets above, I am correcting the one I incorrectly labeled "Modu" to its official name, "NIKL Corpus (국립국어원 모두의 말뭉치)".