Several Korean GPT2 models have already been published, but license restrictions limit their use in industry. Recently, a large number of publicly available South Korean natural language datasets, such as the KLUE dataset, the AI-Hub datasets, and the Modu Corpus, have been released. My goal is to train a Korean GPT2 model that anyone can use for any purpose, using only large publicly accessible datasets, including these recently released ones.
The model will be trained in Korean.
Modu Corpus (모두의 말뭉치)
Korean text datasets from AI-Hub
+ Possibly additional publicly accessible Korean datasets
A causal language modeling script for Flax is available here.
Preprocessing each of the various public Korean datasets into the form most useful for pre-training will take time and effort.
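As one illustration of the kind of preprocessing involved, a minimal sketch of a common step for causal LM pre-training: concatenating tokenized documents and cutting them into fixed-length blocks. The `group_texts` helper and the toy token ids below are illustrative assumptions, not part of any released pipeline.

```python
from typing import Dict, List


def group_texts(examples: Dict[str, List[List[int]]],
                block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate all token-id lists, then cut them into blocks of `block_size`.

    Leftover tokens shorter than `block_size` are dropped, a common
    convention in causal language modeling pre-training.
    """
    # Flatten each column of the batch into one long sequence.
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Round down to a multiple of block_size, dropping the remainder.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Toy example: three short "documents" of token ids, block size 4.
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In a real pipeline this kind of function would typically be applied in batches over each tokenized dataset before the resulting blocks are fed to the training script.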