Korean-GPT
1. Description
There are already published Korean GPT2 models, but license restrictions limit their use in industry. Recently, a large number of public South Korean natural language datasets, such as the KLUE dataset, the AI-Hub datasets, and the Modu Corpus, have been released. My goal is to train a Korean GPT2 model that anyone can use for any purpose, using only large, publicly accessible datasets, including these recently released ones.
2. Language
The model will be trained in Korean.
3. Model
GPT2
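Since the model will be pre-trained from scratch rather than fine-tuned, below is a minimal sketch of initializing a randomly weighted GPT2 with Flax. The config values (vocabulary size, context length, depth) are illustrative assumptions, not final hyperparameters.

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_000,  # assumed Korean BPE vocabulary size, to be decided
    n_positions=1024,   # context length
    n_embd=768,         # GPT2-small width
    n_layer=12,
    n_head=12,
)
# Instantiating the Flax class directly initializes the weights from scratch.
model = FlaxGPT2LMHeadModel(config, seed=42)
```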
4. Datasets
- KLUE dataset: https://klue-benchmark.com
- KorQuAD dataset: https://korquad.github.io
- Modu Corpus (모두의 말뭉치): https://corpus.korean.go.kr
- Korean text datasets from AI-Hub: https://aihub.or.kr
- Possibly other publicly accessible Korean datasets
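The first two datasets have community mirrors on the Hugging Face Hub, so they can be loaded directly with the Datasets library; the Modu Corpus and AI-Hub data require manual download after registration. The dataset IDs below ("klue", "squad_kor_v1") are the Hub mirrors I am assuming, and "ynat" is just one of the KLUE tasks.

```python
from datasets import load_dataset

klue_ynat = load_dataset("klue", "ynat")  # KLUE topic classification task
korquad = load_dataset("squad_kor_v1")    # KorQuAD 1.0

print(klue_ynat["train"][0])
print(korquad["train"][0]["context"][:100])
```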
5. Training scripts
A causal language modeling script for Flax is available in the Hugging Face Transformers examples (examples/flax/language-modeling/run_clm_flax.py).
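For reference, here is a minimal sketch of the block-grouping step that such causal language modeling scripts perform: tokenized documents are concatenated and cut into fixed-length blocks, and the labels are simply a copy of the inputs, since the model shifts them internally. The `block_size` value is an illustrative assumption.

```python
from itertools import chain

def group_texts(examples, block_size=1024):
    # Concatenate every tokenized field (input_ids, attention_mask, ...)
    # across the batch of documents.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    # Drop the tail so the corpus splits evenly into blocks.
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels are the inputs; the shift happens in the model.
    result["labels"] = result["input_ids"].copy()
    return result
```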
6. Challenges
Each of the public Korean datasets above uses a different schema, so it will take time and effort to preprocess them into the form that is most helpful for pre-training; one possible normalization step is sketched below.
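As a rough illustration, this sketch maps records from differently structured datasets onto a single text column. The field names checked below are assumptions based on common schemas, not a verified mapping for every corpus.

```python
from datasets import load_dataset

def to_text(example):
    # Each source stores its raw text under a different field name,
    # so map every record onto a single "text" column.
    for key in ("text", "context", "sentence", "title"):
        if key in example and example[key]:
            return {"text": example[key]}
    return {"text": ""}

korquad = load_dataset("squad_kor_v1", split="train")
corpus = korquad.map(to_text, remove_columns=korquad.column_names)
corpus = corpus.filter(lambda ex: len(ex["text"]) > 0)
print(corpus[0]["text"][:80])
```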