Pretrain GPT2 from scratch in Korean

TekFell · July 1, 2021, 12:37pm

Korean-GPT

There are already published Korean GPT2 models, but license restrictions limit their use in industry. Recently, a large number of South Korean public natural language datasets, such as the KLUE dataset, the AI-Hub datasets, and Modu Corpus, have been released. My goal is to train a Korean GPT2 model that anyone can use for any purpose, using only large publicly accessible datasets, including these recent releases.

2. Language

The model will be trained on Korean text.

3. Model

GPT2

4. Datasets

KLUE Dataset
KorQuAD Dataset
Modu Corpus(모두의 말뭉치)
Korean Text Datasets from AI-hub
+ Maybe extra publicly accessible Korean datasets…

5. Training scripts

A causal language modeling script for Flax is available here
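
For readers unfamiliar with what such a script does before training: causal language modeling scripts typically concatenate the tokenized documents and cut them into fixed-length blocks, with the labels being a copy of the inputs (the one-token shift is applied when the loss is computed). Below is a minimal pure-Python sketch of that step, not the linked script itself. The names `group_texts` and `block_size` and the toy token ids are assumptions for illustration.

```python
from itertools import chain
from typing import Dict, List


def group_texts(tokenized_docs: List[List[int]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate token-id lists and split them into blocks of block_size.

    The trailing remainder that does not fill a whole block is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    concatenated = list(chain.from_iterable(tokenized_docs))
    usable = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i:i + block_size] for i in range(0, usable, block_size)]
    # For causal LM the labels are a copy of the inputs;
    # the shift by one position happens inside the loss computation.
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # hypothetical token ids
    batch = group_texts(toy_docs, block_size=4)
    print(batch["input_ids"])
```

Dropping the remainder keeps every block the same length, which suits fixed-shape compilation on accelerators; padding the last block would be an alternative if no text should be lost.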

6. Challenges

Each of the public Korean datasets comes in a different format, so preprocessing them into a form suitable for pre-training will take time and effort.
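
One way to picture this preprocessing: each corpus stores its text under a different schema, so a small adapter per dataset can map records onto one plain-text format before shared cleanup. The sketch below uses hypothetical record layouts and field names (`"text"`, `"context"`, `"utterances"`); the real KLUE, KorQuAD, Modu Corpus, and AI-Hub schemas would need to be checked against each dataset's documentation.

```python
import unicodedata
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def clean_text(text: str) -> str:
    """NFC-normalize (composes Hangul jamo into syllables) and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


# Hypothetical adapters: each maps one record of a given layout to raw text pieces.
ADAPTERS: Dict[str, Callable[[dict], List[str]]] = {
    "plain": lambda rec: [rec["text"]],
    "qa": lambda rec: [rec["context"]],
    "dialogue": lambda rec: [" ".join(rec["utterances"])],
}


def build_corpus(records: Iterable[Tuple[str, dict]], min_chars: int = 10) -> Iterator[str]:
    """Yield cleaned, exactly-deduplicated texts from (layout, record) pairs."""
    seen = set()
    for layout, rec in records:
        for raw in ADAPTERS[layout](rec):
            text = clean_text(raw)
            # Skip very short fragments and exact duplicates across datasets.
            if len(text) < min_chars or text in seen:
                continue
            seen.add(text)
            yield text
```

Exact-match deduplication is only a starting point; overlapping sources such as news or encyclopedia text may call for near-duplicate detection as well.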

ben9004 · July 1, 2021, 12:54pm

I think this will be an amazing challenge! I wish I could join this project!

patrickvonplaten · July 1, 2021, 11:55pm

Awesome, finalizing this project!

TekFell · July 16, 2021, 4:29pm

Important!

A correction to the dataset list above: the dataset I incorrectly labeled as "Modu Corpus" has the official name "NIKL Corpus (국립국어원 모두의 말뭉치)".
