Pretrain GPT-2 from scratch in Thai

GPT2 for Thai

The goal is to create a strong language generation model for Thai :thailand:
Since the initial code and data come largely from @patrickvonplaten and other Hugging Face members, getting a first working version should not be too hard.


Randomly initialized GPT2 model


We can use OSCAR, which is available through the `datasets` library.


A causal language modeling script for Flax is available here. It can be used pretty much without any code changes.
If there is time left, I'd love to try some private crawling and integrate the result into `datasets`.
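Assuming the script referenced above is `run_clm_flax.py` from the `transformers` Flax examples, a launch could look roughly like this; the output directory, tokenizer path, and hyperparameters are placeholder assumptions:

```shell
python run_clm_flax.py \
    --output_dir ./gpt2-base-thai \
    --model_type gpt2 \
    --config_name ./gpt2-base-thai \
    --tokenizer_name ./gpt2-base-thai \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_th \
    --do_train --do_eval \
    --block_size 512 \
    --per_device_train_batch_size 64 \
    --learning_rate 5e-3 \
    --num_train_epochs 10
```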

Expected Outcome

A Thai text generation model that produces understandable output


Lack of data → The OSCAR dataset might be too small (it has < 20GB of data for Thai). It would be better to collect additional data and accumulate a bigger dataset.


Team Members