Pretrain T5 for Chinese
Currently there is no T5 model pretrained for Chinese on the Hugging Face Hub. The goal is to pretrain a T5-base model on Chinese text and fine-tune it on the Chinese QA dataset CMRC2018 to compare its performance against other, non-T5 models.
Model
A randomly initialized T5 model
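Since no pretrained Chinese checkpoint exists, the model starts from random weights. A minimal sketch with Transformers follows; the `vocab_size` here is a placeholder assumption, since the real value will be fixed by the Chinese tokenizer we train:

```python
from transformers import T5Config, T5ForConditionalGeneration

# t5-base-sized architecture; vocab_size=32000 is a placeholder until the
# Chinese tokenizer is trained.
config = T5Config(
    vocab_size=32000,
    d_model=768,
    d_ff=3072,
    num_layers=12,
    num_heads=12,
)

# No from_pretrained() call: weights are randomly initialized.
model = T5ForConditionalGeneration(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```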
Datasets
CLUECorpus2020
OSCAR
mC4
CMRC2018
Available training scripts
We will use the standard T5 pretraining script from the Hugging Face Transformers examples.
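As a sketch, pretraining could be launched with the Flax T5 MLM example script shipped with Transformers (`examples/flax/language-modeling/run_t5_mlm_flax.py`). The paths, dataset config, and hyperparameters below are assumptions for illustration, not tuned values:

```shell
# Hypothetical invocation; output dir and hyperparameters are placeholders.
python run_t5_mlm_flax.py \
    --output_dir ./chinese-t5-base \
    --model_type t5 \
    --config_name ./chinese-t5-base \
    --tokenizer_name ./chinese-t5-base \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_zh \
    --max_seq_length 512 \
    --per_device_train_batch_size 16 \
    --learning_rate 0.005 \
    --warmup_steps 2000 \
    --num_train_epochs 1
```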
(Optional) Desired project outcome
A Chinese T5 model that is competitive on Chinese QA datasets with non-T5 models of similar size and speed (RoBERTa, GPT, etc.).
(Optional) Challenges
Chinese text is not separated by spaces, so tokenization may need special treatment compared to space-delimited languages such as English.
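The issue can be seen with plain string operations: whitespace-based pre-tokenization, which works for English, collapses a whole Chinese sentence into one "word". A subword tokenizer trained on Chinese text (T5 normally uses a SentencePiece-style model, which does not rely on spaces) is one way around this; the snippet below only illustrates the problem and a naive character-level fallback:

```python
# Whitespace segmentation works for English but fails for Chinese.
en = "The weather is nice today"
zh = "今天天气很好"  # "The weather is nice today" -- no spaces between words

print(en.split())  # five tokens
print(zh.split())  # a single token: whitespace carries no word boundaries
print(list(zh))    # character-level split is one simple fallback
```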