Pretrain T5 from scratch in Chinese

Currently there is no pretrained T5 model for the Chinese language on the Hugging Face Hub. The goal is to pretrain a T5-base model on Chinese text and fine-tune it on the Chinese QA dataset CMRC2018 to see how its performance compares to non-T5 models.

Model
A randomly initialized T5-base model
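
For concreteness, a randomly initialized model can be built from a config rather than from pretrained weights. A minimal sketch; the vocab_size of 32000 is an assumed placeholder pending tokenizer training:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base architecture hyperparameters, but override the vocabulary
# size to match a Chinese tokenizer (32000 is an assumed placeholder here).
config = T5Config.from_pretrained("t5-base", vocab_size=32000)

# Constructing a Flax model from a config alone yields randomly initialized weights.
model = FlaxT5ForConditionalGeneration(config, seed=42)
```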

Datasets
CLUECorpus2020
OSCAR
mC4
CMRC2018 (for fine-tuning; see the loading sketch after this list)
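
As a rough sketch of how the corpora could be loaded with the datasets library: the OSCAR and mC4 config names below are the Chinese subsets on the Hub, while CLUECorpus2020 may require a manual download (I am not aware of a ready-made Hub dataset for it):

```python
from datasets import load_dataset

# Chinese subsets of the pretraining corpora; streaming avoids downloading
# hundreds of GB up front.
oscar_zh = load_dataset("oscar", "unshuffled_deduplicated_zh", split="train", streaming=True)
mc4_zh = load_dataset("mc4", "zh", split="train", streaming=True)

# CMRC2018 span-extraction QA dataset for fine-tuning / evaluation.
cmrc = load_dataset("cmrc2018")
```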

Available training scripts
We will be using the standard T5 pretraining script from Hugging Face (run_t5_mlm_flax.py from the Flax language-modeling examples in Transformers), which implements T5's span-corruption objective.

(Optional) Desired project outcome

A Chinese T5 model that is competitive on Chinese QA with non-T5 models of similar size and speed (RoBERTa, GPT, etc.).

(Optional) Challenges

Chinese text is not whitespace-segmented, so tokenization may need special treatment compared to spaced languages like English (e.g., training a dedicated subword tokenizer on Chinese text; see the sketch below).
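
A minimal sketch of one possible treatment: training a SentencePiece unigram tokenizer directly on raw Chinese text, which sidesteps word segmentation entirely. The corpus path and vocab size are assumed placeholders:

```python
import sentencepiece as spm

# Unigram subword model trained on raw, unsegmented Chinese text; SentencePiece
# treats the input as a character stream, so no word segmentation is required.
spm.SentencePieceTrainer.train(
    input="zh_corpus.txt",       # hypothetical path to the raw pretraining text
    model_prefix="zh_t5_sp",
    vocab_size=32000,            # assumed; must match the model config
    model_type="unigram",        # the model family T5's tokenizer uses
    character_coverage=0.9995,   # high coverage is recommended for CJK scripts
)
```

The resulting zh_t5_sp.model file could then be passed to T5Tokenizer via its vocab_file argument.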


sounds good - let’s define it!