Pretrain T5 from scratch in Chinese

Currently there is no pretrained T5 model for the Chinese language on the Hugging Face Hub. The goal is to pretrain a T5-base model on Chinese text and fine-tune it on the Chinese QA dataset CMRC2018 to see how its performance compares to non-T5 models.

Model
A randomly initialized T5-base model
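
For concreteness, a randomly initialized model can be built from a config rather than from pretrained weights. A minimal sketch; the vocab_size of 32000 is an assumed placeholder pending tokenizer training:

```python
from transformers import T5Config, FlaxT5ForConditionalGeneration

# Reuse the t5-base architecture hyperparameters, but override the vocabulary
# size to match a Chinese tokenizer (32000 is an assumed placeholder here).
config = T5Config.from_pretrained("t5-base", vocab_size=32000)

# Constructing a Flax model from a config alone yields randomly initialized weights.
model = FlaxT5ForConditionalGeneration(config, seed=42)
```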

Datasets
CLUECorpus2020
OSCAR
mC4
CMRC2018 (for fine-tuning; see the loading sketch after this list)
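
As a rough sketch of how the corpora could be loaded with the datasets library: the OSCAR and mC4 config names below are the Chinese subsets on the Hub, while CLUECorpus2020 may require a manual download (I am not aware of a ready-made Hub dataset for it):

```python
from datasets import load_dataset

# Chinese subsets of the pretraining corpora; streaming avoids downloading
# hundreds of GB up front.
oscar_zh = load_dataset("oscar", "unshuffled_deduplicated_zh", split="train", streaming=True)
mc4_zh = load_dataset("mc4", "zh", split="train", streaming=True)

# CMRC2018 span-extraction QA dataset for fine-tuning / evaluation.
cmrc = load_dataset("cmrc2018")
```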

Available training scripts
We will be using the standard T5 pretraining script from Hugging Face (run_t5_mlm_flax.py from the Flax language-modeling examples in Transformers), which implements T5's span-corruption objective.

(Optional) Desired project outcome

A Chinese T5 model that is competitive on Chinese QA with non-T5 models of similar size and speed (RoBERTa, GPT, etc.).

(Optional) Challenges

Chinese text is not whitespace-segmented, so tokenization may need special treatment compared to spaced languages like English (e.g., training a dedicated subword tokenizer on Chinese text; see the sketch below).
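
A minimal sketch of one possible treatment: training a SentencePiece unigram tokenizer directly on raw Chinese text, which sidesteps word segmentation entirely. The corpus path and vocab size are assumed placeholders:

```python
import sentencepiece as spm

# Unigram subword model trained on raw, unsegmented Chinese text; SentencePiece
# treats the input as a character stream, so no word segmentation is required.
spm.SentencePieceTrainer.train(
    input="zh_corpus.txt",       # hypothetical path to the raw pretraining text
    model_prefix="zh_t5_sp",
    vocab_size=32000,            # assumed; must match the model config
    model_type="unigram",        # the model family T5's tokenizer uses
    character_coverage=0.9995,   # high coverage is recommended for CJK scripts
)
```

The resulting zh_t5_sp.model file could then be passed to T5Tokenizer via its vocab_file argument.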


sounds good - let’s define it!