Can I retrain the GPT-2 tokeniser on Chinese data and use it with a pre-trained GPT-2 XL (or another model) to create a Chinese-speaking model?

Hi everyone,

I’m trying to retrain the GPT-2 tokeniser on a large Chinese corpus to adapt it to the Chinese language. My goal is to use this retrained tokeniser with a pre-trained GPT-2 XL (or another pre-trained model) to create a Chinese-speaking model.
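For context, this is roughly how I’m retraining the tokeniser. It’s only a sketch: the corpus path and batch size are placeholders I made up, and I’m assuming the Hugging Face `train_new_from_iterator` helper is the right tool for this.

```python
from transformers import GPT2TokenizerFast

# Start from the original GPT-2 byte-level BPE tokeniser.
old_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# "chinese_corpus.txt" is a placeholder path to my Chinese text,
# one document per line.
def corpus_iterator(path="chinese_corpus.txt", batch_size=1000):
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Train a new BPE vocabulary on the Chinese corpus, keeping the same
# tokenisation algorithm and vocabulary size (50257) as the original GPT-2.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=50257
)
new_tokenizer.save_pretrained("gpt2-tokenizer-zh")
```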

My question is: will this approach work? Can I simply swap the retrained tokeniser into a pre-trained GPT-2 XL and use the existing weights as-is, or are there additional steps I need to take before the model can understand and generate Chinese text?

I’ve tried searching for similar questions and tutorials, but I couldn’t find any clear guidance on this specific topic. I’d appreciate any insights or advice from the community on whether this approach is feasible and what additional steps I might need to take to make it work.

Some specific questions I have are:

  • Will the retrained tokeniser be compatible with the pre-trained GPT-2 XL or other models? (My current guess at the minimum wiring is in the sketch after this list.)
  • Are there any specific preprocessing steps I need to take when working with Chinese text?
  • Are there any known issues or limitations when using a retrained tokeniser with pre-trained models?
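On the compatibility question, my understanding so far is that I’d at least need to resize the model’s embedding layers to match the new vocabulary, roughly as below. This is only a sketch of what I think is involved, not a working recipe: the tokeniser directory name comes from the earlier sketch, and I’m unsure whether the pre-trained embedding rows remain meaningful once the token IDs map to completely different strings.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the pre-trained GPT-2 XL weights and my retrained Chinese tokeniser
# (saved to "gpt2-tokenizer-zh" in the sketch above).
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-tokenizer-zh")

# Make the input/output embedding matrices match the new vocabulary size.
# If the new vocab is also 50257 this is a no-op shape-wise, but each row
# still corresponds to an *old* token, which is exactly what I'm unsure about.
model.resize_token_embeddings(len(tokenizer))

# Quick smoke test: tokenise a Chinese sentence and run a forward pass.
inputs = tokenizer("你好，世界", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```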

Any help or guidance would be greatly appreciated!