Can I retrain the GPT-2 tokeniser on Chinese data and use it with a pre-trained GPT-2 XL (or another model) to create a Chinese-speaking model?

Hi everyone,

I’m trying to retrain the GPT-2 tokeniser on a large Chinese corpus to adapt it to the Chinese language. My goal is to use this retrained tokeniser with a pre-trained GPT-2 XL (or another pre-trained model) to create a Chinese-speaking model.
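For context, this is roughly how I’m retraining the tokeniser. It’s only a sketch: the corpus path and batch size are placeholders I made up, and I’m assuming the Hugging Face `train_new_from_iterator` helper is the right tool for this.

```python
from transformers import GPT2TokenizerFast

# Start from the original GPT-2 byte-level BPE tokeniser.
old_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# "chinese_corpus.txt" is a placeholder path to my Chinese text,
# one document per line.
def corpus_iterator(path="chinese_corpus.txt", batch_size=1000):
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Train a new BPE vocabulary on the Chinese corpus, keeping the same
# tokenisation algorithm and vocabulary size (50257) as the original GPT-2.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=50257
)
new_tokenizer.save_pretrained("gpt2-tokenizer-zh")
```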

My question is: will this approach work? Can I simply swap the retrained tokeniser into a pre-trained GPT-2 XL and use the existing weights as-is, or are there additional steps I need to take before the model can understand and generate Chinese text?

I’ve tried searching for similar questions and tutorials, but I couldn’t find any clear guidance on this specific topic. I’d appreciate any insights or advice from the community on whether this approach is feasible and what additional steps I might need to take to make it work.

Some specific questions I have are:

  • Will the retrained tokeniser be compatible with the pre-trained GPT-2 XL or other models? (My current guess at the minimum wiring is in the sketch after this list.)
  • Are there any specific preprocessing steps I need to take when working with Chinese text?
  • Are there any known issues or limitations when using a retrained tokeniser with pre-trained models?
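On the compatibility question, my understanding so far is that I’d at least need to resize the model’s embedding layers to match the new vocabulary, roughly as below. This is only a sketch of what I think is involved, not a working recipe: the tokeniser directory name comes from the earlier sketch, and I’m unsure whether the pre-trained embedding rows remain meaningful once the token IDs map to completely different strings.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the pre-trained GPT-2 XL weights and my retrained Chinese tokeniser
# (saved to "gpt2-tokenizer-zh" in the sketch above).
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-tokenizer-zh")

# Make the input/output embedding matrices match the new vocabulary size.
# If the new vocab is also 50257 this is a no-op shape-wise, but each row
# still corresponds to an *old* token, which is exactly what I'm unsure about.
model.resize_token_embeddings(len(tokenizer))

# Quick smoke test: tokenise a Chinese sentence and run a forward pass.
inputs = tokenizer("你好，世界", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```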

Any help or guidance would be greatly appreciated!