Hi everyone,
I’m trying to retrain the GPT-2 tokeniser on a large Chinese corpus so that it segments Chinese text efficiently. My goal is to pair the retrained tokeniser with a pre-trained GPT-2 XL (or another pre-trained model) to create a Chinese-speaking model.
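For context, this is roughly the retraining step I had in mind. It’s only a sketch, assuming the HuggingFace `transformers`/`datasets` libraries and their `train_new_from_iterator` API; the corpus file name and vocabulary size are just placeholders:

```python
# Sketch of the tokeniser retraining step, assuming HuggingFace transformers
# and datasets; the corpus path and vocab size are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from the original GPT-2 byte-level BPE tokeniser.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A hypothetical plain-text Chinese corpus with one document per line.
dataset = load_dataset("text", data_files={"train": "chinese_corpus.txt"})["train"]

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for the tokeniser trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Learn a new BPE vocabulary from the Chinese data while keeping GPT-2's
# tokenisation algorithm and special tokens.
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=50257)
new_tokenizer.save_pretrained("gpt2-chinese-tokenizer")
```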
My question is: will this approach work? Can I simply swap the retrained tokeniser into a pre-trained GPT-2 XL (or another model), or are there additional steps I need to take so the model can understand and generate Chinese text?
I’ve searched for similar questions and tutorials but couldn’t find clear guidance on this specific topic, so I’d appreciate any insights from the community on whether this approach is feasible and what extra steps it might require.
Some specific questions I have are:
- Will the retrained tokeniser be compatible with the pre-trained GPT-2 XL or other models? (The sketch after this list shows how I planned to connect them.)
- Are there any specific preprocessing steps I need to take when working with Chinese text?
- Are there any known issues or limitations when using a retrained tokeniser with pre-trained models?
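For reference, here is how I was planning to plug the retrained tokeniser into the pre-trained model. Again, just a sketch under the same HuggingFace transformers assumption; the tokeniser directory is the placeholder saved in the previous snippet:

```python
# Sketch of attaching the retrained tokeniser to a pre-trained GPT-2 XL,
# assuming the HuggingFace transformers API; "gpt2-chinese-tokenizer" is the
# placeholder directory saved in the previous snippet.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-chinese-tokenizer")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# The new vocabulary no longer lines up with the pre-trained embedding rows,
# so the embedding matrix at least has to be resized before any fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

Is resizing the embeddings enough here, or does changing the vocabulary effectively throw away what the pre-trained embeddings learned, meaning I’d need substantial further training on Chinese text?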
Any help or guidance would be greatly appreciated!