ges and sato are likely separate tokens, so I think you may be chasing a red herring. What do your loss curves (train and validation) look like? What other metrics do you log? Which hyperparameters have you experimented with? Have you tried other models too, such as XLM-RoBERTa?
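
If you want to confirm the tokenization point rather than take my word for it, print how the string is actually split. Below is a minimal sketch, not your model's tokenizer: `greedy_subword_split` is a toy longest-match splitter, and the vocabulary and the input string `"gessato"` are made up for illustration. `is_single_token` takes any tokenize function, so you can pass your real one instead.

```python
from typing import Callable, Iterable, List


def greedy_subword_split(text: str, vocab: Iterable[str], unk: str = "[UNK]") -> List[str]:
    """Toy greedy longest-match splitter (illustrative only, not a real model's tokenizer)."""
    vocab_set = set(vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab_set:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: emit an unknown token for one character.
            tokens.append(unk)
            i += 1
    return tokens


def is_single_token(text: str, tokenize: Callable[[str], List[str]]) -> bool:
    """Return True if the given tokenize function maps text to exactly one token."""
    return len(tokenize(text)) == 1


# Hypothetical vocabulary and input, chosen only to illustrate the split.
toy_vocab = ["ges", "sato"]
pieces = greedy_subword_split("gessato", toy_vocab)
print(pieces)
print(is_single_token("gessato", lambda s: greedy_subword_split(s, toy_vocab)))
```

Assuming you are using Hugging Face `transformers`, the `tokenize` method of the tokenizer loaded for your checkpoint returns a list of token strings and can be passed as the `tokenize` argument. If the string comes back as several pieces, the model never sees it as one unit.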