T5 model does not recognize accented characters (e.g., à, ê, í, ó, ô)

Hi,

I have been able to fine-tune a T5 model for an SQL-to-natural-language task; however, the natural language was English.

Now I am trying to fine-tune a T5 model to split a word into syllables, for example:

joão → jo|ão|

camarão → ca|ma|rão|
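
For reference, a minimal sketch of how input/target pairs in this format could be built in Python, assuming the target is simply the word with a `|` after each syllable, as in the two examples above. The helper name `make_pair` is made up for illustration.

```python
from typing import Tuple


def make_pair(syllabified: str) -> Tuple[str, str]:
    """Return (input word, target) from a target string such as 'jo|ão|'."""
    # The model input is the target with the syllable separators removed.
    word = syllabified.replace("|", "")
    return word, syllabified


# The two examples from this post.
pairs = [make_pair(t) for t in ["jo|ão|", "ca|ma|rão|"]]
for word, target in pairs:
    print(f"{word} -> {target}")
```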

The problem is that, since the t5-base model does not recognize accented characters, it outputs ?? instead of ã.
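
As a toy illustration of this kind of failure (this is not the actual T5 tokenizer): a tokenizer with a fixed vocabulary maps any character it does not contain to a single unknown token, so the original character cannot be recovered when decoding. The vocabulary and the `<unk>` symbol below are assumptions made up for the example.

```python
UNK = "<unk>"
# Toy vocabulary: unaccented lowercase letters only (an assumption for the demo).
VOCAB = set("abcdefghijklmnopqrstuvwxyz")


def encode(word):
    """Map each character to itself if known, otherwise to the unknown token."""
    return [ch if ch in VOCAB else UNK for ch in word]


def decode(tokens):
    """Join tokens back into text; unknown tokens can only be shown as '?'."""
    return "".join("?" if t == UNK else t for t in tokens)


tokens = encode("joão")
print(tokens)
print(decode(tokens))
```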

I have also tried mT5 (based on the GitHub repository from this article: https://towardsdatascience.com/how-to-train-an-mt5-model-for-translation-with-simple-transformers-30ba5fa66c5f). Although mT5 can handle the non-English characters, it seems (though I am not 100% sure) that the model only uses the vocabulary it has seen during training. Can anyone confirm this? If so, it is useless in my case, because the model I am trying to train should be able to split unseen words into syllables.

If anyone can point out a solution to my problem, I would be very grateful. T5 would do the job if it recognized â, ó, ã, ê, í, etc.
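
One way this could be checked, sketched below with only the standard library: split the data by word so that no evaluation word appears in training, then look at the model's output on the held-out words only. The function name `split_by_word` and the split fraction are illustrative choices, not something taken from the article.

```python
import random


def split_by_word(pairs, eval_fraction=0.2, seed=0):
    """Split (word, target) pairs so that train and eval share no words."""
    words = sorted({word for word, _ in pairs})
    random.Random(seed).shuffle(words)
    # Always hold out at least one word.
    n_eval = max(1, int(len(words) * eval_fraction))
    eval_words = set(words[:n_eval])
    train = [p for p in pairs if p[0] not in eval_words]
    held_out = [p for p in pairs if p[0] in eval_words]
    return train, held_out


# Only the two examples from this post; a real dataset would be much larger.
pairs = [("joão", "jo|ão|"), ("camarão", "ca|ma|rão|")]
train, held_out = split_by_word(pairs, eval_fraction=0.5)
print(len(train), len(held_out))
```

If the fine-tuned model still splits the held-out words correctly, it is not limited to words seen during training.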

Thanks a lot

Hi Muradean,
I’m facing the same problem. If anyone can help with this, it would be very helpful.