T5 model does not recognize characters (e.g: à, ê, í, ó, ô)

Muradean · May 6, 2021, 1:54am

Hi,

I was able to fine-tune a T5 model for an SQL-to-natural-language task; however, the natural language was English.

Now I am trying to fine-tune a T5 model to split a word into syllables, for example:

joão → jo|ão|

camarão → ca|ma|rão|
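For reference, here is a minimal sketch of how pairs like the ones above could be built and checked. The helper names `make_pair` and `is_consistent` are hypothetical, and the assumption that every syllable (including the last) ends with `|` is taken from the two examples only.

```python
import unicodedata


def make_pair(syllabified: str) -> tuple[str, str]:
    """Turn a target such as 'jo|ão|' into an (input word, target) pair."""
    word = syllabified.replace("|", "")
    return word, syllabified


def is_consistent(word: str, output: str) -> bool:
    """Check that a model output is a plausible syllabification of `word`.

    Assumed format (from the examples): each syllable is followed by '|',
    so removing every '|' must give back the original word.
    """
    # Normalize so that composed and decomposed accents compare equal.
    word = unicodedata.normalize("NFC", word)
    output = unicodedata.normalize("NFC", output)
    if not output.endswith("|") or output.startswith("|") or "||" in output:
        return False
    return output.replace("|", "") == word
```

A check like `is_consistent("joão", prediction)` fails whenever the model drops or replaces an accented character, which makes that kind of error easy to count during evaluation.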

The problem is that, since the t5-base model does not recognize characters with accents, it outputs ?? instead of ã.
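One way to see which characters a tokenizer cannot represent is to encode a word, decode it again, and compare. The sketch below is only illustrative: `lost_characters` and `lost_with_transformers` are hypothetical helper names, and the second function assumes the `transformers` package (plus the tokenizer backend the model needs) is installed and that the model name can be downloaded.

```python
import unicodedata


def lost_characters(word, encode, decode):
    """Return the characters of `word` that do not survive encode -> decode.

    `encode` maps a string to token ids and `decode` maps ids back to a
    string; both are passed in so that any tokenizer can be plugged in.
    """
    restored = decode(encode(word))
    original = unicodedata.normalize("NFC", word)
    restored = unicodedata.normalize("NFC", restored)
    return sorted(set(original) - set(restored))


def lost_with_transformers(model_name, words):
    """Run the round-trip check with a Hugging Face tokenizer (assumed installed)."""
    from transformers import AutoTokenizer  # imported lazily on purpose

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def decode(ids):
        # Special tokens (including the unknown token) are dropped here, so an
        # unsupported character simply goes missing from the restored text.
        return tokenizer.decode(ids, skip_special_tokens=True)

    return {w: lost_characters(w, tokenizer.encode, decode) for w in words}
```

For example, `lost_with_transformers("t5-base", ["joão", "camarão"])` would list, for each word, the characters that the tokenizer maps to its unknown token. Running the same call with an mT5 checkpoint name should show whether that vocabulary covers them.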

I have also tried mT5 (based on the GitHub repository from this article: https://towardsdatascience.com/how-to-train-an-mt5-model-for-translation-with-simple-transformers-30ba5fa66c5f). mT5 can handle the non-English characters, but it seems, although I am not 100% sure, that the model only uses the vocabulary it has seen during training. Can anyone confirm this? If so, it is useless in my case, because the model I am trying to train should be able to split unseen words into syllables.

If anyone can point out a solution to my problem, I would be very grateful. T5 would do the job if it recognized â, ó, ã, ê, í, etc.
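Whether a fine-tuned model handles unseen words can be checked empirically by holding out whole words from training and scoring only those. The following is a rough sketch under that assumption; `split_unseen`, `exact_match`, and the `predict` callable are hypothetical names, not part of any library.

```python
import random


def split_unseen(pairs, holdout_fraction=0.2, seed=0):
    """Split (word, target) pairs so that held-out words never appear in training."""
    words = sorted({word for word, _ in pairs})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(words)
    n_holdout = max(1, int(len(words) * holdout_fraction))
    held_out = set(words[:n_holdout])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test


def exact_match(predict, test_pairs):
    """Fraction of held-out words for which `predict(word)` equals the target."""
    if not test_pairs:
        return 0.0
    hits = sum(1 for word, target in test_pairs if predict(word) == target)
    return hits / len(test_pairs)
```

Here `predict` stands for whatever function wraps the fine-tuned model's generation step. A score on the held-out words that is much lower than the score on training words would support the suspicion that the model is memorizing rather than generalizing.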

Thanks a lot

tgh · May 28, 2022, 8:18pm

Hi Muradean,
I’m facing the same problem. If anyone can help with this, I would really appreciate it.
