I tested all Flan-T5 tokenizers on the FLORES-200 dataset and found that they don't work for many of the languages listed as supported on the model card.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# for example sentence No. 30 from flores200 devtest - eng_Latn.devtest - The find also grants insight into the evolution of feathers in birds.
input_texts = {
"Arabic":"يمنح الاكتشاف أيضاً نظرة على تطور الريش في الطيور.",
"Korean":"그러한 발견은 또한 조류에 있어서 깃털의 진화에 대한 통찰을 제공한다.",
"Lithuanian":"Šis atradimas taip pat suteikia įžvalgų apie plunksnų evoliuciją paukščiuose.",
"Russian":"Находка также позволяет ознакомиться с эволюцией перьев у птиц.",
"Greek":"Το εύρημα παρέχει επίσης πληροφορίες για την εξέλιξη του φτερώματος στα πτηνά.",
"Persian":"این یافته همچنین شناخت ما را در مورد تکامل پرها در پرندگان بیشتر میکند.",
"Hebrew":"הממצא גם נותן תובנות בנוגע לאבולוציה של נוצות אצל ציפורים.",
# etc...
}
for lang in input_texts:
    output = tokenizer.encode(input_texts[lang], add_special_tokens=True, return_tensors="pt")
    print(lang, "\n\t", output[0], "\n\t", tokenizer.decode(output[0]), "\n")
Outputs:
Arabic
tensor([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 5, 1])
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.</s>
Korean
tensor([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 5, 1])
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.</s>
Lithuanian
tensor([ 3, 2, 159, 44, 5883, 2754, 3, 17, 9, 23,
102, 6234, 21285, 9069, 9, 3, 2, 2165, 122, 2,
3, 9, 8082, 4752, 6513, 7, 29, 2, 3, 15,
4571, 23, 6809, 354, 2, 2576, 1598, 2, 23, 76,
32, 7, 15, 5, 1])
<unk> is atradimas taip pat suteikia <unk> valg<unk> apie plunksn<unk> evoliucij<unk> pauk<unk> iuose.</s>
Russian
tensor([ 3, 2, 2533, 2, 17238, 12095, 3, 15517, 6652, 2,
1757, 3, 2, 2044, 2, 6609, 17674, 2, 15042, 3,
2044, 2, 8194, 12377, 21325, 6725, 2, 5345, 2, 12681,
3, 2, 6609, 17674, 2, 2795, 1757, 2, 3, 2,
17657, 2, 1757, 6609, 3, 3700, 3, 2, 18352, 2,
5, 1])
<unk> а<unk> одка так<unk> е <unk> о<unk> вол<unk> ет о<unk> накомит<unk> с<unk> с <unk> вол<unk> ие<unk> <unk> ер<unk> ев у <unk> ти<unk>.</s>
Greek
tensor([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,
5, 1])
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.</s>
Persian
tensor([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2,
3, 2, 3, 2, 3, 2, 5, 1])
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.</s>
Hebrew
tensor([3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 5, 1])
<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.</s>
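To quantify the breakage rather than eyeball the decoded strings, a small helper can report the fraction of `<unk>` ids per sentence. This is a sketch; `unk_fraction` is a hypothetical helper, and it relies on the T5 SentencePiece convention that `<unk>` has id 2 (visible in the decoded outputs above).

```python
def unk_fraction(token_ids, unk_id=2):
    """Fraction of token ids mapped to the unknown token."""
    return sum(1 for i in token_ids if i == unk_id) / max(len(token_ids), 1)

# The Arabic output above: every word piece became <unk>;
# only the space piece (id 3), "." (id 5), and </s> (id 1) survive.
arabic_ids = [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 5, 1]
print(unk_fraction(arabic_ids[:-2]))  # 0.5 — i.e. all actual content tokens
```

Run over all FLORES-200 devtest sentences, this gives a per-language score that directly contradicts the model card's language list.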
The mT5 tokenizer works fine on the same examples:
tokenizer = T5Tokenizer.from_pretrained("google/mt5-xl")
...
Outputs:
Arabic
tensor([ 259, 477, 76466, 402, 61441, 67182, 1021, 17569, 1093,
4660, 445, 259, 942, 766, 3772, 402, 201949, 575,
9554, 110474, 260, 1])
يمنح الاكتشاف أيضاً نظرة على تطور الريش في الطيور.</s>
Korean
tensor([ 259, 19393, 839, 11792, 17284, 869, 259, 7073, 839,
7830, 14567, 873, 3083, 39514, 259, 212777, 111731, 649,
19420, 3257, 873, 9153, 8004, 39049, 611, 17054, 8535,
260, 1])
그러한 발견은 또한 조류에 있어서 깃털의 진화에 대한 통찰을 제공한다.</s>
Lithuanian
tensor([ 9117, 263, 9928, 179821, 259, 10386, 3777, 517, 55744,
262, 1700, 117403, 1014, 259, 8138, 421, 65166, 263,
35708, 259, 78500, 273, 130242, 555, 101788, 225010, 260,
1])
Šis atradimas taip pat suteikia įžvalgų apie plunksnų evoliuciją paukščiuose.</s>
Russian
tensor([ 1051, 9240, 679, 922, 1108, 11426, 12960, 259, 80281,
5477, 388, 3604, 223753, 1011, 22435, 40927, 456, 259,
68838, 260, 1])
Находка также позволяет ознакомиться с эволюцией перьев у птиц.</s>
Greek
tensor([ 3441, 259, 11025, 233352, 7125, 3896, 6647, 4945, 29515,
4714, 901, 259, 640, 5011, 72670, 8029, 694, 7270,
51065, 67106, 4078, 1445, 640, 1172, 260, 1])
Το εύρημα παρέχει επίσης πληροφορίες για την εξέλιξη του φτερώματος στα πτηνά.</s>
Persian
tensor([ 953, 259, 14594, 376, 1373, 3054, 11805, 259, 48063, 1415,
916, 509, 259, 7352, 6077, 20473, 1197, 913, 509, 1197,
33244, 259, 11732, 822, 5606, 260, 1])
این یافته همچنین شناخت ما را در مورد تکامل پرها در پرندگان بیشتر می کند.</s>
Hebrew
tensor([ 2257, 13535, 259, 2730, 259, 4075, 936, 7838, 55209,
26114, 14243, 1282, 35043, 109702, 580, 752, 32830, 19123,
882, 14425, 17169, 94437, 260, 1])
הממצא גם נותן תובנות בנוגע לאבולוציה של נוצות אצל ציפורים.</s>
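The likely cause is that the Flan-T5 SentencePiece vocabulary simply contains no pieces with Arabic, Korean, Greek, Persian, or Hebrew characters, so every token in those scripts falls through to `<unk>`. A quick way to check this (sketch; `script_coverage` is a hypothetical helper, which you would feed the pieces from `tokenizer.get_vocab()`):

```python
import unicodedata

def script_coverage(vocab_pieces, sample_text):
    """Fraction of letters in sample_text that occur in at least one vocab piece."""
    chars = {c for c in sample_text
             if not c.isspace() and unicodedata.category(c).startswith("L")}
    covered = {c for c in chars if any(c in piece for piece in vocab_pieces)}
    return len(covered) / max(len(chars), 1)

# Toy illustration; in practice pass tokenizer.get_vocab().keys():
print(script_coverage(["he", "llo", "▁"], "hello"))  # 1.0 — every letter covered
print(script_coverage(["he", "llo", "▁"], "שלום"))   # 0.0 — no Hebrew pieces at all
```

If the coverage for a script is zero, no amount of fine-tuning can make the model handle that language, since the input never reaches the embeddings.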
@olivierdehaene @ybelkada please check and fix the tokenizer, or correct the model cards and labels, which currently claim support for 60 languages.