SentencePiece tokenizer or mBART-50

Hi, I'm building a seq2seq model with attention for multilingual translation. I trained a SentencePiece tokenizer on my data, but when I pass unseen sentences for translation after training, I get poor translations like these:

Source: good morning
Translation: bonne

Source: give me your email
Translation: donne-moi ton email

Source: give me your email please
Translation: donne-moi ton e sil vous plaît

I'm using the SentencePiece tokenizer like this:

```python
import sentencepiece as spm

# Train the SentencePiece tokenizer
spm.SentencePieceTrainer.train(
    input='spm_training_data.txt',
    model_prefix='multilingual_tokenizer',
    vocab_size=32000,
    character_coverage=0.9995,
    model_type='bpe',
    bos_id=1,
    eos_id=2,
    pad_id=0,
    unk_id=3,
    user_defined_symbols=["<en_US>", "<fr_FR>"],
    split_by_whitespace=True
)

# Load the trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('multilingual_tokenizer.model')
```
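For reference, here is roughly how I use the language tokens when encoding (a simplified sketch; the exact preprocessing in my pipeline is assumed here):

```python
# Hypothetical usage sketch: prepend the target-language token, then encode.
# (My actual pipeline may differ slightly; this just shows the idea.)
tgt_tok = sp.piece_to_id("<fr_FR>")
src_ids = [tgt_tok] + sp.encode("give me your email", out_type=int) + [sp.eos_id()]

# Round-trip check: decode should recover the original text
print(sp.decode(sp.encode("give me your email", out_type=int)))
```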

Should I use a pretrained tokenizer like the mBART-50 tokenizer instead? Is it OK to use a pretrained tokenizer with my own custom-built seq2seq model with attention? I'm concerned about its huge vocabulary size of ~250k. What should I do?
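For context, this is the kind of thing I mean (a minimal sketch using the `transformers` mBART-50 tokenizer; the checkpoint name and language codes are the standard mBART-50 ones):

```python
from transformers import MBart50TokenizerFast

# Pretrained mBART-50 tokenizer; its vocabulary is ~250k tokens, which
# would make my model's embedding and output projection layers very large.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="fr_XX"
)
print(len(tokenizer))  # ~250k -- this is the size I'm worried about
print(tokenizer("give me your email").input_ids)
```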
