Hi, I'm building a seq2seq model with attention for multilingual translation. I trained a SentencePiece tokenizer on my data, but when I pass unseen sentences for translation after training, I get poor translations like these:
Source: good morning
Translation: bonne
Source: give me your email
Translation: donne-moi ton email
Source: give me your email please
Translation: donne-moi ton e sil vous plaît
I'm using the SentencePiece tokenizer like this:

import sentencepiece as spm
# Train the SentencePiece tokenizer
spm.SentencePieceTrainer.train(
    input='spm_training_data.txt',
    model_prefix='multilingual_tokenizer',
    vocab_size=32000,
    character_coverage=0.9995,
    model_type='bpe',
    bos_id=1,
    eos_id=2,
    pad_id=0,
    unk_id=3,
    user_defined_symbols=["<en_US>", "<fr_FR>"],
    split_by_whitespace=True
)

# Load the trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('multilingual_tokenizer.model')
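For reference, here is a minimal sketch of how I feed sentences through it (the encode_source / encode_target helpers and the convention of prepending the target-language tag are just my own setup, sketched from memory, not something SentencePiece does automatically):

# Encode a source sentence, prepending the target-language tag so the decoder
# knows which language to produce (my own convention, assumed here)
def encode_source(text, target_lang="<fr_FR>"):
    lang_id = sp.piece_to_id(target_lang)      # id of the user-defined symbol
    ids = sp.encode(text, out_type=int)        # BPE piece ids; no BOS/EOS added by default
    return [lang_id] + ids + [sp.eos_id()]     # eos_id() == 2 from training

# Decoder targets get BOS ... EOS around the target-language sentence
def encode_target(text):
    return [sp.bos_id()] + sp.encode(text, out_type=int) + [sp.eos_id()]

print(encode_source("give me your email"))
print(sp.decode(encode_target("donne-moi ton email")[1:-1]))  # round-trip check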
Should I use a pretrained tokenizer like the mBART-50 tokenizer instead? Is it OK to use a pretrained tokenizer with my own seq2seq model with attention? I'm concerned about its huge vocabulary size of around 250k. What should I do?
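For context, this is roughly what switching would look like (assuming the Hugging Face facebook/mbart-large-50 checkpoint; the embedding-size consequence is what worries me):

from transformers import MBart50TokenizerFast

# Pretrained mBART-50 tokenizer; src_lang / tgt_lang select the language codes
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="fr_FR"
)

print(len(tokenizer))   # ~250k pieces, so my embedding and output layers
                        # would need ~250k rows instead of 32k
print(tokenizer("give me your email").input_ids)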