SentencePiece tokenizer or mBART-50

Hi, I'm building a seq2seq model with attention for multilingual translation. I trained a SentencePiece tokenizer on my data, but when I pass unseen sentences for translation after training, I get poor translations like these:

Source: good morning
Translation: bonne

Source: give me your email
Translation: donne-moi ton email

Source: give me your email please
Translation: donne-moi ton e sil vous plaît

I'm using the SentencePiece tokenizer like this:

```python
import sentencepiece as spm

# Train the SentencePiece tokenizer
spm.SentencePieceTrainer.train(
    input='spm_training_data.txt',
    model_prefix='multilingual_tokenizer',
    vocab_size=32000,
    character_coverage=0.9995,
    model_type='bpe',
    bos_id=1,
    eos_id=2,
    pad_id=0,
    unk_id=3,
    user_defined_symbols=["<en_US>", "<fr_FR>"],
    split_by_whitespace=True
)

# Load the trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('multilingual_tokenizer.model')
```
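For reference, here is roughly how I use the language tokens when encoding (a simplified sketch; the exact preprocessing in my pipeline is assumed here):

```python
# Hypothetical usage sketch: prepend the target-language token, then encode.
# (My actual pipeline may differ slightly; this just shows the idea.)
tgt_tok = sp.piece_to_id("<fr_FR>")
src_ids = [tgt_tok] + sp.encode("give me your email", out_type=int) + [sp.eos_id()]

# Round-trip check: decode should recover the original text
print(sp.decode(sp.encode("give me your email", out_type=int)))
```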

Should I use a pretrained tokenizer like the mBART-50 tokenizer instead? Is it OK to use a pretrained tokenizer with my own custom-built seq2seq model with attention? I'm concerned about its huge vocabulary size of ~250k. What should I do?
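For context, this is the kind of thing I mean (a minimal sketch using the `transformers` mBART-50 tokenizer; the checkpoint name and language codes are the standard mBART-50 ones):

```python
from transformers import MBart50TokenizerFast

# Pretrained mBART-50 tokenizer; its vocabulary is ~250k tokens, which
# would make my model's embedding and output projection layers very large.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="fr_XX"
)
print(len(tokenizer))  # ~250k -- this is the size I'm worried about
print(tokenizer("give me your email").input_ids)
```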
