MarianMT training produces "▁" in results

morenolq · March 10, 2021, 12:26pm

Good morning,

I’m trying to fine-tune a MarianMT model using parallel sentences (French-English and German-English at the moment).
While the standard model does not include any “strange” symbols in the translated sentences, after a few training iterations the model starts to output this symbol → ▁ (it’s not the standard underscore).

The sentences are something like:
He also played three days later against Bolivia. -> Il▁joue aussi▁trois jours plus▁tard▁contre la Bolivie. or
Prenons un peu de recul et demandons-nous, pourquoi enseigne-t-on les maths? -> Take a little back and ask▁ourselves,▁Why are they teaching math?

The same character appears using German/English. Has anyone experienced the same? Am I missing something?

Thank you.

dave-kudo · February 21, 2022, 6:34pm

I’ve run into this recently as well. I believe it’s a SentencePiece artifact, and I’ve just been converting to space in post-processing, but if anyone can shed light on why this artifact occurs when fine-tuning, I’d be keen to learn more.
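For anyone wanting to try the same workaround, here is a minimal sketch of that post-processing step. It assumes the stray character is U+2581 (LOWER ONE EIGHTH BLOCK), which SentencePiece uses as its word-boundary marker, and that replacing it with a plain space is acceptable for your data. The function name `strip_sentencepiece_marker` is just an illustrative choice, not part of any library.

```python
# Post-processing sketch: replace the SentencePiece word-boundary marker
# with a space. Assumption: the stray symbol is U+2581, not "_" (U+005F).
SPM_MARKER = "\u2581"  # "▁"


def strip_sentencepiece_marker(text: str) -> str:
    """Replace every U+2581 with a space and collapse repeated whitespace."""
    replaced = text.replace(SPM_MARKER, " ")
    # split() + join() collapses runs of whitespace and trims both ends
    return " ".join(replaced.split())


if __name__ == "__main__":
    raw = "Il\u2581joue aussi\u2581trois jours plus\u2581tard\u2581contre la Bolivie."
    print(strip_sentencepiece_marker(raw))
```

This only cleans up the decoded string; it does not explain or fix why the fine-tuned model emits the marker in the first place.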
