Keeping some tokens untranslated

shachar · October 15, 2020, 5:35pm

Hello everyone,

I have a question about the MarianMT implementation in transformers.

I’m looking for a way to prevent certain tokens from being translated, like mentions or URLs in tweets.

I’ve tried using special_tokens and additional_special_tokens, but the tags are not preserved and I get some weird outputs. The tokenization seems right, and the relevant tokenizer treats each tag as a single special token, but the translation still gives unexpected results.

To make it more concrete, here’s an example of my input, where I’d like the tags to be preserved (in this case without using any special tokens):

here’s my email <EMAIL> -> <MEMAIL> هذا هو بريدي الإلكتروني

Other tags are changed as well. In French, for example, <EMAIL> is sometimes translated to <MAIL>.

I’ve managed to hack together a workaround by replacing the tag with a string that seems to be preserved, at least for French (e.g. ABCDE), and swapping the tag back in afterwards. I’d still like to do it the right way, and to be sure that it works for all the language pairs I need to apply it to.
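For reference, the placeholder workaround could be sketched roughly like this. It is only a sketch under assumptions: the tag pattern, the placeholder format (XQZ0QX and so on) and all function names are made up for illustration, and the `translate` argument stands in for whatever function calls the model. Nothing here guarantees that a given model leaves the placeholder untouched, which is exactly the part I'm unsure about.

```python
import re
from typing import Callable, Dict, Tuple

# Assumed tag format: upper-case names in angle brackets, e.g. <EMAIL>, <URL>.
TAG_PATTERN = re.compile(r"<[A-Z_]+>")


def mask_tags(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each tag with a unique placeholder and remember the mapping."""
    mapping: Dict[str, str] = {}

    def _substitute(match: "re.Match[str]") -> str:
        # Hypothetical placeholder format; it must be a string that the
        # translation model happens to copy through unchanged.
        placeholder = f"XQZ{len(mapping)}QX"
        mapping[placeholder] = match.group(0)
        return placeholder

    return TAG_PATTERN.sub(_substitute, text), mapping


def restore_tags(text: str, mapping: Dict[str, str]) -> str:
    """Put the original tags back in place of their placeholders."""
    for placeholder, tag in mapping.items():
        text = text.replace(placeholder, tag)
    return text


def translate_preserving_tags(text: str, translate: Callable[[str], str]) -> str:
    """Mask tags, run the supplied translate function, then restore the tags."""
    masked, mapping = mask_tags(text)
    return restore_tags(translate(masked), mapping)
```

Here `translate` would be a small wrapper around the MarianMT tokenizer and model; if the model alters the placeholder, the tag is not restored and the mangled placeholder stays in the output.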

The models I tried are Helsinki-NLP/opus-mt-en-fr and Helsinki-NLP/opus-mt-en-ar, with transformers version 3.3.1.

Any advice? Thanks,

Shachar
