Keeping some tokens untranslated

Hello everyone,

I have a question about the MarianMT implementation in transformers.

I’m looking for a way to prevent certain tokens from being translated, like mentions or URLs in tweets.

I’ve tried using special_tokens and additional_special_tokens, but the tags are not preserved and I get some weird outputs. The tokenization seems right: the tokenizer treats each tag as a single special token, yet the translation still gives unexpected results.

To make it more concrete, here’s an example of an input and the output I get, where I’d like the tag to be preserved (in this case without using any special tokens):

here’s my email <EMAIL> -> <MEMAIL> هذا هو بريدي الإلكتروني

Other tags are changed as well. In French, for example, <EMAIL> is sometimes translated to <MAIL>.

I’ve managed to hack something together by replacing the tag with a string that seems to be preserved, at least for French (e.g. ABCDE), and swapping it back afterwards. I’d still like to do it the right way, and to be sure that it works for all the language pairs I need to apply it to.
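For reference, the workaround looks roughly like the sketch below. It is only an illustration, not a recommended solution: the `PROTECTED` pattern, the `XQZ` placeholder prefix, and the `mask`, `unmask`, and `translate_preserving` helpers are all names I made up, and `translate` stands for any callable that wraps the actual model call. Nothing guarantees that a given model leaves such placeholders untouched, which is exactly the problem.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical pattern for spans to protect: URLs, @mentions, <TAG>-style tags.
PROTECTED = re.compile(r"https?://\S+|@\w+|<[A-Z_]+>")


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each protected span with a placeholder and remember the mapping."""
    mapping: Dict[str, str] = {}

    def repl(match: "re.Match[str]") -> str:
        # Assumption: strings like "XQZ0" survive translation unchanged.
        key = f"XQZ{len(mapping)}"
        mapping[key] = match.group(0)
        return key

    return PROTECTED.sub(repl, text), mapping


def unmask(text: str, mapping: Dict[str, str]) -> str:
    """Put the original spans back in place of the placeholders."""
    # Longer keys first, so "XQZ10" is not clobbered by "XQZ1".
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text


def translate_preserving(text: str, translate: Callable[[str], str]) -> str:
    """Mask protected spans, run the translator, then restore the spans."""
    masked, mapping = mask(text)
    return unmask(translate(masked), mapping)
```

Here `translate` would be a small wrapper around the MarianMT tokenizer and model. If the model rewrites or drops a placeholder, `unmask` silently leaves the output unrestored, so this needs checking per language pair.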

The models I tried are Helsinki-NLP/opus-mt-en-fr (or en-ar), with transformers version 3.3.1.

Any advice? Thanks,

Shachar
