How to constrain mBart decoding to generate English-only output?

kangje · August 31, 2022, 3:56pm

I am using the mBART conditional generation model from Hugging Face (here is the link), and I am fine-tuning it for a multilingual translation task (it is not exactly a translation task, but for the sake of simplicity I’ll call it one).

I’ve noticed that when translating, for example, from Chinese to English, the model sometimes simply copies parts of the Chinese input text into the output. This is undesirable for my task, because it leaves Chinese characters in the middle of the output text. I want to constrain the model’s decoding so that it only generates subwords made of English letters.

I use forced_bos_token_id to tell the model to begin the translation with a specific language code. However, this constrains only the first generated token, not the rest of the translation.

model.generate(
    input_ids=dev_input['input_ids'],
    attention_mask=dev_input['attention_mask'],
    forced_bos_token_id=forced_bos_token_id,
    num_beams=5,
)

I also took a look at ‘Constrained Decoding’ (here). With it, we can force the decoder to include certain token(s) in the output text, but it does not help with excluding certain tokens (those containing non-English characters).

Does anybody know how to force the model to generate output containing only English characters during decoding?
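
One possible direction, as a rough sketch rather than a confirmed solution: `generate` accepts a `prefix_allowed_tokens_fn(batch_id, input_ids)` callback that returns the token ids permitted at each step, so a whitelist built from the vocabulary could be passed there. The helper names below (`is_english_token`, `english_token_ids`, `make_prefix_allowed_tokens_fn`) are hypothetical, "English" is approximated as "pure ASCII once the SentencePiece word marker is removed" (which also admits digits, punctuation, and unaccented words from other Latin-script languages), and the toy vocabulary stands in for `tokenizer.get_vocab()`.

```python
from typing import Callable, Dict, Iterable, List

SPM_WORD_BOUNDARY = "\u2581"  # SentencePiece prefixes word-initial pieces with this marker


def is_english_token(token: str) -> bool:
    """Assumption: a token counts as 'English' if it is pure ASCII without the marker."""
    return token.replace(SPM_WORD_BOUNDARY, "").isascii()


def english_token_ids(vocab: Dict[str, int], always_allow: Iterable[int] = ()) -> List[int]:
    """Collect ids of ASCII-only tokens, plus ids that must never be blocked."""
    allowed = {idx for tok, idx in vocab.items() if is_english_token(tok)}
    allowed.update(always_allow)  # e.g. EOS and language-code ids
    return sorted(allowed)


def make_prefix_allowed_tokens_fn(allowed_ids: List[int]) -> Callable[[int, object], List[int]]:
    """Wrap a fixed whitelist in the callback shape that generate() expects."""
    def allowed_tokens(batch_id, input_ids):
        # The same whitelist applies at every step and for every beam.
        return allowed_ids
    return allowed_tokens


# Toy vocabulary for illustration only; a real one would come from the tokenizer.
toy_vocab = {"</s>": 2, "\u2581Hello": 10, "\u2581world": 11, "\u4f60\u597d": 12, "\u2581\u00e9t\u00e9": 13, ",": 14}
toy_allowed = english_token_ids(toy_vocab)
print(toy_allowed)  # ids 12 (Chinese) and 13 (accented) are left out

# With a real model and tokenizer (not run here; `tokenizer` is an assumed name):
# allowed = english_token_ids(
#     tokenizer.get_vocab(),
#     always_allow=list(tokenizer.all_special_ids) + [forced_bos_token_id],
# )
# model.generate(
#     input_ids=dev_input['input_ids'],
#     attention_mask=dev_input['attention_mask'],
#     forced_bos_token_id=forced_bos_token_id,
#     num_beams=5,
#     prefix_allowed_tokens_fn=make_prefix_allowed_tokens_fn(allowed),
# )
```

The EOS and language-code ids are added to the whitelist explicitly because, if they were masked out, the forced first token or the end of sequence could no longer be generated. Passing the disallowed ids through `bad_words_ids` instead would be the mirror-image approach, though with a vocabulary of this size the list of banned ids would be very long.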
