Encoder-Decoder model only generates bos_tokens [<s><s><s>]

Hello,

I fine-tuned a BERT2BERT-like encoder-decoder model for a translation task. The model used is microsoft/Multilingual-MiniLM-L12-H384 with XLMRobertaTokenizer.
When I run the fine-tuned model with the generate() method, it only generates the BOS token until it reaches the decoding length limit.
The decoder_start_token_id is set to the same value as the bos_token_id.
I've checked some related GitHub issues, but they were for the BART model, and the solutions proposed didn't work for my model.

Any suggestions or ideas would be appreciated.
Thanks.

Hi @AbdelrahmanZ, could you provide a script (or the command you used to launch the training) that reproduces the issue, please?

Actually, the training script is fairly large, but I will share the most important parts:
Loading the model:

    from transformers import EncoderDecoderModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token
    encoder_decoder_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "microsoft/Multilingual-MiniLM-L12-H384",
        "microsoft/Multilingual-MiniLM-L12-H384",
    )
    encoder_decoder_model.config.decoder_start_token_id = tokenizer.bos_token_id
    encoder_decoder_model.config.eos_token_id = tokenizer.eos_token_id
    encoder_decoder_model.config.pad_token_id = tokenizer.pad_token_id

Single (source, target) example tokenization:

        model_inputs = self.tokenizer(
            examples["source"].strip(),
            max_length=self.params["encoder_max_length"],
            padding=False,
            truncation=True,
        )
        targets = self.tokenizer(
            examples["target"].strip(),
            max_length=self.params["decoder_max_length"],
            padding=False,
            truncation=True,
        )
        model_inputs["labels"] = targets["input_ids"]

Then I create a Seq2SeqTrainer and train the model.
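Roughly, the trainer setup looks like this (a simplified sketch - the dataset variables and the hyper-parameter values are placeholders, not the exact ones from my script):

    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # Placeholder hyper-parameters; my real script uses different values.
    training_args = Seq2SeqTrainingArguments(
        output_dir="outputs",
        num_train_epochs=2,
        per_device_train_batch_size=8,
        save_steps=1000,
        predict_with_generate=True,
    )
    # Pads inputs and labels dynamically per batch (since padding=False above).
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=encoder_decoder_model)
    trainer = Seq2SeqTrainer(
        model=encoder_decoder_model,
        args=training_args,
        train_dataset=tokenized_train_dataset,  # dataset mapped through the preprocessing above
        eval_dataset=tokenized_eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()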
For inference, this is the generation config:

generation_config = dict(
    max_length=None,
    min_length=None,
    do_sample=False,
    early_stopping=True,
    num_beams=1,
    temperature=1.0,
    top_k=None,
    top_p=None,
    length_penalty=1.0,  # > 1.0 longer sequences, < 1.0 shorter sequences
    num_return_sequences=1,
    max_time=None,  # in seconds
    num_beam_groups=1,
    output_scores=False,
)

Call to generate():

    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs.input_ids
    generated_texts = model.generate(input_ids, **generation_config)

I hope this is enough to reproduce the issue. Thank you @ydshieh .

Hi @AbdelrahmanZ, thank you for the effort. It would still be much easier if a single training script could be provided. Do you mind copying the code and pasting it into a Google Colab notebook (in a single cell), or even putting your script in a GitHub repository? Otherwise, you can send me an email if nothing else works. Thank you!

Continuing the discussion here.

From @AbdelrahmanZ:

I just shared the notebook. In it I trained the model on an Arabic-to-English translation task; generate() works fine, and the issue doesn't appear.
The issue of generating [<s> <s> <s>...] happened for me when training on an Arabic-to-Arabic task. I think it could be related to:

  • The type of source and target languages
  • Maybe it appears only after a large number of training steps
  • I should rerun the fine-tuning and check the checkpoints gradually; unfortunately this would be painful for me, as I'd need to wait at least two weeks for it to finish.

Hi @AbdelrahmanZ

In the notebook, you have num_train_epochs=2, but you mentioned "Maybe it appears only after a large number of training steps". What num_train_epochs did you use when you trained on Arabic to Arabic (for which you got the issue)?

Would you be able to reproduce it with Arabic to Arabic while keeping everything else the same (with the notebook)?

From your description, it seems there is no real bug in our codebase, but it is still interesting to figure out why you got the issue in the first place - if you can find a situation that reproduces it, we can see whether there is indeed something we can/should fix.

Thank you!

Hi @ydshieh

Regarding num_train_epochs, that was just an assumption; I'm not sure it is the cause.
I will try to rerun the fine-tuning with some modified hyper-parameters; hopefully the issue will be reproduced.

Hi @ydshieh

I think I've reproduced the issue. It appeared when the model was initialized from a checkpoint.
Note that this didn't happen on the same model instance that was fine-tuned, only after loading the model from a saved checkpoint.
So I believe the steps to reproduce the issue are:

  1. Run the fine-tuning script (the notebook previously shared).
  2. Save a checkpoint.
  3. Run the following script, which initializes a new instance of the model from the saved checkpoint.
import torch
from transformers import (
    EncoderDecoderModel,
    XLMRobertaTokenizer,
)
generation_config = dict(
    max_length=None,
    min_length=None,
    do_sample=False,
    early_stopping=True,
    num_beams=1,
    temperature=1.0,
    top_k=None,
    top_p=None,
    length_penalty=1.0,  # > 1.0 longer sequences, < 1.0 shorter sequences
    num_return_sequences=1,
    max_time=None,  # in seconds
    num_beam_groups=1,
    output_scores=False,
)

model_path = "checkpoint-path"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model = EncoderDecoderModel.from_pretrained(model_path)
model.config.vocab_size = model.config.decoder.vocab_size  # expose the decoder's vocab size on the top-level config for generation
model = model.eval()
text = "أجرت فرقة العمل، يشترك في رئاسته جيف موس وآلان Paller، مقابلات مكثفة مع خبراء من الحكومة والقطاع الخاص، والأوساط الأكاديمية في تطوير توصياتها لتنمية المهارات الفنية المتقدمة للأمن السيبراني القوى العاملة DHS وتوسيع خط أنابيب الوطني للرجال والنساء مع هذه المهارات الأمن السيبراني."
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs.input_ids
generated_texts = model.generate(input_ids, **generation_config)
print(generated_texts)

Requirements:

transformers==4.17.0
torch==1.10.1
sentencepiece==0.1.96

I also tried the latest transformers version, but the behavior is the same.


Hi Abdelrahman Alzboon!

Thank you very much for the effort. Could you post your findings above to the forum thread that you created?
We will definitely look into this issue.

Cheers,
Yih-Dar


I can reproduce the issue with the notebook, using smaller datasets/shorter sequences. In my case there is no need to save and reload; the issue occurs after some training steps.

Will definitely look into this.


Hi @AbdelrahmanZ,

Have you tried to

  model_inputs["labels"] = model_inputs["labels"][1:]

after

  model_inputs["labels"] = targets["input_ids"]

inside

preprocess_function

I modified your script a bit for faster debugging, and adding this extra line works for me.

I would suggest trying the fix with a smaller subset of the dataset, shorter sequences, and fewer steps, to make sure it works on your side before launching a large training run.

Here is the modified notebook
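For clarity, the relevant part of preprocess_function would then look roughly like this (a sketch based on your snippets, with the self. prefixes and the max-length parameters simplified into placeholders):

    def preprocess_function(examples):
        model_inputs = tokenizer(
            examples["source"].strip(),
            max_length=encoder_max_length,  # placeholder for params["encoder_max_length"]
            padding=False,
            truncation=True,
        )
        targets = tokenizer(
            examples["target"].strip(),
            max_length=decoder_max_length,  # placeholder for params["decoder_max_length"]
            padding=False,
            truncation=True,
        )
        model_inputs["labels"] = targets["input_ids"]
        # Drop the leading <bos> (token id 0) from the labels, so the shifted
        # decoder inputs don't become <decoder_start (0)> <bos (0)> ...
        model_inputs["labels"] = model_inputs["labels"][1:]
        return model_inputs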

Hi @ydshieh ,

Great catch; unfortunately, it requires re-fine-tuning the model from scratch.
I've used the same script before with other BERT-like models and it worked well.
Is the problem related to fine-tuning and input preparation?
Can we change something in generation so that it generates text normally?

Thank you.

Hi. I think the training (fine-tuning) is unfortunately broken. If you look at a few examples of model_inputs["labels"] in preprocess_function, your original script keeps a leading <bos> (token id 0) in the labels. The decoder inputs are then prepared as <decoder_start (0)> <bos (0)> ..., and therefore <decoder_start (0)> is trained to predict <bos (0)>. That's the main cause of the issue.
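To illustrate with a toy example (the ids other than 0 are made up; for an EncoderDecoderModel, transformers builds the decoder inputs by shifting the labels one position to the right and prepending decoder_start_token_id):

    # Labels as produced by the original preprocess_function: the leading <bos> (0) is kept.
    labels = [0, 1205, 87, 2]  # <s> token_a token_b </s>  (toy ids)

    decoder_start_token_id = 0
    # Shift right: prepend decoder_start_token_id and drop the last label position.
    decoder_input_ids = [decoder_start_token_id] + labels[:-1]
    print(decoder_input_ids)  # [0, 0, 1205, 87]
    # At the first position the model sees 0 and is trained to predict 0 (<bos>),
    # so at inference time it keeps emitting <s> right after decoder_start.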

I don't know why it works sometimes, however. BTW, would you mind sharing which BERT models you trained that do work?

Thank you and good luck!

Hi @ydshieh ,

So the model learns to generate <bos (0)> whenever the given input is 0, and so on until the max length.

Though, as I told you before, the same preprocessing was also applied to other models, for example lanwuwei/GigaBERT-v4-Arabic-and-English. For that model decoder_start_token_id=2, and the token generation for the first steps is the same between the two models - I mean GigaBERT also predicts <bos (2)> after <decoder_start (2)> - but after this step it starts generating tokens as expected.

I've debugged the generate() method and compared the generation steps of the two models; everything is the same, but when it comes to generating the next token, the MiniLM model generates BOS tokens instead.
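In case it helps, this is roughly how I inspected the per-step predictions of both models (a simplified sketch; max_length here is arbitrary):

    import torch

    with torch.no_grad():
        out = model.generate(
            input_ids,
            max_length=20,
            num_beams=1,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
    print(out.sequences)  # generated token ids
    for step, scores in enumerate(out.scores):
        top = torch.topk(scores[0], k=5)  # top-5 candidates at each decoding step
        print(step, top.indices.tolist(), top.values.tolist())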

Hi @AbdelrahmanZ

I see. Notice that even though it works for some models, we have already seen this (or a similar) issue occur a few times before (for BART/Longformer, etc., although the cause was somewhat different).

But in your case here, your decoder_start_token_id is the same as bos_token_id, and also the same as pad_token_id. When the generate method produces a pad_token_id, there is some logic that treats the generation as finished, and it won't generate any new tokens, only padding. This is probably the reason why it doesn't work in your specific case for MiniLM.
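A quick way to double-check this on your side is to print the relevant ids from the checkpoint's config and tokenizer, for example:

    print("decoder_start_token_id:", model.config.decoder_start_token_id)
    print("bos_token_id (config):", model.config.bos_token_id)
    print("pad_token_id (config):", model.config.pad_token_id)
    print("eos_token_id (config):", model.config.eos_token_id)
    print("tokenizer bos/pad/eos:", tokenizer.bos_token_id, tokenizer.pad_token_id, tokenizer.eos_token_id)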

Alright, it seems I need to re-fine-tune the model. I will mark the issue as solved.

Thank you @ydshieh, I really appreciate your help. If you need any further information or want to debug the issue further, I'll be glad to help. Good luck.


Just want to share (or reiterate) a tip: it's always a good idea to run the training with a much smaller subset of the dataset and far fewer training steps, like the one I shared previously. (I know, you have used the same training script for other models :slight_smile: where it works!)

It's also a good idea to modify the training script to perform some generation when saving checkpoints - and to save the generation results for investigation - so that any issue can be spotted earlier.
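For example, a minimal callback sketch (the class name and the sample handling are just illustrative):

    from transformers import TrainerCallback

    class GenerateOnSaveCallback(TrainerCallback):
        """Run a quick generation on a fixed sample whenever a checkpoint is saved."""

        def __init__(self, tokenizer, sample_text, max_length=64):
            self.tokenizer = tokenizer
            self.sample_text = sample_text
            self.max_length = max_length

        def on_save(self, args, state, control, model=None, **kwargs):
            input_ids = self.tokenizer(self.sample_text, return_tensors="pt").input_ids.to(model.device)
            generated = model.generate(input_ids, max_length=self.max_length, num_beams=1)
            # Keep special tokens visible so a run of <s> tokens is easy to spot.
            print(f"step {state.global_step}:", self.tokenizer.batch_decode(generated, skip_special_tokens=False))

    # trainer.add_callback(GenerateOnSaveCallback(tokenizer, sample_text="some fixed source sentence"))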

Good luck!


I just found a parameter for generate(): begin_suppress_tokens (transformers==4.25.1). It prevents generating the BOS token at the start. I tried it, and now the model generates sequences normally, though I'm not sure whether it will affect the overall performance.
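For reference, this is roughly how I pass it (assuming the BOS token id is the one to suppress):

    generated_texts = model.generate(
        input_ids,
        begin_suppress_tokens=[tokenizer.bos_token_id],  # don't allow <s> as the first generated token
        **generation_config,
    )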