Encoder-Decoder model only generates bos_tokens [<s><s><s>]

Hello,

I fine-tuned a BERT2BERT-like encoder-decoder model for a translation task. The model used is microsoft/Multilingual-MiniLM-L12-H384 with XLMRobertaTokenizer.
When I run the fine-tuned model with the generate() method, it only generates the BOS token until it reaches the decoding length limit.
The decoder_start_token_id is set to the same value as the bos_token_id.
I've checked some related GitHub issues, but they were for the BART model, and the solutions proposed didn't work for my model.

Any suggestions or ideas would be appreciated.
Thanks.

Hi @AbdelrahmanZ, could you provide a script (or the command you used to launch the training) that reproduces the issue, please?

Actually, the training script is fairly large, but I will share the most important parts:
Loading the model:

    from transformers import EncoderDecoderModel, XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("microsoft/Multilingual-MiniLM-L12-H384")
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token
    encoder_decoder_model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "microsoft/Multilingual-MiniLM-L12-H384",
        "microsoft/Multilingual-MiniLM-L12-H384",
    )
    encoder_decoder_model.config.decoder_start_token_id = tokenizer.bos_token_id
    encoder_decoder_model.config.eos_token_id = tokenizer.eos_token_id
    encoder_decoder_model.config.pad_token_id = tokenizer.pad_token_id

Single (source, target) example tokenization:

        model_inputs = self.tokenizer(
            examples["source"].strip(),
            max_length=self.params["encoder_max_length"],
            padding=False,
            truncation=True,
        )
        targets = self.tokenizer(
            examples["target"].strip(),
            max_length=self.params["decoder_max_length"],
            padding=False,
            truncation=True,
        )
        model_inputs["labels"] = targets["input_ids"]

Then I create a Seq2SeqTrainer and train the model.
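Roughly, the trainer setup looks like this (a simplified sketch - the dataset variables and the hyper-parameter values are placeholders, not the exact ones from my script):

    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # Placeholder hyper-parameters; my real script uses different values.
    training_args = Seq2SeqTrainingArguments(
        output_dir="outputs",
        num_train_epochs=2,
        per_device_train_batch_size=8,
        save_steps=1000,
        predict_with_generate=True,
    )
    # Pads inputs and labels dynamically per batch (since padding=False above).
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=encoder_decoder_model)
    trainer = Seq2SeqTrainer(
        model=encoder_decoder_model,
        args=training_args,
        train_dataset=tokenized_train_dataset,  # dataset mapped through the preprocessing above
        eval_dataset=tokenized_eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()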
For inference, this is the generation config:

generation_config = dict(
    max_length=None,
    min_length=None,
    do_sample=False,
    early_stopping=True,
    num_beams=1,
    temperature=1.0,
    top_k=None,
    top_p=None,
    length_penalty=1.0,  # > 1.0 longer sequences, < 1.0 shorter sequences
    num_return_sequences=1,
    max_time=None,  # in seconds
    num_beam_groups=1,
    output_scores=False,
)

Call to generate():

    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs.input_ids
    generated_texts = model.generate(input_ids, **generation_config)

I hope this is enough to reproduce the issue. Thank you @ydshieh .

Hi @AbdelrahmanZ, thank you for the effort. It would still be much easier if a single training script could be provided. Do you mind copying the code and pasting it into a Google Colab notebook (in a single cell), or even putting your script in a GitHub repository? Otherwise, you can send me an email if nothing else works. Thank you!

Continuing the discussion here.

From @AbdelrahmanZ:

I just shared the notebook. In it I trained the model on an Arabic-to-English translation task; generate() works fine, and the issue doesn't appear.
The issue of generating [<s> <s> <s>...] happened for me when training on an Arabic-to-Arabic task. I think it could be related to:

  • The type of source and target languages
  • Maybe it appears only after a large number of training steps
  • I should rerun the fine-tuning and check the checkpoints gradually; unfortunately this would be painful for me, as I'd need to wait at least two weeks for it to finish.

Hi @AbdelrahmanZ

In the notebook, you have num_train_epochs=2, but you mentioned "Maybe it appears only after a large number of training steps". What num_train_epochs did you use when you trained on Arabic to Arabic (for which you got the issue)?

Would you be able to reproduce it with Arabic to Arabic while keeping everything else the same (with the notebook)?

From your description, it seems there is no real bug in our codebase, but it is still interesting to figure out why you got the issue in the first place - if you can find a situation that reproduces it, we can see whether there is indeed something we can/should fix.

Thank you!

Hi @ydshieh

Regarding num_train_epochs, that was just an assumption; I'm not sure it is the cause.
I will try to rerun the fine-tuning with some modified hyper-parameters; hopefully the issue will be reproduced.

Hi @ydshieh

I think I've reproduced the issue. It appeared when the model was initialized from a checkpoint.
Note that this didn't happen on the same model instance that was fine-tuned, only after loading the model from a saved checkpoint.
So I believe the steps to reproduce the issue are:

  1. Run the fine-tuning script (the notebook previously shared).
  2. Save a checkpoint.
  3. Run the following script, which initializes a new instance of the model from the saved checkpoint.
import torch
from transformers import (
    EncoderDecoderModel,
    XLMRobertaTokenizer,
)
generation_config = dict(
    max_length=None,
    min_length=None,
    do_sample=False,
    early_stopping=True,
    num_beams=1,
    temperature=1.0,
    top_k=None,
    top_p=None,
    length_penalty=1.0,  # > 1.0 longer sequences, < 1.0 shorter sequences
    num_return_sequences=1,
    max_time=None,  # in seconds
    num_beam_groups=1,
    output_scores=False,
)

model_path = "checkpoint-path"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model = EncoderDecoderModel.from_pretrained(model_path)
model.config.vocab_size = model.config.decoder.vocab_size  # expose the decoder's vocab size on the top-level config for generation
model = model.eval()
text = "أجرت فرقة العمل، يشترك في رئاسته جيف موس وآلان Paller، مقابلات مكثفة مع خبراء من الحكومة والقطاع الخاص، والأوساط الأكاديمية في تطوير توصياتها لتنمية المهارات الفنية المتقدمة للأمن السيبراني القوى العاملة DHS وتوسيع خط أنابيب الوطني للرجال والنساء مع هذه المهارات الأمن السيبراني."
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs.input_ids
generated_texts = model.generate(input_ids, **generation_config)
print(generated_texts)

Requirements:

transformers==4.17.0
torch==1.10.1
sentencepiece==0.1.96

I also tried the latest transformers version, but the behavior is the same.


Hi Abdelrahman Alzboon!

Thank you very much for the effort. Could you post your findings above to the forum thread that you created?
We will definitely look into this issue.

Cheers,
Yih-Dar


I can reproduce the issue with the notebook, using smaller datasets/shorter sequences. In my case there is no need to save and reload; the issue occurs after some training steps.

Will definitely look into this.


Hi @AbdelrahmanZ,

Have you tried to

  model_inputs["labels"] = model_inputs["labels"][1:]

after

  model_inputs["labels"] = targets["input_ids"]

inside

preprocess_function

I modified your script a bit for faster debugging, and adding this extra line works for me.

I would suggest trying the fix with a smaller subset of the dataset, shorter sequences, and fewer steps, to make sure it works on your side before launching a large training run.

Here is the modified notebook
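For clarity, the relevant part of preprocess_function would then look roughly like this (a sketch based on your snippets, with the self. prefixes and the max-length parameters simplified into placeholders):

    def preprocess_function(examples):
        model_inputs = tokenizer(
            examples["source"].strip(),
            max_length=encoder_max_length,  # placeholder for params["encoder_max_length"]
            padding=False,
            truncation=True,
        )
        targets = tokenizer(
            examples["target"].strip(),
            max_length=decoder_max_length,  # placeholder for params["decoder_max_length"]
            padding=False,
            truncation=True,
        )
        model_inputs["labels"] = targets["input_ids"]
        # Drop the leading <bos> (token id 0) from the labels, so the shifted
        # decoder inputs don't become <decoder_start (0)> <bos (0)> ...
        model_inputs["labels"] = model_inputs["labels"][1:]
        return model_inputs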

Hi @ydshieh ,

Great catch; unfortunately, it requires re-fine-tuning the model from scratch.
I've used the same script before with other BERT-like models and it worked well.
Is the problem related to fine-tuning and input preparation?
Can we change something in generation so that it generates text normally?

Thank you.

Hi. I think the training (fine-tuning) is unfortunately broken. If you look at a few examples of model_inputs["labels"] in preprocess_function, your original script keeps a leading <bos> (token id 0) in the labels. The decoder inputs are then prepared as <decoder_start (0)> <bos (0)> ..., and therefore <decoder_start (0)> is trained to predict <bos (0)>. That's the main cause of the issue.
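To illustrate with a toy example (the ids other than 0 are made up; for an EncoderDecoderModel, transformers builds the decoder inputs by shifting the labels one position to the right and prepending decoder_start_token_id):

    # Labels as produced by the original preprocess_function: the leading <bos> (0) is kept.
    labels = [0, 1205, 87, 2]  # <s> token_a token_b </s>  (toy ids)

    decoder_start_token_id = 0
    # Shift right: prepend decoder_start_token_id and drop the last label position.
    decoder_input_ids = [decoder_start_token_id] + labels[:-1]
    print(decoder_input_ids)  # [0, 0, 1205, 87]
    # At the first position the model sees 0 and is trained to predict 0 (<bos>),
    # so at inference time it keeps emitting <s> right after decoder_start.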

I don't know why it works sometimes, however. BTW, would you mind sharing which BERT models you trained that do work?

Thank you and good luck!

Hi @ydshieh ,

So the model learns to generate <bos (0)> whenever the given input is 0, and so on until the max length.

Though, as I told you before, the same preprocessing was also applied to other models, for example lanwuwei/GigaBERT-v4-Arabic-and-English. For that model decoder_start_token_id=2, and the token generation for the first steps is the same between the two models - I mean GigaBERT also predicts <bos (2)> after <decoder_start (2)> - but after this step it starts generating tokens as expected.

I've debugged the generate() method and compared the generation steps of the two models; everything is the same, but when it comes to generating the next token, the MiniLM model generates BOS tokens instead.
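In case it helps, this is roughly how I inspected the per-step predictions of both models (a simplified sketch; max_length here is arbitrary):

    import torch

    with torch.no_grad():
        out = model.generate(
            input_ids,
            max_length=20,
            num_beams=1,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
    print(out.sequences)  # generated token ids
    for step, scores in enumerate(out.scores):
        top = torch.topk(scores[0], k=5)  # top-5 candidates at each decoding step
        print(step, top.indices.tolist(), top.values.tolist())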

Hi @AbdelrahmanZ

I see. Notice that even though it works for some models, we have already seen this (or a similar) issue occur a few times before (for BART/Longformer, etc., although the cause was somewhat different).

But in your case here, your decoder_start_token_id is the same as bos_token_id, and also the same as pad_token_id. When the generate method produces a pad_token_id, there is some logic that treats the generation as finished, and it won't generate any new tokens, only padding. This is probably the reason why it doesn't work in your specific case for MiniLM.
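A quick way to double-check this on your side is to print the relevant ids from the checkpoint's config and tokenizer, for example:

    print("decoder_start_token_id:", model.config.decoder_start_token_id)
    print("bos_token_id (config):", model.config.bos_token_id)
    print("pad_token_id (config):", model.config.pad_token_id)
    print("eos_token_id (config):", model.config.eos_token_id)
    print("tokenizer bos/pad/eos:", tokenizer.bos_token_id, tokenizer.pad_token_id, tokenizer.eos_token_id)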

Alright, it seems I need to re-fine-tune the model. I will mark the issue as solved.

Thank you @ydshieh, I really appreciate your help. If you need any further information or want to debug the issue further, I'll be glad to help. Good luck.


Just want to share (or reiterate) a tip: it's always a good idea to run the training with a much smaller subset of the dataset and far fewer training steps, like the one I shared previously. (I know, you have used the same training script for other models :slight_smile: where it works!)

It's also a good idea to modify the training script to perform some generation when saving checkpoints - and to save the generation results for investigation - so that any issue can be spotted earlier.
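For example, a minimal callback sketch (the class name and the sample handling are just illustrative):

    from transformers import TrainerCallback

    class GenerateOnSaveCallback(TrainerCallback):
        """Run a quick generation on a fixed sample whenever a checkpoint is saved."""

        def __init__(self, tokenizer, sample_text, max_length=64):
            self.tokenizer = tokenizer
            self.sample_text = sample_text
            self.max_length = max_length

        def on_save(self, args, state, control, model=None, **kwargs):
            input_ids = self.tokenizer(self.sample_text, return_tensors="pt").input_ids.to(model.device)
            generated = model.generate(input_ids, max_length=self.max_length, num_beams=1)
            # Keep special tokens visible so a run of <s> tokens is easy to spot.
            print(f"step {state.global_step}:", self.tokenizer.batch_decode(generated, skip_special_tokens=False))

    # trainer.add_callback(GenerateOnSaveCallback(tokenizer, sample_text="some fixed source sentence"))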

Good luck!


I just found a parameter for generate(): begin_suppress_tokens (transformers==4.25.1). It prevents generating the BOS token at the start. I tried it, and now the model generates sequences normally, though I'm not sure whether it will affect the overall performance.
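For reference, this is roughly how I pass it (assuming the BOS token id is the one to suppress):

    generated_texts = model.generate(
        input_ids,
        begin_suppress_tokens=[tokenizer.bos_token_id],  # don't allow <s> as the first generated token
        **generation_config,
    )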