<extra_id> when using fine-tuned MT5 for generation

Hi, I am trying to summarize Japanese text.

I noticed that you recently added a new script for fine-tuning Seq2Seq models.

So I fine-tuned the MT5 model on my Japanese dataset, which contains 100 samples.

CUDA_VISIBLE_DEVICES=0 python examples/seq2seq/run_seq2seq.py \
  --model_name_or_path google/mt5-small \
  --do_train --do_eval --task summarization \
  --train_file ~/summary/train.csv --validation_file ~/summary/val.csv \
  --output_dir ~/tmp/tst-summarization \
  --overwrite_output_dir \
  --per_device_train_batch_size=4 --per_device_eval_batch_size=4 \
  --predict_with_generate \
  --text_column article --summary_column summary

Then I loaded this fine-tuned model for prediction.

import transformers
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)

model_path = "../tmp/tst-summarization/"
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
)

# article = "AI婚活のイメージ内閣府は人工知能(AI)やビッグデータを使った自治体の婚活事業支援に本腰を入れる。AIが膨大な情報を分析し、「相性の良い人」を提案する。お見合い実施率が高まるといった効果が出ている例もあり、2021年度から自治体への補助を拡充し、システム導入を促す。未婚化、晩婚化が少子化の主な要因とされており、結婚を希望する人を後押しする。これまでは本人が希望する年齢や身長、収入などの条件を指定し、その条件に合った人を提示する形が主流だった。AI婚活では性格や価値観などより細かく膨大な会員情報を分析。本人の希望条件に限らずお薦めの人を選び出し、お見合いに進む。"
article = """
The man was arrested as he waited to board a plane
at Johannesburg airport. Officials said a scan of
his body revealed the diamonds he had ingested,
worth $2.3m (£1.4m; 1.8m euros), inside. The man
was reportedly of Lebanese origin and was
travelling to Dubai. "We nabbed him just before he
went through the security checkpoint," Paul
Ramaloko, spokesman of the South Africa elite
police unit the Hawks said, according to Agence
France Presse. Authorities believe the man belongs
to a smuggling ring. Another man was arrested in
March also attempting to smuggle diamonds out the
country in a similar way. South Africa is among
the world's top producers of diamonds.
"""
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=[article], max_length=512, truncation=True, return_tensors="pt")
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=128,
                             min_length=50, no_repeat_ngram_size=2,
                             early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True,
                        clean_up_tokenization_spaces=True) for g in summary_ids])

The results for the Japanese and English articles are:

['<extra_id_0>。AI婚活では性格や価値観の多様性を分析し、結婚を希望する人を後押しする。本人の希望条件を把握し、「相性の良い人」を提示する形が主流だといえるでしょう。 AI婚活は「相性のいい人」。']
["<extra_id_0> of diamonds was reportedly of Lebanese origin and was travelling to Dubai in March. Johannesburg - South Africa.com.an... <extra_id_51> the man's body, worth $2.3m (£1.4m euros)"]

My question is about <extra_id>, the sentinel token used in T5's unsupervised pre-training, which shows up in the output. I don't think it should appear in the generated text. I tried adding the prefix "summarize: ", but that didn't help. Is there a problem with how I fine-tuned the model or how I am using it? Thanks in advance.

@sshleifer @valhalla @sgugger

I'm having the same issue: the extra_id token more or less replaces the first word of the sentence. Does anyone know why?

I’ve got the same problem. Have you got a solution to this? @Tom @HeroadZ ?

Hi, I fine-tuned MBart and MT5 with the new script in examples/seq2seq, and the problem disappeared. But I still don't know what caused it.

I actually got the problem with the new script. Did you use any specific arguments, besides the ones shown in the Readme?

No specific arguments. The problem appears when the training set has only 100 samples. When I fine-tune the model on 100K samples, the generated summaries look fine in my case.

I’m stuck with this issue too.
Does anyone know why this happens?
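
While the cause is still unclear, one stopgap is to strip the sentinel tokens from the decoded string after generation. The sketch below is only a post-processing workaround under the assumption that the tokens show up literally as <extra_id_N> in the decoded text, as in the outputs above. It does not fix the underlying model behaviour, and the helper name strip_sentinels is made up for this example.

```python
import re

# Matches T5-style sentinel tokens such as <extra_id_0> or <extra_id_51>.
SENTINEL_RE = re.compile(r"<extra_id_\d+>")


def strip_sentinels(text: str) -> str:
    """Remove sentinel tokens and collapse the whitespace they leave behind.

    Hypothetical helper: it only cleans the decoded string and does not
    change what the model generates.
    """
    without_sentinels = SENTINEL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_sentinels).strip()


# Shortened version of the English output shown earlier in the thread.
raw = "<extra_id_0> of diamonds was reportedly of Lebanese origin. <extra_id_51> the man's body"
cleaned = strip_sentinels(raw)
print(cleaned)
```

In the original snippet this would be applied to each string returned by tokenizer.decode. Since the sentinel seems to take the place of the first word, the cleaned text can still start mid-sentence, so fine-tuning on more data (as reported above) is probably the more reliable fix.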