mT5 Model Only Returns <extra_id_0> Token When Generating Answers

skluzek · March 17, 2024, 10:22am

I’m working with the mT5 model from the Hugging Face Transformers library for a question-answering chatbot project. Although I fine-tuned the model on a small dataset of question-answer pairs (10 training and 5 validation examples), it consistently returns the token <extra_id_0> as its response, regardless of the input question. Here’s my code.

train.py (trains the model):

import torch
from transformers import (
    MT5ForConditionalGeneration,
    T5Tokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import DatasetDict, Dataset

model_name = "google/mt5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

data = {
    "train": [
        {"input": "Pytanie: Jak się masz?", "output": "Odpowiedź: Dobrze, dziękuję."},
        {"input": "Pytanie: Kim jesteś?", "output": "Odpowiedź: Jestem chatbotem."},
        {"input": "Pytanie: Jaka jest pogoda?", "output": "Odpowiedź: Niestety, nie mogę sprawdzić aktualnej pogody."},
        {"input": "Pytanie: Opowiedz mi coś ciekawego.", "output": "Odpowiedź: Czy wiesz, że Oktawian August był pierwszym cesarzem Rzymu?"},
        {"input": "Pytanie: Jakie masz hobby?", "output": "Odpowiedź: Jako chatbot, spędzam czas na uczeniu się nowych rzeczy."},
        {"input": "Pytanie: Ile masz lat?", "output": "Odpowiedź: Jako program komputerowy, nie mam określonego wieku."},
        {"input": "Pytanie: Co lubisz jeść?", "output": "Odpowiedź: Jako chatbot, nie potrzebuję jedzenia."},
        {"input": "Pytanie: Jak mogę ci pomóc?", "output": "Odpowiedź: Możesz zadawać mi pytania, a ja postaram się na nie odpowiedzieć."},
        {"input": "Pytanie: Czy masz rodzinę?", "output": "Odpowiedź: Jako chatbot, nie mam rodziny w ludzkim znaczeniu tego słowa."},
        {"input": "Pytanie: Czym jest AI?", "output": "Odpowiedź: AI, czyli sztuczna inteligencja, to gałąź informatyki zajmująca się tworzeniem maszyn zdolnych do wykonywania zadań wymagających ludzkiej inteligencji."},
    ],
    "validation": [
        {
            "input": "Pytanie: Co to jest Python?",
            "output": "Odpowiedź: Python to wysokopoziomowy język programowania, znany ze swojej czytelności i elastyczności.",
        },
        {
            "input": "Pytanie: Dlaczego niebo jest niebieskie?",
            "output": "Odpowiedź: Niebo wydaje się niebieskie, ponieważ molekuły powietrza rozpraszają niebieskie światło słoneczne bardziej niż inne kolory.",
        },
        {
            "input": "Pytanie: Co to jest czarna dziura?",
            "output": "Odpowiedź: Czarna dziura to region w przestrzeni, gdzie grawitacja jest tak silna, że nic, nawet światło, nie może się z niej wydostać.",
        },
        {
            "input": "Pytanie: Czym jest fotosynteza?",
            "output": "Odpowiedź: Fotosynteza to proces, w którym rośliny, algi i niektóre bakterie przekształcają światło słoneczne na energię chemiczną.",
        },
        {
            "input": "Pytanie: Jak działa internet?",
            "output": "Odpowiedź: Internet działa poprzez połączenie milionów komputerów na całym świecie za pomocą sieci telekomunikacyjnych i satelitarnych.",
        },
    ],
}


def transform_data_format(data):
    transformed_data = {"input": [], "output": []}
    for item in data:
        transformed_data["input"].append(item["input"])
        transformed_data["output"].append(item["output"])
    return transformed_data

train_data_transformed = transform_data_format(data["train"])
val_data_transformed = transform_data_format(data["validation"])

train_dataset = Dataset.from_dict(train_data_transformed)
val_dataset = Dataset.from_dict(val_data_transformed)
dataset = DatasetDict({"train": train_dataset, "validation": val_dataset})

def tokenize_data(examples):
    model_inputs = tokenizer(examples["input"], padding="max_length", truncation=True, max_length=512)
    labels = tokenizer(examples["output"], padding="max_length", truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(tokenize_data, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    save_strategy="epoch",
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()

model.save_pretrained("./results")
tokenizer.save_pretrained("./results")
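
One thing that may be worth checking in tokenize_data above: the labels are padded to max_length, so most label positions are pad tokens that still count toward the loss. A common workaround is to replace pad IDs in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows only that masking step on plain Python lists. mask_pad_labels is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is an assumption (in practice it would be read from tokenizer.pad_token_id).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_pad_labels(label_ids, pad_token_id):
    """Return a copy of the label sequences with padding IDs set to IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]


# Toy batch: the IDs are invented, and pad_token_id=0 is an assumption.
padded_labels = [[259, 1042, 1, 0, 0], [877, 1, 0, 0, 0]]
masked_labels = mask_pad_labels(padded_labels, pad_token_id=0)
print(masked_labels)
```

In tokenize_data this would be applied to labels["input_ids"] before assigning model_inputs["labels"]. Another option, if dynamic padding is acceptable, is DataCollatorForSeq2Seq, which pads labels with -100 by default.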

Here’s chat.py, which loads the model and answers the user:

from transformers import MT5ForConditionalGeneration, T5Tokenizer

model_checkpoint_path = "./results"

tokenizer = T5Tokenizer.from_pretrained(model_checkpoint_path)
model = MT5ForConditionalGeneration.from_pretrained(model_checkpoint_path)


def generate_answer_seq2seq(question):
    formatted_question = f"translate Polish to Polish: {question}"

    input_ids = tokenizer.encode(formatted_question, return_tensors="pt")

    output_ids = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)

    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return answer


question_seq2seq = "Pytanie: Co robisz?"
print(generate_answer_seq2seq(question_seq2seq))
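
Note that the training inputs in train.py start with "Pytanie:" and carry no "translate Polish to Polish:" prefix, so the prompt built here differs from what the model saw during fine-tuning. A minimal sketch of one way to keep the two in sync, assuming a hypothetical build_prompt helper shared by both scripts:

```python
QUESTION_PREFIX = "Pytanie: "  # same prefix as the training inputs


def build_prompt(question):
    """Return the question in the training format, adding the prefix only once."""
    question = question.strip()
    if question.startswith(QUESTION_PREFIX.strip()):
        return question
    return QUESTION_PREFIX + question


print(build_prompt("Co robisz?"))
print(build_prompt("Pytanie: Co robisz?"))
```

Both calls produce the same prompt, which avoids a doubled prefix when the question already includes it. This alone is unlikely to explain the <extra_id_0> output, but it removes one source of mismatch.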

I expected the model to answer in the style of the outputs defined in the data dictionary.

I’ve formatted my training and validation data as input/output pairs, with questions prefixed with “Pytanie:” (“Question:”) and answers prefixed with “Odpowiedź:” (“Answer:”).
The model has been trained using the Trainer API from Hugging Face, with a custom dataset created from a dictionary and loaded into a DatasetDict object.
After training, when using the model to generate answers to new questions, it only returns the <extra_id_0> token, not a readable text answer.
I’ve experimented with various generation parameters (max_length, num_beams, early_stopping, etc.) but still only receive the <extra_id_0> token as output.
I’ve also tried changing formatted_question to f"Pytanie: {question}", but that gives me the same token as output.
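
The training setup described above can also be checked with some quick arithmetic. With 10 training examples, a batch size of 4, and 3 epochs, training runs for only a handful of optimizer steps, far fewer than warmup_steps=500, so the learning rate may never leave the start of its warmup ramp. The sketch assumes a single device, no gradient accumulation, the default TrainingArguments learning rate of 5e-5, and linear warmup. The function names are hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate under linear warmup (the decay after warmup is not modeled)."""
    return base_lr * min(1.0, step / warmup_steps)


steps = total_optimizer_steps(num_examples=10, batch_size=4, num_epochs=3)
final_lr = linear_warmup_lr(step=steps, base_lr=5e-5, warmup_steps=500)
print(f"optimizer steps: {steps}")
print(f"learning rate at the last step: {final_lr:.2e}")
```

Under these assumptions the run has 9 optimizer steps and the learning rate reaches only about 9e-7, under 2% of the configured rate, so the weights may barely move from the pretrained checkpoint. Since the pretrained mT5 checkpoint was trained on span corruption, sentinel tokens such as <extra_id_0> would be a plausible output in that case. Setting warmup_steps to 0, training for more epochs, and using more data are possible things to try.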

I’m looking for guidance on troubleshooting this issue and any tips on optimizing the mT5 model for a question-answering chatbot application. Any advice or insights from those with experience in working with mT5 or similar models for text generation would be greatly appreciated.
