T5-small trained on a small dataset not inferring anything

Requirement: train a Hugging Face model to extract parts from the user input. The relevant parts to extract are 'entity', 'intention', and 'attributes'.

After interpreting the user input, the model should reply with a JSON object with the following structure:

type Object = {
  entity: string; // mandatory string property
  intention: 'retrieve' | 'create'; // mandatory property that can only be 'retrieve' or 'create'
  attributes?: any; // optional property of any type
};
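For example, an input like "Create a project called Apollo" should yield something along these lines (the values here are illustrative only):

{
  "entity": "project",
  "intention": "create",
  "attributes": { "name": "Apollo" }
}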

I trained a t5-small model on a very small dataset (about 50 examples) that looks like this:

{
    "data": [
        {
            "input": "Give me all projects with a duration of 5 days or more",
            "output": "{'intention': 'retrieve', 'entity': 'projects', 'columns': [ 'Name', 'Id', 'Duration']}"
        },
        {
            "input": "List projects",
            "output": "{'intention': 'retrieve', 'entity': 'projects', 'columns': [ 'Name', 'Id']}"
        },
        {
            "input": "Show me tasks",
            "output": "{'intention': 'retrieve', 'entity': 'tasks', 'columns': [ 'Name', 'Id']}"
        },
        ...
    ]
}

When I test it with an input that exactly matches one from the training dataset (or the evaluation dataset, for that matter), e.g. "Give me all projects with a duration of 5 days or more", the prediction is either empty, a series of dashes, a repetition of the input, or a translation of it into a random language.
This is the relevant bit of code:

from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

train_dataset = load_dataset("json", data_files={"train": "train.json"}, field="data")
eval_dataset = load_dataset("json", data_files={"eval": "eval.json"}, field="data")

def preprocess_function(examples):
    # Prefix each input with the task instruction, then tokenize inputs and targets
    input_texts = [f"extract parameters: {inp}" for inp in examples['input']]
    target_texts = examples['output']
    inputs = tokenizer(input_texts, truncation=True, padding="max_length", max_length=512)
    targets = tokenizer(target_texts, truncation=True, padding="max_length", max_length=512)
    # Use the tokenized targets as labels
    inputs["labels"] = targets["input_ids"]
    return inputs

train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)
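One way to sanity-check the preprocessing is to decode a processed example back to text (a hypothetical debugging snippet, not part of the training script):

sample = train_dataset["train"][0]
# Both decodes should round-trip to the original prefixed input and target strings
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))
print(tokenizer.decode(sample["labels"], skip_special_tokens=True))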

from transformers.trainer_utils import IntervalStrategy

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy=IntervalStrategy.EPOCH, # Set evaluation strategy to epoch
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy=IntervalStrategy.EPOCH, # Set save strategy to epoch
    load_best_model_at_end=True, # Load the best model when training ends
)
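The rest is the standard Trainer wiring (a sketch, since I didn't paste it above; note that load_dataset returns a DatasetDict, so the splits are indexed by the keys used in data_files):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset["train"],
    eval_dataset=eval_dataset["eval"],
)
trainer.train()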

I know my dataset is too small, but I expected the model to at least reproduce the training output when given the exact training input, which makes me believe there is something else wrong.
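
In case it matters, inference looks roughly like this (a minimal sketch; generation uses default parameters):

# Apply the same task prefix that was used during training
input_text = "extract parameters: Give me all projects with a duration of 5 days or more"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))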