Requirement: train a Hugging Face model to extract parts from the user input. The relevant parts to extract are “entity”, “intention”, and “attributes”.
After interpreting the user input, the model should reply with a JSON object with the following structure:
type Object = {
  entity: string; // mandatory string property
  intention: 'retrieve' | 'create'; // mandatory property that can only be 'retrieve' or 'create'
  attributes?: any; // optional property of any type
};
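To make the target structure concrete, here is a minimal Python sketch of a check that a model reply satisfies it. The function name `validate_reply` and the constant `ALLOWED_INTENTIONS` are hypothetical and not part of the original setup; the sketch assumes the reply is a strict JSON string.

```python
import json

# Hypothetical constant: the two values the type allows for 'intention'.
ALLOWED_INTENTIONS = ("retrieve", "create")

def validate_reply(raw: str) -> dict:
    """Parse a model reply and check it against the structure described above."""
    obj = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(obj, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(obj.get("entity"), str):
        raise ValueError("'entity' must be a string")
    if obj.get("intention") not in ALLOWED_INTENTIONS:
        raise ValueError("'intention' must be 'retrieve' or 'create'")
    # 'attributes' is optional and may hold any JSON value, so it is not checked.
    return obj

reply = validate_reply('{"entity": "projects", "intention": "retrieve"}')
print(reply["entity"])
```

A check like this could be run on every prediction during evaluation, so malformed replies are counted rather than inspected by eye.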
I trained a t5-small model with a very small dataset (about 50 objects) that looks like this:
{
  "data": [
    {
      "input": "Give me all projects with a duration of 5 days or more",
      "output": "{'intention': 'retrieve', 'entity': 'projects', 'columns': [ 'Name', 'Id', 'Duration']}"
    },
    {
      "input": "List projects",
      "output": "{'intention': 'retrieve', 'entity': 'projects', 'columns': [ 'Name', 'Id']}"
    },
    {
      "input": "Show me tasks",
      "output": "{'intention': 'retrieve', 'entity': 'tasks', 'columns': [ 'Name', 'Id']}"
    },
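One thing worth noting about the samples above: the `output` strings use single quotes, so they are Python-literal style rather than strict JSON, and `json.loads` would reject them. A minimal sketch of converting such a target into strict JSON follows. The helper name `to_json_target` is hypothetical, and the sketch leaves the keys untouched (the samples use `columns`, while the type above declares `attributes`).

```python
import ast
import json

def to_json_target(python_style: str) -> str:
    """Convert a single-quoted, Python-literal style target into strict JSON."""
    obj = ast.literal_eval(python_style)  # evaluates literals only, no arbitrary code
    return json.dumps(obj)

sample = "{'intention': 'retrieve', 'entity': 'tasks', 'columns': [ 'Name', 'Id']}"
print(to_json_target(sample))
# {"intention": "retrieve", "entity": "tasks", "columns": ["Name", "Id"]}
```

It may also be worth checking that the tokenizer round-trips the braces, for example by encoding and decoding one target string: the original T5 SentencePiece vocabulary is reported not to contain `{` and `}`, in which case they would be mapped to the unknown token. This is an assumption to verify, not something confirmed for this setup.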
When I test it with an input that looks exactly like the ones in the training dataset (or the evaluation dataset, for that matter), for example “Give me all projects with a duration of 5 days or more”, the prediction is either empty, a series of dashes, a repetition of the input, or a translation of the input into a random language.
This is the relevant bit of code:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
train_dataset = load_dataset("json", data_files={"train": "train.json"}, field="data")
eval_dataset = load_dataset("json", data_files={"eval": "eval.json"}, field="data")
def preprocess_function(examples):
    input_texts = [f"extract parameters: {inp}" for inp in examples['input']]
    target_texts = examples['output']
    inputs = tokenizer(input_texts, truncation=True, padding="max_length", max_length=512)
    targets = tokenizer(target_texts, truncation=True, padding="max_length", max_length=512)
    inputs["labels"] = targets["input_ids"]
    return inputs
train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)
from transformers.trainer_utils import IntervalStrategy
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy=IntervalStrategy.EPOCH,  # Set evaluation strategy to epoch
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy=IntervalStrategy.EPOCH,  # Set save strategy to epoch
    load_best_model_at_end=True,  # Load the best model when training ends
)
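One detail in the preprocessing above that may matter: the targets are padded to 512 tokens and the padded ids are used directly as labels, so most label positions are padding and still count toward the loss. A common convention is to replace pad ids in the labels with -100, which the cross-entropy loss used by `T5ForConditionalGeneration` ignores. The sketch below shows that replacement on toy ids; the helper name `mask_padding` is hypothetical, and the token ids are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Replace every pad id in the labels with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy ids for illustration; 0 stands in for the pad id (T5's pad_token_id is 0).
labels = [[71, 1712, 1, 0, 0], [19, 1, 0, 0, 0]]
print(mask_padding(labels, pad_token_id=0))
# [[71, 1712, 1, -100, -100], [19, 1, -100, -100, -100]]
```

In `preprocess_function` this would presumably be applied as `inputs["labels"] = mask_padding(targets["input_ids"], tokenizer.pad_token_id)`. Note also that `load_dataset` with `data_files={"train": ...}` returns a `DatasetDict`, so the split (for example `train_dataset["train"]`) is what would normally be passed to `Trainer`; the `Trainer` call itself is not shown here, so this is only a point to check.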
I know my dataset is too small, but I expected the model to at least reproduce the training output when given the same training input, which makes me believe something else is wrong.
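Whether a model can reproduce its training outputs also depends on how many weight updates it receives. As a rough worked calculation for the settings above (about 50 examples, batch size 8, 3 epochs), assuming a single device, no gradient accumulation, and that the last partial batch is kept:

```python
import math

def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of weight updates, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch kept
    return steps_per_epoch * epochs

print(optimizer_steps(num_examples=50, batch_size=8, epochs=3))  # 21
```

That is roughly 21 updates at a learning rate of 2e-5, which is likely too few for t5-small to memorize even these examples. The Hugging Face T5 documentation suggests that learning rates around 1e-4 to 3e-4 tend to work well for fine-tuning, so a higher rate and many more epochs would be a reasonable first experiment; the exact values are something to tune, not a guaranteed fix.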