Fine-tune mT5 to translate natural-language requests into bash commands

Hello,

I would like to raise a couple of issues on the forum, to see if you can help me.

First, I would like to validate my strategy for solving a problem, and then see if you can help me with implementation errors.

I am new to these types of techniques, so I really don’t know if the way I proceeded was correct or if I could have approached the problem in a different way.

The problem is as follows:

I would like to build a program that receives a request in natural language (initially in Spanish) and returns the bash command that solves it.

For example, if I tell it “I want to know the date” it returns “date”, and if I tell it “give me the time” it returns “date +%T”. I know that in many cases it will be easier and faster to type the command than to phrase the request in natural language, but I would like to do it this way to learn and because I find it an interesting challenge.

To simplify the problem I have focused first on the cat, ls and cd commands. Perhaps cat is not one of the simplest, since it often serves as the starting point for more complex requests.

I have started with the dataset, focusing first on cat, and with the help of ChatGPT-4 I have generated a set of 4,000 records, although my intention is initially to build about 10k for each of these most-used commands. I have tried to give the dataset variety, so I have included requests of different types: using different verbs, with different structures, from different user profiles, requests that are solved with a simple expression and requests that call for more complex bash expressions…

As a pretrained model I have chosen mT5, as I believe it is the one that best suits this problem: it is multilingual, and its encoder/decoder architecture is well suited to translation tasks.

Although I do not yet have the complete dataset, I think I can test with the 4,000 requests. Let’s see if I can make it turn “Show me file.txt” into “cat file.txt”.
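For context on the format: the script below expects a headerless CSV whose two columns are separated by “##”, with the natural-language request on the left and the command on the right. A couple of hypothetical records (invented here for illustration, not taken from the real dataset) would look like this:

show me the contents of file.txt##cat file.txt
print the first ten lines of log.txt##cat log.txt | head -n 10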

On Google Cloud I have provisioned a VM with an L4 GPU (24 GB). I have installed the necessary libraries and run the following code:

#!/usr/bin/python3
from datasets import Dataset, DatasetDict, load_metric
from transformers import MT5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments, DataCollatorWithPadding
from sklearn.model_selection import train_test_split
import pandas as pd
import torch


# Variables
model_name = "google/mt5-small"
dataset_path = './cat.dataset.csv'
cache_dir = "cache_dir/"
checkpoint = None 


# We load the dataset as a DatasetDict with training, validation and test splits (roughly 80/10/10)
print("==================================================================================")
data_df = pd.read_csv(dataset_path, delimiter="##", header=None, names=["request", "command"], engine='python')
data_df['request'] = data_df['request'].str.lower()
data_df['command'] = data_df['command'].str.lower()
train_df, test_df = train_test_split(data_df, test_size=0.1, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

print("Dataset Data: ")
print("Dataset: ", dataset_dict)
print("Max len of request column ", data_df['request'].str.len().max())
print("Max len of command column ", data_df['command'].str.len().max())
print("Features: ",train_dataset.features)


# We load the tokenizer
print("==================================================================================")
tokenizer = T5Tokenizer.from_pretrained(model_name, cache_dir=cache_dir)

print("Tokenizer Data:")
print("Tokenizer: ", tokenizer)
print("Tokenizer special tokens: ", tokenizer.special_tokens_map)
# print("Tokenizer vocab: ", tokenizer.get_vocab())
print("Tokenizer vocab size: ", len(tokenizer.get_vocab()))


# We tokenize the dataset
print("==================================================================================")
def tokenize_function(examples):
    inputs = tokenizer(examples["request"], truncation=True, padding='max_length', max_length=64)
    labels = tokenizer(examples["command"], truncation=True, padding='max_length', max_length=64)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_dataset = dataset_dict.map(
  tokenize_function, 
  batched=True
)

print("Tokenized dataset Data: ")
print("Tokenized dataset: ", tokenized_dataset)


# Training arguments
print("==================================================================================")
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    optim="adamw_torch",
    gradient_accumulation_steps=4,
)
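# With per_device_train_batch_size=1 and gradient_accumulation_steps=4, the
# effective training batch size is 4 examples per optimizer update.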

print("Training arguments Data: ")
print("Training arguments: ", training_args)

# We load the model
print("==================================================================================")
model = MT5ForConditionalGeneration.from_pretrained(model_name, cache_dir=cache_dir)

print("Model Data: ")
print("Model: ", model)

# We create the trainer and train
print("==================================================================================")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def compute_metrics(eval_preds):
    metric = load_metric('sacrebleu')  # Other metrics such as "bleu" or "rouge" could be used instead
    preds = tokenizer.batch_decode(eval_preds.predictions, skip_special_tokens=True)
    labels = tokenizer.batch_decode(eval_preds.label_ids, skip_special_tokens=True)
    # sacrebleu expects a list of prediction strings and, for each one, a list of reference strings
    sacrebleu_score = metric.compute(predictions=preds, references=[[label] for label in labels])
    return {
        'sacrebleu': sacrebleu_score["score"],
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, 
)

trainer.train(resume_from_checkpoint=checkpoint)

But I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.63 GiB (GPU 0; 22.02 GiB total capacity; 12.14 GiB already allocated; 7.58 GiB free; 13.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried with different batch sizes and gradient_accumulation_steps, but I always get the same error.

It usually happens during what I think is the evaluation. I do not have enough GPU memory, but I do not know whether it is a problem with the VM or with the implementation. Can anyone help me?

I don’t know what hardware I would need to fine-tune a model of this type.

I would appreciate any comments on the problem, the strategy or the error.

Thank you very much for everything.

Best regards.

Hello,

I have been experimenting with the code. I removed the compute_metrics function and now it works, even with larger batch sizes.
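If I later want the sacrebleu metric back without running out of memory, a variant I plan to try (just a sketch under my assumptions, not tested yet) is to switch to Seq2SeqTrainer with predict_with_generate=True, so that evaluation works with generated token ids instead of the full logits, and to compute the metric with the separate evaluate package:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import evaluate  # assumes the separate "evaluate" package is installed
import numpy as np

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # With predict_with_generate=True the predictions are already generated token ids
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace any -100 in the labels with the pad token id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = metric.compute(predictions=decoded_preds,
                            references=[[label] for label in decoded_labels])
    return {"sacrebleu": result["score"]}

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    optim="adamw_torch",
    gradient_accumulation_steps=4,
    predict_with_generate=True,
    generation_max_length=64,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

The rest of the script (tokenizer, dataset preparation, model loading) would stay the same.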

After fine-tuning the mT5 model for about 40 epochs on my data, it gets many of the predictions right. Many others, the more complex ones, it does not. But it is enough to motivate me to keep experimenting. I will continue with the dataset and, when it is more complete, I will upload it to the platform.
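For anyone curious, this is roughly how I am checking individual predictions after training (a minimal sketch, reusing the model and tokenizer from the script above; the request is lowercased because the dataset was lowercased):

# Quick check after training, reusing the in-memory model and tokenizer
request = "show me file.txt"
inputs = tokenizer(request.lower(), return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # hopefully: cat file.txt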

Thank you very much. I am learning a lot from your courses and I think it’s an amazing library and platform. I hope to be able to contribute to the community in the same way.