Hello,
I would like to raise a couple of questions here in the forum, to see if you can help me.
First, I would like to validate my strategy for solving a problem, and then see if you can help me with implementation errors.
I am new to these types of techniques, so I really don’t know if the way I proceeded was correct or if I could have approached the problem in a different way.
The problem is as follows:
I would like to be able to make a program that receives a request in natural language (initially in Spanish) and is capable of returning the bash command that solves it.
For example, if I tell it “I want to know the date” it returns “date” or if I tell it “give me the time” it returns “date +%T”. I know that in many cases it will be easier and faster to write the command than to make the request in natural language, but I would like to do it this way to learn and because I find it an interesting challenge.
To simplify the problem I have first focused on the cat, ls and cd commands. Perhaps cat is not the simplest one to start with, as it often serves as the starting point for more complex requests.
I started with the dataset. Focusing first on cat, I generated a set of 4000 records with the help of ChatGPT4, although my intention is to build about 10k records for each of these commonly used commands. I have tried to give the dataset variety, so I included requests of different types, using different verbs, different structures and different user profiles, ranging from requests that are solved with a simple expression to requests that call for more complex bash expressions.
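A minimal sketch of how such records might be laid out and parsed, assuming one request/command pair per line separated by `##` (the separator matches the `delimiter` passed to `pd.read_csv` in the script in this post). The sample rows are built from the examples given in this post rather than taken from the real dataset, and the helper name `parse_record` is hypothetical.

```python
from typing import List, Tuple

SEPARATOR = "##"  # assumption: same separator the loading script passes to pandas


def parse_record(line: str) -> Tuple[str, str]:
    """Split one 'request##command' line into a (request, command) pair."""
    # Split only on the first separator so a command containing '##' stays intact.
    request, command = line.rstrip("\n").split(SEPARATOR, 1)
    return request.strip(), command.strip()


# Illustrative rows only, built from the examples in this post.
sample_lines: List[str] = [
    "I want to know the date##date",
    "give me the time##date +%T",
    "Show me file.txt##cat file.txt",
]

records = [parse_record(line) for line in sample_lines]
for request, command in records:
    print(f"{request!r} -> {command!r}")
```

Note that the command side is kept exactly as written here: bash is case-sensitive, so any normalization applied to the request is not necessarily safe for the command.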
As the pretrained model I have chosen mT5, as I believe it is the one that best suits this problem: it is multilingual, and its encoder-decoder architecture is well suited to translation-like tasks.
Although I do not yet have the complete dataset, I think I can already test with the 4000 requests. Let’s see if I can make it solve “Show me file.txt” as “cat file.txt”.
On Google Cloud I am using a VM with an L4 GPU (24 GB). I have installed the necessary libraries and run the following code:
#!/usr/bin/python3
# Note: load_metric is deprecated in recent releases of datasets in favor of the evaluate library
from datasets import Dataset, DatasetDict, load_metric
from transformers import MT5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments, DataCollatorWithPadding
from sklearn.model_selection import train_test_split
import pandas as pd
import torch
# Variables
model_name = "google/mt5-small"
dataset_path = './cat.dataset.csv'
cache_dir = "cache_dir/"
checkpoint = None
# We load the dataset as a DatasetDict with the training, validation and test datasets
# (about 81/9/10, since the validation split takes 10% of the 90% left after the test split)
print("==================================================================================")
data_df = pd.read_csv(dataset_path, delimiter="##", header=None, names=["request", "command"], engine='python')
data_df['request'] = data_df['request'].str.lower()
data_df['command'] = data_df['command'].str.lower()
train_df, test_df = train_test_split(data_df, test_size=0.1, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})
print("Dataset Data: ")
print("Dataset: ", dataset_dict)
print("Max len of request column ", data_df['request'].str.len().max())
print("Max len of command column ", data_df['command'].str.len().max())
print("Features: ",train_dataset.features)
# We load the tokenizer
print("==================================================================================")
tokenizer = T5Tokenizer.from_pretrained(model_name, cache_dir=cache_dir)
print("Tokenizer Data:")
print("Tokenizer: ", tokenizer)
print("Tokenizer special tokens: ", tokenizer.special_tokens_map)
# print("Tokenizer vocab: ", tokenizer.get_vocab())
print("Tokenizer vocab size: ", len(tokenizer.get_vocab()))
# We tokenize the dataset
print("==================================================================================")
def tokenize_function(examples):
    inputs = tokenizer(examples["request"], truncation=True, padding='max_length', max_length=64)
    labels = tokenizer(examples["command"], truncation=True, padding='max_length', max_length=64)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_dataset = dataset_dict.map(
    tokenize_function,
    batched=True
)
print("Tokenized dataset Data: ")
print("Tokenized dataset: ", tokenized_dataset)
# Training arguments
print("==================================================================================")
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    evaluation_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    optim="adamw_torch",
    gradient_accumulation_steps=4,
)
print("Training arguments Data: ")
print("Training arguments: ", training_args)
# We load the model
print("==================================================================================")
model = MT5ForConditionalGeneration.from_pretrained(model_name, cache_dir=cache_dir)
print("Model Data: ")
print("Model: ", model)
# We create the trainer and train
print("==================================================================================")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
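def compute_metrics(eval_preds):
    metric = load_metric('sacrebleu') # You can use other metrics like "bleu", "rouge" etc.
    # With the plain Trainer, predictions are logits (possibly inside a tuple), not token ids
    logits = eval_preds.predictions[0] if isinstance(eval_preds.predictions, tuple) else eval_preds.predictions
    pred_ids = logits.argmax(axis=-1)
    preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels = tokenizer.batch_decode(eval_preds.label_ids, skip_special_tokens=True)
    # sacrebleu expects one prediction string and one list of reference strings per example
    sacrebleu_score = metric.compute(predictions=preds, references=[[label] for label in labels])
    return {
        'sacrebleu': sacrebleu_score["score"],
    }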
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train(checkpoint if checkpoint else None)
But I get the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.63 GiB (GPU 0; 22.02 GiB total capacity; 12.14 GiB already allocated; 7.58 GiB free; 13.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have tried different batch sizes and gradient_accumulation_steps values, but I always get the same error.
It usually happens during what I think is the evaluation step. I seem to be running out of GPU memory, but I do not know whether the problem lies in the VM or in the implementation. Can anyone help me?
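A rough back-of-the-envelope check can indicate whether the evaluation step alone could account for an allocation of this size. The sketch below rests on assumptions that the post does not confirm: that the plain `Trainer` keeps the full float32 logits for the evaluation set when `compute_metrics` is set, that every example is padded to 64 tokens as in the script, and that the mT5 output vocabulary has 250112 entries. The function name `logits_gib` is hypothetical.

```python
GIB = 2 ** 30


def logits_gib(num_examples: int, seq_len: int = 64,
               vocab_size: int = 250_112, bytes_per_value: int = 4) -> float:
    """Size in GiB of a float32 logits tensor of shape (num_examples, seq_len, vocab_size)."""
    return num_examples * seq_len * vocab_size * bytes_per_value / GIB


# Validation split in the script: 10% of the 90% left after the test split of ~4000 records.
val_examples = round(4000 * 0.9 * 0.1)

print(f"1 example:    {logits_gib(1) * 1024:.1f} MiB")
print(f"128 examples: {logits_gib(128):.2f} GiB")
print(f"{val_examples} examples: {logits_gib(val_examples):.2f} GiB")
```

Under these assumptions, the logits for 128 examples come to about 7.63 GiB, the same figure as the failed allocation in the error message, and the full validation split would need more than 21 GiB. That would be consistent with memory growing as evaluation outputs accumulate, which is not affected by the training batch size. If this is what is happening, `Seq2SeqTrainer` with `predict_with_generate=True`, the `preprocess_logits_for_metrics` argument of `Trainer`, and `eval_accumulation_steps` in `TrainingArguments` are options worth looking at.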
I don’t know what hardware I would need to fine-tune a model of this type.
I would appreciate any comments on the problem, the strategy or the error.
Thank you very much for everything.
Best regards.