Code to train a model on my dataset is not working

Choso · October 24, 2023, 6:12pm

Hey there.
For a project I want to create sentence embeddings of text data. I managed to produce embeddings with the paraphrase-multilingual-MiniLM-L12-v2 model, but they were not satisfactory. Therefore, I wanted to train the model on a part of my data and then use the finetuned model to create the embeddings.
Since I have no idea how to code, I mainly used GPT4 to help me. However, it does not succeed in creating a working code, so I would like to ask you to help me produce a code that works.

Here’s my code:
import sys
if ‘google.colab’ in sys.modules: # If in Google Colab environment
# Installing requisite packages
!pip install datasets transformers evaluate
!pip install accelerate -U
!pip install -U sentence-transformers

# Mount google drive to enable access to data files
from google.colab import drive
drive.mount('/content/drive')

# Change working directory to desired folder
%cd /content/drive/MyDrive/HS_23_Msc

Import necessary libraries

import pandas as pd

import torch

from torch.utils.data import DataLoader, Dataset

from transformers import AdamW, get_linear_schedule_with_warmup

from sentence_transformers import SentenceTransformer, InputExample, losses

from sentence_transformers import models

Load your dataset

import pandas as pd

dat = pd.read_csv(‘Embeddings_text.csv’)

sentences = dat[‘word’].tolist()

Create the training samples. Here, we’re using each sentence as its own label for simplicity.

train_examples = [InputExample(texts=[sentence], label=0) for sentence in sentences]

Create a DataLoader for the training samples

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

Define the model and the training procedure

model_name = ‘sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2’

word_embedding_model = models.Transformer(model_name, max_seq_length=256)

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

Define optimizer and scheduler

optimizer = AdamW(model.parameters(), lr=1e-5)

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=10000)
from torch.utils.data import DataLoader, Dataset
from sentence_transformers.losses import CosineSimilarityLoss
class SentenceDataset(Dataset):
def init(self, sentences):
self.sentences = sentences

def __len__(self):
    return len(self.sentences)

def __getitem__(self, idx):
    return self.sentences[idx], self.sentences[idx]  # returning sentence, label (they are the same in this case)

Use your dataset

sentence_dataset = SentenceDataset(dat[‘word+comment’].tolist())
train_dataloader = DataLoader(sentence_dataset, batch_size=batch_size, shuffle=True)

Define loss and optimizer

train_loss = CosineSimilarityLoss(model)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Train the model

model.train()
for epoch in range(num_epochs):
for sentences, labels in train_dataloader:
# Directly compute loss using the sentences and labels
loss_value = train_loss(sentences, labels)
loss_value.backward()
optimizer.step()
optimizer.zero_grad()

print(“Training complete!”)

Creating embeddings for the entire dataset

features = model.encode(sentences, convert_to_numpy=True)

Exporting the embeddings to a csv

features_df = pd.DataFrame(features)

features_df.to_csv(‘features_tuned.csv’)

Topic		Replies	Views
SentenceTransformer TrainingArguments torch and accelerate version issue Beginners	0	98	August 28, 2024
Finetuning Transformers for Text Classification Issue 🤗Transformers	2	706	May 11, 2023
Trouble saving and loading a finetuned model Beginners	1	309	July 7, 2024
Language training on a model Beginners	1	402	August 27, 2023
What is wrong with my code Beginners	0	40	October 22, 2024