Hey there.
For a project I want to create sentence embeddings of text data. I managed to produce embeddings with the paraphrase-multilingual-MiniLM-L12-v2 model, but they were not satisfactory. Therefore, I wanted to train the model on a part of my data and then use the finetuned model to create the embeddings.
Since I have no idea how to code, I mainly used GPT4 to help me. However, it does not succeed in creating a working code, so I would like to ask you to help me produce a code that works.
Here’s my code:
import sys
if ‘google.colab’ in sys.modules: # If in Google Colab environment
# Installing requisite packages
!pip install datasets transformers evaluate
!pip install accelerate -U
!pip install -U sentence-transformers
# Mount google drive to enable access to data files
from google.colab import drive
drive.mount('/content/drive')
# Change working directory to desired folder
%cd /content/drive/MyDrive/HS_23_Msc
Import necessary libraries
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, get_linear_schedule_with_warmup
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import models
Load your dataset
import pandas as pd
dat = pd.read_csv(‘Embeddings_text.csv’)
sentences = dat[‘word’].tolist()
Create the training samples. Here, we’re using each sentence as its own label for simplicity.
train_examples = [InputExample(texts=[sentence], label=0) for sentence in sentences]
Create a DataLoader for the training samples
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
Define the model and the training procedure
model_name = ‘sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2’
word_embedding_model = models.Transformer(model_name, max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Define optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=10000)
from torch.utils.data import DataLoader, Dataset
from sentence_transformers.losses import CosineSimilarityLoss
class SentenceDataset(Dataset):
def init(self, sentences):
self.sentences = sentences
def __len__(self):
return len(self.sentences)
def __getitem__(self, idx):
return self.sentences[idx], self.sentences[idx] # returning sentence, label (they are the same in this case)
Use your dataset
sentence_dataset = SentenceDataset(dat[‘word+comment’].tolist())
train_dataloader = DataLoader(sentence_dataset, batch_size=batch_size, shuffle=True)
Define loss and optimizer
train_loss = CosineSimilarityLoss(model)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Train the model
model.train()
for epoch in range(num_epochs):
for sentences, labels in train_dataloader:
# Directly compute loss using the sentences and labels
loss_value = train_loss(sentences, labels)
loss_value.backward()
optimizer.step()
optimizer.zero_grad()
print(“Training complete!”)
Creating embeddings for the entire dataset
features = model.encode(sentences, convert_to_numpy=True)
Exporting the embeddings to a csv
features_df = pd.DataFrame(features)
features_df.to_csv(‘features_tuned.csv’)