My problem: I’m trying to train a siamese model based on xlm-roberta-base for semantic sentence comparison. I’ve noticed that the loss on the training data fluctuates but doesn’t decrease, and the same happens on the test data. I tried changing the learning rate (1e-1, 1e-5, 1e-22); the only effect was that the test loss fluctuated with a smaller amplitude, while the train loss kept fluctuating with the same amplitude. I tried both SGD and Adam as optimizers and had the same problem with each, with weight_decay = 0. I’ve also noticed that almost all values in the outputs of my siamese model tend to 0.99, and I can’t even overfit the model on a small training dataset. Please help me find the mistake.
My dataset consists of sentence pairs with a continuous similarity coefficient in the interval [0, 1]. I tried using a sigmoid, but settled on (cos_sim + 1)/2 because this choice doesn’t change anything globally.
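(Just to make the rescaling explicit: it only maps the cosine similarity from [-1, 1] onto [0, 1] so it lives on the same scale as the labels. Toy illustration, not part of my pipeline:)

    import torch

    cos_sim = torch.tensor([-1.0, 0.0, 0.5, 1.0])
    rescaled = (cos_sim + 1) / 2   # -> [0.0, 0.5, 0.75, 1.0]
    print(rescaled)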
Remark: I’ve tried a different approach. I fine-tuned an ordinary xlm-roberta-base model where the input contains both tokenized sentences, using “xlm-roberta for sequence classification” with a single output neuron. This approach worked perfectly on my dataset, so I can say there is no problem with my data. But it is slow for brute-force pairwise comparison, which is why I want to train a siamese model. Roughly, the cross-encoder setup looked like the sketch below.
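(A sketch of that cross-encoder baseline; the exact arguments here are illustrative, not my verbatim code:)

    from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    # Single regression head (one output neuron) on top of the pooled output
    model = XLMRobertaForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=1, problem_type="regression"
    )

    # Both sentences go into one input; the tokenizer inserts the separator tokens
    enc = tokenizer("first sentence", "second sentence",
                    truncation=True, padding=True, return_tensors="pt")
    score = model(**enc).logits.squeeze(-1)  # one similarity score per pair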
My criterion:
criterion = nn.MSELoss()
Here’s my siamese model’s architecture:
class SiameseNetwork(nn.Module):
    def __init__(self, roberta):
        super().__init__()
        self.roberta = roberta
        self.cosine_similarity = nn.CosineSimilarity(dim=1)

    def forward(self, input_ids1, attention_mask1, input_ids2, attention_mask2):
        # Embedding of the first token (<s>, the CLS-like token) of each sentence
        out1 = self.roberta(input_ids=input_ids1, attention_mask=attention_mask1)[0][:, 0]
        out2 = self.roberta(input_ids=input_ids2, attention_mask=attention_mask2)[0][:, 0]
        cos_sim = self.cosine_similarity(out1, out2)
        del out1, out2
        gc.collect()
        torch.cuda.empty_cache()
        # Rescale cosine similarity from [-1, 1] to [0, 1] to match the labels
        return (cos_sim + 1) / 2
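The objects used in the training loop below (tokenizer, base model, optimizer, device) aren’t shown above; they were created roughly like this (a sketch, with lr = 1e-5 as one of the values I tried):

    from transformers import XLMRobertaModel, XLMRobertaTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
    roberta = XLMRobertaModel.from_pretrained("xlm-roberta-base")
    siamese = SiameseNetwork(roberta).to(device)

    # Adam with weight_decay = 0; 1e-5 is one of the learning rates I tried
    optimizer = torch.optim.Adam(siamese.parameters(), lr=1e-5, weight_decay=0)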
Here’s my training code:
siamese.train()
for epoch in range(25):
    total_mse_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids1 = batch[0].to(device)
        attention_mask1 = batch[1].to(device)
        input_ids2 = batch[2].to(device)
        attention_mask2 = batch[3].to(device)
        labels = batch[4].to(device)
        outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_mse_loss += loss.item()
        del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
        torch.cuda.empty_cache()
        gc.collect()
    avg_train_mse = total_mse_loss / len(train_loader)
    print(f"Epoch {epoch+1} - Training MSE: {avg_train_mse}")

    siamese.eval()
    test_mse_loss = 0
    for batch in test_loader:
        with torch.no_grad():
            input_ids1 = batch[0].to(device)
            attention_mask1 = batch[1].to(device)
            input_ids2 = batch[2].to(device)
            attention_mask2 = batch[3].to(device)
            labels = batch[4].to(device)
            outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
            loss = criterion(outputs, labels)
            test_mse_loss += loss.item()
            del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
            torch.cuda.empty_cache()
            gc.collect()
    test_mse_loss /= len(test_loader)
    print(f"Epoch {epoch+1} - Test Loss: {test_mse_loss}\n")
    siamese.train()
I’ve loaded the data this way:
response = requests.get('GSheets link')
response.encoding = 'utf-8'
train_data = pd.read_csv(io.StringIO(response.text), header=None).iloc[:,[0,1,2]]#.drop_duplicates(subset=[0, 1])
train_data[0] = train_data[0].str.lower()#.astype(str).apply(process_text)
train_data[1] = train_data[1].str.lower()#.astype(str).apply(process_text)
train_data[2] = train_data[2].astype("float32")
train_data_pairs, test_data_pairs, train_data_labels, test_data_labels = train_test_split(train_data.iloc[:, [0, 1]], train_data.iloc[:, 2], test_size=0.15)
print(f"Len of train data: {len(train_data_pairs)}")
print(f"Len of test data: {len(test_data_pairs)}")
max_length = max(len(tokenizer.encode(text)) for text in train_data[0].tolist() + train_data[1].tolist())
train_encodings1 = tokenizer(train_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_encodings2 = tokenizer(train_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_labels = torch.tensor(train_data_labels.tolist())
test_encodings1 = tokenizer(test_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_encodings2 = tokenizer(test_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_labels = torch.tensor(test_data_labels.tolist())
train_dataset = TensorDataset(train_encodings1['input_ids'], train_encodings1['attention_mask'],
                              train_encodings2['input_ids'], train_encodings2['attention_mask'],
                              train_labels)
test_dataset = TensorDataset(test_encodings1['input_ids'], test_encodings1['attention_mask'],
                             test_encodings2['input_ids'], test_encodings2['attention_mask'],
                             test_labels)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
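For completeness, here is a quick way to sanity-check the loaders (illustrative only, not part of my training script):

    # Inspect one batch from the loader: tensors should be (batch_size, max_length),
    # and labels should be a float32 vector of length batch_size
    batch = next(iter(train_loader))
    input_ids1, attention_mask1, input_ids2, attention_mask2, labels = batch
    print(input_ids1.shape, attention_mask1.shape)
    print(labels.shape, labels.dtype)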