Training loss does not decrease for a siamese model based on xlm-roberta

My problem: I’m trying to train a siamese model based on xlm-roberta-base for semantic sentence comparison. The loss on the training data fluctuates but doesn’t decrease, and the same happens on the test data. I tried changing the learning rate (1e-1, 1e-5, 1e-22). The only effect was that the test loss fluctuated with a smaller amplitude, while the train loss kept fluctuating with the same amplitude. I tried both SGD and Adam as optimizers and had the same problem with each. I have weight_decay = 0. I’ve also noticed that almost all outputs of my siamese model tend to 0.99. I can’t even overfit the model on a small training dataset. Please help me find the mistake.

My dataset consists of pairs of sentences, each with a continuous similarity coefficient in the interval from 0 to 1. I tried using a sigmoid function on the output but settled on (cos_sim + 1)/2, because the choice doesn’t noticeably affect anything.
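To make the two output mappings concrete, here is a minimal pure-Python sketch. The numbers are toy values, not my real outputs, and the helper names `rescale_cosine` and `sigmoid` are only for illustration. It assumes the sigmoid would be applied directly to the cosine value, in which case it can only reach roughly 0.27 to 0.73, while (cos_sim + 1)/2 covers the full range from 0 to 1.

```python
import math


def rescale_cosine(cos_sim: float) -> float:
    """Linearly map a cosine similarity from [-1, 1] onto [0, 1]."""
    return (cos_sim + 1) / 2


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1 / (1 + math.exp(-x))


# Toy cosine values for illustration only.
for cos_sim in (-1.0, 0.0, 0.98, 1.0):
    print(cos_sim, rescale_cosine(cos_sim), round(sigmoid(cos_sim), 4))
```

Under this mapping, a model output of 0.99 corresponds to a raw cosine similarity of about 0.98 between the two embeddings.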

Remark: I’ve also tried a different approach. I trained an ordinary xlm-roberta-base model whose input contains both tokenized sentences. I used the “xlm-roberta for sequence classification” model with one output neuron. This approach worked perfectly on my dataset, so I can say there is no problem with my data. But it is too slow for brute-force pair comparison, which is why I want to train a siamese model.
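To illustrate why the single-model approach is slow for brute-force comparison, here is a rough count of transformer forward passes. It assumes N sentences are compared all-against-all and that the siamese embeddings are computed once per sentence and then reused; the function names are hypothetical.

```python
def cross_encoder_passes(n_sentences: int) -> int:
    """Every unordered pair needs its own forward pass through the model."""
    return n_sentences * (n_sentences - 1) // 2


def siamese_passes(n_sentences: int) -> int:
    """Each sentence is encoded once; pairs are then compared by cosine similarity."""
    return n_sentences


for n in (100, 1_000, 10_000):
    print(n, cross_encoder_passes(n), siamese_passes(n))
```

The pair count grows quadratically with the number of sentences, while the number of encodings in the siamese setup grows only linearly.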
My criterion:

criterion = nn.MSELoss()
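As a reference point for reading the loss values, here is a small sketch of what the MSE looks like when a model returns nearly the same value for every pair, as my outputs do. The labels are made-up toy values, not my dataset, and `mse` is a plain-Python stand-in for nn.MSELoss() with its default mean reduction.

```python
def mse(predictions, targets):
    """Mean squared error, matching nn.MSELoss() with reduction='mean'."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy labels for illustration only (NOT the real dataset).
labels = [0.0, 0.2, 0.5, 0.8, 1.0]

# A collapsed model that outputs ~0.99 for every pair.
collapsed = mse([0.99] * len(labels), labels)

# The best any constant output can do is predict the label mean.
label_mean = sum(labels) / len(labels)
best_constant = mse([label_mean] * len(labels), labels)

print(f"MSE at constant 0.99: {collapsed:.4f}")
print(f"MSE at label mean:    {best_constant:.4f}")
```

If the training MSE stays near the value for a constant output computed on the real labels, that would suggest the model is behaving like a constant predictor rather than learning from the pairs.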

Here’s my siamese model architecture:

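import gc

import torch
import torch.nn as nn


class SiameseNetwork(nn.Module):
    def __init__(self, roberta):
        super().__init__()
        self.roberta = roberta  # shared encoder applied to both sentences
        self.cosine_similarity = nn.CosineSimilarity(dim=1)

    def forward(self, input_ids1, attention_mask1, input_ids2, attention_mask2):
        # Assuming `roberta` is a plain XLM-RoBERTa encoder, [0] is the last hidden
        # state and [:, 0] takes the first (<s>) token as the sentence embedding.
        out1 = self.roberta(input_ids=input_ids1, attention_mask=attention_mask1)[0][:, 0]
        out2 = self.roberta(input_ids=input_ids2, attention_mask=attention_mask2)[0][:, 0]

        cos_sim = self.cosine_similarity(out1, out2)
        del out1, out2
        gc.collect()
        torch.cuda.empty_cache()
        # Rescale the cosine similarity from [-1, 1] to [0, 1] to match the labels.
        return (cos_sim + 1)/2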

Here’s my training code:

siamese.train()

for epoch in range(25):
    total_mse_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids1 = batch[0].to(device)
        attention_mask1 = batch[1].to(device)
        input_ids2 = batch[2].to(device)
        attention_mask2 = batch[3].to(device)
        labels = batch[4].to(device)

        outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_mse_loss += loss.item()
        del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
        torch.cuda.empty_cache()
        gc.collect()
    avg_train_mse = total_mse_loss / len(train_loader)
    print(f"Epoch {epoch+1} - Training MSE: {avg_train_mse}")

    siamese.eval()
    test_mse_loss = 0
    for batch in test_loader:
        with torch.no_grad():
            input_ids1 = batch[0].to(device)
            attention_mask1 = batch[1].to(device)
            input_ids2 = batch[2].to(device)
            attention_mask2 = batch[3].to(device)
            labels = batch[4].to(device)

            outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
            loss = criterion(outputs, labels)

            test_mse_loss += loss.item()
            del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
            torch.cuda.empty_cache()
            gc.collect()
    test_mse_loss /= len(test_loader)
    print(f"Epoch {epoch+1} - Test Loss: {test_mse_loss}\n")
    siamese.train()

This is how I load the data:

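import io

import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

response = requests.get('GSheets link')
response.encoding = 'utf-8'
# Columns: 0 = first sentence, 1 = second sentence, 2 = similarity label
train_data = pd.read_csv(io.StringIO(response.text), header=None).iloc[:,[0,1,2]]#.drop_duplicates(subset=[0, 1])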

train_data[0] = train_data[0].str.lower()#.astype(str).apply(process_text)
train_data[1] = train_data[1].str.lower()#.astype(str).apply(process_text)
train_data[2] = train_data[2].astype("float32")

train_data_pairs, test_data_pairs, train_data_labels, test_data_labels = train_test_split(train_data.iloc[:, [0, 1]], train_data.iloc[:, 2], test_size=0.15)

print(f"Len of train data: {len(train_data_pairs)}")
print(f"Len of test data: {len(test_data_pairs)}")

max_length = max(len(tokenizer.encode(text)) for text in train_data[0].tolist() + train_data[1].tolist())
train_encodings1 = tokenizer(train_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_encodings2 = tokenizer(train_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_labels = torch.tensor(train_data_labels.tolist())

test_encodings1 = tokenizer(test_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_encodings2 = tokenizer(test_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_labels = torch.tensor(test_data_labels.tolist())

train_dataset = TensorDataset(train_encodings1['input_ids'], train_encodings1['attention_mask'],
                              train_encodings2['input_ids'], train_encodings2['attention_mask'],
                              train_labels)
test_dataset = TensorDataset(test_encodings1['input_ids'], test_encodings1['attention_mask'],
                                   test_encodings2['input_ids'], test_encodings2['attention_mask'],
                                   test_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)