Training loss is not decreasing on a Siamese model based on xlm-roberta

My problem: I’m trying to train a Siamese model based on xlm-roberta-base for semantic sentence comparison. I’ve noticed that the loss on the training data fluctuates but doesn’t decrease, and the same happens on the test data. I tried changing the learning rate: 1e-1, 1e-5, 1e-22. The only effect I saw was that the test loss fluctuated with a smaller amplitude, while the train loss fluctuated with the same amplitude. I tried optimizers such as SGD and Adam and had the same problem with both, with weight_decay = 0. I’ve also noticed that almost all values in the outputs of my Siamese model tend to 0.99. I can’t even overfit the model on a small training dataset. Please help me find the mistake.
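
For concreteness, these are the kinds of optimizer configurations I tried (a sketch; `siamese` is the model defined below):

import torch

# learning rates tried: 1e-1, 1e-5, 1e-22; same behaviour with both optimizers
optimizer = torch.optim.Adam(siamese.parameters(), lr=1e-5, weight_decay=0)
# optimizer = torch.optim.SGD(siamese.parameters(), lr=1e-5, weight_decay=0)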

My dataset consists of pairs of sentences and a continuous similarity coefficient in the interval [0, 1]. I tried using a sigmoid on the cosine similarity but settled on (cos_sim + 1)/2, since globally this choice doesn’t affect anything.
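
For reference, both mappings send the cosine range [-1, 1] monotonically into [0, 1], which is why the choice barely matters; a quick check:

import torch

cos_sim = torch.tensor([-1.0, 0.0, 1.0])
print((cos_sim + 1) / 2)       # tensor([0.0000, 0.5000, 1.0000])
print(torch.sigmoid(cos_sim))  # ~tensor([0.2689, 0.5000, 0.7311]), a narrower band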

Remark: I’ve tried a different approach: I trained an ordinary xlm-roberta-base model where the input contains both tokenized sentences, using XLMRobertaForSequenceClassification with a single output neuron. This approach worked perfectly on my dataset, so I can say there is no problem with my data. But it is too slow for brute-force pairwise comparison, which is why I want to train a Siamese model. The cross-encoder setup is sketched below.
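
This is roughly what the working cross-encoder looked like (a sketch from memory; the exact config may have differed):

from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=1)

# both sentences are packed into one input; the single logit is the similarity score
enc = tokenizer('first sentence', 'second sentence', return_tensors='pt')
score = model(**enc).logits.squeeze(-1)
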
My criterion:

criterion = nn.MSELoss()

Here’s my Siamese model’s architecture:

import gc

import torch
import torch.nn as nn


class SiameseNetwork(nn.Module):
    def __init__(self, roberta):
        super().__init__()
        self.roberta = roberta
        self.cosine_similarity = nn.CosineSimilarity(dim=1)

    def forward(self, input_ids1, attention_mask1, input_ids2, attention_mask2):
        # embed each sentence with the shared encoder and take the <s> (CLS) token embedding
        out1 = self.roberta(input_ids=input_ids1, attention_mask=attention_mask1)[0][:, 0]
        out2 = self.roberta(input_ids=input_ids2, attention_mask=attention_mask2)[0][:, 0]

        cos_sim = self.cosine_similarity(out1, out2)
        # free the embeddings eagerly (cos_sim still holds the autograd graph)
        del out1, out2
        gc.collect()
        torch.cuda.empty_cache()
        # map cosine similarity from [-1, 1] to [0, 1]
        return (cos_sim + 1) / 2
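
For completeness, the model is instantiated roughly like this (a sketch; this part wasn’t in my original snippet):

import torch
from transformers import XLMRobertaModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
backbone = XLMRobertaModel.from_pretrained('xlm-roberta-base')
siamese = SiameseNetwork(backbone).to(device)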

Here’s my training code:

siamese.train()

for epoch in range(25):
  total_mse_loss = 0
  for batch in train_loader:
    optimizer.zero_grad()
    input_ids1 = batch[0].to(device)
    attention_mask1 = batch[1].to(device)
    input_ids2 = batch[2].to(device)
    attention_mask2 = batch[3].to(device)
    labels = batch[4].to(device)

    outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    total_mse_loss += loss.item()
    # release the batch tensors to keep GPU memory usage flat
    del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
    torch.cuda.empty_cache()
    gc.collect()
  avg_train_mse = total_mse_loss / len(train_loader)
  print(f"Epoch {epoch+1} - Training MSE: {avg_train_mse}")

  # evaluation pass on the held-out test set
  siamese.eval()
  test_mse_loss = 0
  for batch in test_loader:
    with torch.no_grad():
        input_ids1 = batch[0].to(device)
        attention_mask1 = batch[1].to(device)
        input_ids2 = batch[2].to(device)
        attention_mask2 = batch[3].to(device)
        labels = batch[4].to(device)

        outputs = siamese(input_ids1, attention_mask1, input_ids2, attention_mask2)
        loss = criterion(outputs, labels)

        test_mse_loss += loss.item()
        del input_ids1, input_ids2, attention_mask1, attention_mask2, labels, outputs
        torch.cuda.empty_cache()
        gc.collect()
  test_mse_loss /= len(test_loader)
  print(f"Epoch {epoch+1} - Test Loss: {test_mse_loss}\n")
  siamese.train()  # back to training mode for the next epoch

I’ve loaded the data this way:

import io

import pandas as pd
import requests
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
from transformers import XLMRobertaTokenizer

# tokenizer matching the xlm-roberta-base backbone
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

response = requests.get('GSheets link')
response.encoding = 'utf-8'
# columns: sentence 1, sentence 2, similarity label in [0, 1]
train_data = pd.read_csv(io.StringIO(response.text), header=None).iloc[:,[0,1,2]]#.drop_duplicates(subset=[0, 1])

train_data[0] = train_data[0].str.lower()#.astype(str).apply(process_text)
train_data[1] = train_data[1].str.lower()#.astype(str).apply(process_text)
train_data[2] = train_data[2].astype("float32")

train_data_pairs, test_data_pairs, train_data_labels, test_data_labels = train_test_split(train_data.iloc[:, [0, 1]], train_data.iloc[:, 2], test_size=0.15)

print(f"Len of train data: {len(train_data_pairs)}")
print(f"Len of test data: {len(test_data_pairs)}")

# pad everything to the longest tokenized sentence across the full dataset
max_length = max(len(tokenizer.encode(text)) for text in train_data[0].tolist() + train_data[1].tolist())
train_encodings1 = tokenizer(train_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_encodings2 = tokenizer(train_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
train_labels = torch.tensor(train_data_labels.tolist())

test_encodings1 = tokenizer(test_data_pairs[0].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_encodings2 = tokenizer(test_data_pairs[1].tolist(), truncation=True, padding='max_length', max_length = max_length, add_special_tokens = True, return_tensors = "pt")
test_labels = torch.tensor(test_data_labels.tolist())

train_dataset = TensorDataset(train_encodings1['input_ids'], train_encodings1['attention_mask'],
                              train_encodings2['input_ids'], train_encodings2['attention_mask'],
                              train_labels)
test_dataset = TensorDataset(test_encodings1['input_ids'], test_encodings1['attention_mask'],
                                   test_encodings2['input_ids'], test_encodings2['attention_mask'],
                                   test_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
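
In case it helps, a quick sanity check on one batch (a sketch, not part of my original pipeline) confirms that shapes and dtypes line up with forward():

batch = next(iter(train_loader))
input_ids1, attention_mask1, input_ids2, attention_mask2, labels = batch
print(input_ids1.shape)            # torch.Size([32, max_length])
print(labels.shape, labels.dtype)  # torch.Size([32]), torch.float32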

@WpythonW have you found the answer to your question?

This is very similar to what I faced a long time ago.
BERT or RoBERTa (base or large) models worked fine for sentence-embedding contrastive learning (similar to your Siamese model) and converged quickly.
But when I switched to XLM-RoBERTa, the training loss wouldn’t go down in the early epochs; only after more epochs did it start decreasing.

If I may speculate: comparing XLM-R with the RoBERTa-large model, both are RoBERTa-style models (no NSP loss), so I don’t think NSP has anything to do with it. XLM-R, however, was pretrained on multilingual data, and I guess that might be relevant to the slow convergence.
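
To illustrate: in my runs, the only thing that changed between the fast- and slow-converging setups was the checkpoint (a sketch; the checkpoint names are the standard Hugging Face ones):

from transformers import AutoModel, AutoTokenizer

# 'roberta-large' converged quickly for me; 'xlm-roberta-base' was slow in the early epochs
name = 'xlm-roberta-base'  # vs. 'roberta-large'
tokenizer = AutoTokenizer.from_pretrained(name)
backbone = AutoModel.from_pretrained(name)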