LSTM Encoder-Decoder not working

I am trying to train an LSTM Encoder-Decoder model for paraphrase generation. My model is as follows:

StackedResidualLSTM(
  (encoder): RecurrentEncoder(
    (embed_tokens): Embedding(30522, 256)
    (dropout): Dropout(p=0.5, inplace=False)
    (rnn): LSTM(256, 256, num_layers=2, batch_first=True, dropout=0.5)
  )
  (decoder): RecurrentDecoder(
    (embed_tokens): Embedding(30522, 128)
    (dropout_in_module): Dropout(p=0.5, inplace=False)
    (dropout_out_module): Dropout(p=0.1, inplace=False)
    (layers): ModuleList(
      (0): LSTMCell(384, 256)
      (1): LSTMCell(256, 256)
    )
    (fc_out): Linear(in_features=256, out_features=30522, bias=True)
  )
)
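The first LSTMCell takes a 384-dim input, which I believe is the 128-dim decoder embedding concatenated with a 256-dim feed from the previous step (input feeding); that part is my assumption from the layer sizes. Roughly, one decoder step would look like the sketch below (simplified, with illustrative names, not my exact code):

import torch

# one decoder time step with input feeding (simplified sketch;
# 384-dim input = 128-dim embedding + 256-dim feed from the previous step)
def decoder_step(embed_t, input_feed, states, layers, fc_out, dropout_out):
    # embed_t: (batch, 128), input_feed: (batch, 256), states: list of (h, c) per layer
    x = torch.cat([embed_t, input_feed], dim=1)        # (batch, 384)
    for i, cell in enumerate(layers):                  # LSTMCell(384, 256), LSTMCell(256, 256)
        h, c = cell(x, states[i])
        states[i] = (h, c)
        x = dropout_out(h)
    logits_t = fc_out(x)                               # (batch, 30522)
    return logits_t, x, states                         # x becomes the next step's input_feed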

Below is a printout of the source sentence, the sentence fed to the decoder (shifted right), the predictions, and the true sentence (the labels). Everything is tokenized with the BERT tokenizer:

Source: [CLS] where can i get quality services in brisbane for plaster
and drywall repair? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD]

Decoder Input: [CLS] [CLS] where can i get
quality services for plaster and drywall repairs in brisbane? [SEP]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Preds:
[CLS] the? [SEP]? [SEP]? [SEP]? [SEP]? [SEP]? [SEP]? [SEP]? [SEP]?
[SEP]

Target: [CLS] where can i get quality services for plaster and
drywall repairs in brisbane? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[PAD] [PAD] [PAD] [PAD] [PAD]
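For completeness, the decoder input and labels are built from the target roughly as follows: the decoder input is the target shifted right by prepending [CLS] (which is why it starts with two [CLS] tokens above), and the labels are the target with [PAD] replaced by -100. A simplified sketch (the token ids and helper name are illustrative):

import torch

PAD_ID, CLS_ID = 0, 101   # BERT tokenizer ids for [PAD] and [CLS]

def make_decoder_batch(target):
    # target: (batch, seq_len) token ids, as in the "Target" print above
    bos = torch.full((target.size(0), 1), CLS_ID, dtype=target.dtype)
    decoder_input = torch.cat([bos, target[:, :-1]], dim=1)    # shift right
    labels = target.masked_fill(target == PAD_ID, -100)        # ignore pads in the loss
    return decoder_input, labels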

My loss function is a CrossEntropyLoss between the logits and the labels (padding tokens are replaced with -100 so they are ignored). Something like:

from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100, matching the masked labels
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))

I am seeing two problems:

  • the loss does not go down
  • the generations are identical for every entry within the same epoch (after the weights are updated, the generations may differ from those of the previous epoch, but they are again the same for every entry of the new epoch)

Do you have any idea what I might try to fix the issue? Thanks in advance for any help you can provide.