EncoderDecoder LM output is perfect ... except that the ending is missing or duplicated

Hello everyone,

I have a problem that seems too strange not to have a simple explanation, yet I just can't find it. Could you help me?

I am creating this model:

  • Inputs are phonemes (e.g. b§ZuR)
  • Outputs are proper sentences in medical French (e.g. Bonjour)
  • The model is an EncoderDecoderModel with a BERT trained from scratch as the encoder and a BertLMHeadModel pretrained from ‘Geotrend/bert-base-fr-cased’ as the decoder (a rough construction sketch follows below)
  • I am fine-tuning on a medical French text corpus, phonetized with a phonemic dictionary to obtain the inputs

The encoder attention masks are non-causal and only mask the padding tokens.
The decoder masks are the usual causal (pyramidal) masks.
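
Here is roughly how the model is put together (a sketch, not my exact code: the encoder config values, the phoneme ids and the variable names are placeholders):

```python
import torch
from transformers import BertConfig, BertModel, BertLMHeadModel, EncoderDecoderModel

# Encoder: a BERT trained from scratch over the phoneme vocabulary
# (vocab size and dimensions here are placeholders, not my real settings)
encoder_config = BertConfig(vocab_size=200, hidden_size=256,
                            num_hidden_layers=6, num_attention_heads=4)
encoder = BertModel(encoder_config)

# Decoder: pretrained French BERT with an LM head, used as a decoder
# with cross-attention added
decoder = BertLMHeadModel.from_pretrained(
    "Geotrend/bert-base-fr-cased",
    is_decoder=True,
    add_cross_attention=True,
)

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)

# Encoder attention mask: non-causal, it only hides the padding tokens.
# The causal (pyramidal) mask on the decoder side is applied internally.
phoneme_pad_id = 0  # placeholder
input_values = torch.tensor([[5, 12, 7, phoneme_pad_id, phoneme_pad_id]])
attention_mask = (input_values != phoneme_pad_id).long()
```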

Output is obtained with:

```python
generated_ids = model.generate(
    input_values,
    decoder_start_token_id=0,
    eos_token_id=2,
    pad_token_id=1,
    num_beams=5,
    early_stopping=True,
    max_length=100,
)[0]
```
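
The predictions shown below are that output decoded with the decoder-side tokenizer, keeping the special tokens visible (again a sketch; `fr_tokenizer` is just my name for the tokenizer, which I assume matches the pretrained decoder checkpoint):

```python
from transformers import AutoTokenizer

# Decoder-side tokenizer (assumption: the one shipped with the decoder checkpoint)
fr_tokenizer = AutoTokenizer.from_pretrained("Geotrend/bert-base-fr-cased")

# Keep special tokens so that [PAD], [SEP] and [unused2] stay visible
prediction = fr_tokenizer.decode(generated_ids, skip_special_tokens=False)
```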

Training goes very well, and I get a rather good cross-entropy loss of 0.17 at the end. However, the end of the generated output is most often missing, and sometimes duplicated or random.
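
For completeness, the fine-tuning step is essentially the standard EncoderDecoderModel forward pass with labels, roughly like this (a simplified sketch, not my exact code; the target sentence is just one line from the corpus, and the special-token ids set on the config mirror the ones I pass to generate()):

```python
# Sketch of one fine-tuning step, reusing model / input_values / attention_mask /
# fr_tokenizer from the snippets above.
model.config.decoder_start_token_id = 0  # same ids as in generate()
model.config.pad_token_id = 1

target = fr_tokenizer(["suspicion de lipome du cordon droit"],
                      return_tensors="pt", padding=True)
labels = target["input_ids"].clone()
labels[labels == fr_tokenizer.pad_token_id] = -100  # ignore padding in the loss

outputs = model(input_ids=input_values,   # phoneme ids
                attention_mask=attention_mask,
                labels=labels)            # decoder_input_ids are derived from labels
loss = outputs.loss                       # the cross-entropy that ends around 0.17
loss.backward()
```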

Examples:

Expected: suspicion de lipome du cordon droit
Predicted: [PAD] suspicion de lipome du cordon droit de lipome du cordon droit droit [SEP] [unused2]

Expected: pas de phénomènes inflammatoires muqueux significatifs
Predicted: [PAD] pas de phénomène inflammatoire [SEP] [unused2]

Expected: pincement plus marqué du disque intervertébral , mais inchangé par rapport à l' irm précédente
Predicted: [PAD] pincement plus marqué du disque intervertébral, mais inchangé par rapport à

Expected: radiographie bassin face debout et hanche droite
Predicted: [PAD] radiographie bassin face debout [SEP] [unused2]

Expected: homogène de type stéatosique , sans lésion focale notable
Predicted: [PAD] homogène de type stéatosique, sans lésion focale notable, sans lésion focale notable, [SEP] [unused2]

Do you have any idea what could be going wrong?