Leveraging pre-trained checkpoints for summarization

@patrickvonplaten - What is the best loss you have achieved for RobertaShared or BertGPT2? In my experiments, the loss is stuck around 6 for a long time. The notebook I used is the same one you shared.


@patrickvonplaten - Hi. I have trained a model following this tutorial: https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16 .
But it's not working. As I mentioned above, the loss is stuck around 6 and doesn't decrease any further. Any ideas? The model hasn't learned anything.

@patrickvonplaten - One more doubt. In the example given in the roberta-share-cnn link, a decoder_attention_mask is created in the map_to_encoder_decoder_inputs function. My doubt is: if we pass decoder_attention_mask, it will be used in the self-attention computation over the decoder inputs. Doesn't that violate the principle that the decoder self-attention mask should be a lower-triangular matrix, since it is supposed to follow causal masking? Right?
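To make my doubt concrete, here is how I picture the two masks combining: the decoder_attention_mask is just a padding mask, and the lower-triangular causal mask is applied on top of it inside the decoder. A tiny illustrative sketch (not the actual library internals):

import torch

seq_len = 5
# decoder_attention_mask as produced by the tokenizer: 1 = real token, 0 = padding
padding_mask = torch.tensor([1, 1, 1, 0, 0])

# causal mask: query position i may only attend to key positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# combined mask: attend only to keys that are both non-padding and not in the future
combined = causal_mask * padding_mask.unsqueeze(0)
print(combined)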

Sorry, I have to update those model cards. Will try to do this next week. I got good results with RobertaShared. This is a better RobertaShared notebook: https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing


@s4sarath @patrickvonplaten I followed the tutorials to train an xlmr2xlmr or xlmr2share model from https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing, but the loss is stuck around 5. It works well for mbert2mbert and mbert2share. @s4sarath how did you resolve this in your case? @patrickvonplaten any suggestions? (@patrickvonplaten great work in this line.)


I would just try to play around with the learning rate and other hyperparameters (e.g. warmup_steps) until you see a better loss curve.
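For concreteness, those knobs live in the training arguments passed to the trainer. A minimal sketch (the values are only a starting point; adjust them to your data and transformers version):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    learning_rate=5e-5,           # try values between 1e-5 and 1e-4
    warmup_steps=2000,            # a longer warmup often stabilises encoder-decoder training
    lr_scheduler_type="linear",   # linear decay after the warmup phase
    per_device_train_batch_size=16,
    fp16=True,
    logging_steps=500,
)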


Sure @patrickvonplaten let me try. Thank you for the suggestion.


@kaushal I agree with @patrickvonplaten. These models are super sensitive to the learning rate. I would start with 1e-5.

If you want an example:

PS: I'm not trying to promote anything, just pointing to something that may help. The same thing can be done in Hugging Face.


Thanks @patrickvonplaten and @s4sarath for your suggestions.

I spent a lot of time playing with hyperparameters, but nothing seems to work. The loss is still stuck in the 4-6 range.

Tasks: Abstractive text summarization (ATS) and news headline generation (NHG) in Hindi (supervised learning setting).

Models I tried:

Model        | ATS | NHG
mBERT2Share  | Yes | Yes
mBERT2Rand   | Yes | Yes
xlmr2xlmr    | No  | No
xlmr2share   | No  | No
xlmr2rand    | No  | No
muril2muril  | No  | No
muril2share  | No  | No
muril2rand   | No  | No

Yes → the model works
No → the model doesn’t work

Can you please suggest anything further or provide some directions? That would be helpful. Thank you!

I think it will also vary by model, @kaushal.
Use linear decay for the learning rate. Play around with 0.0001, 0.001, and 2e-5; something in that range has to work.

But again, these models are super sensitive to the optimiser.
I even mentioned this as a comment in the notebooks above a long time back, because this is how it is.

If your task is summarization, use T5-small. Just a recommendation.
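If you want to try that route, a minimal T5-small sketch for summarization (the checkpoint name is the standard Hub one; the article text and generation settings are placeholders):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small already has a trained decoder, so it usually converges much more
# easily than an encoder-decoder warm-started from encoder-only checkpoints.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "summarize: " + "Some long article text ..."  # T5 expects a task prefix
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))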


Hi @patrickvonplaten ,

You said: “I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!”

I’m interested in Bert2Rnd for WMT en-de. Did you finish the tutorial?


I am getting an error while training bert2bert:
CUDA: peer mapping resources exhausted
Any idea why I'm getting this? Usually HF takes care of distributed training itself. I am training on 16 GPUs.

Hey, not sure if this is what you are running into, but there used to be (and as far as I know, still is) a hard limit of 8 or fewer GPUs for CUDA peer-to-peer data sharing, unless you use NCCL primitives and some kind of message-passing interface (like MPICH2, etc.). Try it with 8 GPUs and see if the error persists, then try it with 9 and see if it comes back; that's a simple test to see whether this is the issue. Good luck.
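One quick way to run that test is to limit which GPUs the process can see before CUDA is initialised. A minimal sketch (put this at the very top of your training script, before importing anything that touches CUDA):

import os

# Expose only the first 8 GPUs to this process, to check whether the
# peer-mapping error disappears below the peer-to-peer limit.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(8))

import torch
print(torch.cuda.device_count())  # should now report 8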

I always have to wrap my model in nn.DataParallel() in PyTorch to run roberta2roberta-shared with a batch size of 32, because that doesn't fit on a single TitanXP.
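For anyone looking for the wrapping itself, here is a minimal sketch (the shared encoder-decoder setup mirrors the notebooks above; checkpoint names and the rest are assumptions):

import torch
from transformers import EncoderDecoderModel

# Build a roberta2roberta model with tied (shared) encoder/decoder weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True
)

# DataParallel replicates the model on every visible GPU and splits each
# batch across them, so a larger total batch size fits.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.to("cuda")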

Hi @patrickvonplaten ,
In your notebook for training an EncoderDecoder model, you defined the function compute_metrics and then passed it to the Trainer.

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode the generated token ids; restore -100 (ignored positions) to the pad id before decoding the labels.
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

My questions are:

  1. What is the purpose of this function, and when will it be called?
  2. Which parameter tells the trainer to use all 3 metrics, or only one of them, during training?
  3. Can I omit this function if I set metric_for_best_model = ‘eval_loss’?

Please help me understand this.
Thank you.