Leveraging pre-trained checkpoints for summarization

@patrickvonplaten - What is the best loss you have achieved for RobertaShared or BertGPT2? In my experiments, the loss is stuck around 6 for a long time. The notebook I used is the same one you shared.

Thanks.

@patrickvonplaten - Hi. I have trained a model as per this tutorial: https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16 .
But it's not working. As I mentioned above, the loss is stuck around 6 and isn't decreasing any further. Any ideas? The model hasn't learned anything.

@patrickvonplaten - One more question. In the example from the roberta-share-cnn link, a decoder_attention_mask is created in the map_to_encoder_decoder_inputs function. My question is: if we pass decoder_attention_mask, it will be used in the self-attention calculation over the decoder inputs. Doesn't that violate the principle that the decoder self-attention mask should be a lower-triangular matrix, since it is supposed to follow causal masking?
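
(For context, a toy sketch of how a padding mask is usually combined with a lower-triangular causal mask in decoder self-attention. This is plain PyTorch, not the transformers internals, and the sequence and padding values are made up.)

import torch

seq_len = 5
# hypothetical padding mask for one decoder sequence: last two positions are padding
padding_mask = torch.tensor([[1, 1, 1, 0, 0]])            # shape (batch, seq_len)

# lower-triangular causal mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))    # shape (seq_len, seq_len)

# combined mask for decoder self-attention: a key position is visible only if it
# is not padding AND not in the future relative to the query position
combined = causal_mask.unsqueeze(0) * padding_mask.unsqueeze(1)  # (batch, seq_len, seq_len)
print(combined)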

Sorry, I have to update those model cards. Will try to do this next week. I got good results with RobertaShared. This is a better RobertaShared notebook: https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing

@s4sarath @patrickvonplaten I followed the tutorials to train an xlmr2xlmr or xlmr2share model from https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing, but the loss is stuck around 5. It works well for mbert2mbert and mbert2share. @s4sarath, how did you resolve it in your case? @patrickvonplaten, any suggestions? (@patrickvonplaten, great work on this line of models.)

I would just try playing around with the learning rate and other hyperparameters (e.g. warmup_steps) until you see a better loss curve.
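
For example, something along these lines (a sketch with made-up values; exact argument names depend on the transformers version):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    learning_rate=1e-5,              # try e.g. 1e-5, 2e-5, 5e-5
    warmup_steps=2000,               # try e.g. 500, 1000, 2000
    per_device_train_batch_size=4,
    predict_with_generate=True,
    evaluation_strategy="steps",
    eval_steps=1000,
    logging_steps=500,
)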

Sure, @patrickvonplaten, let me try. Thank you for the suggestion.

@kaushal I agree with @patrickvonplaten. These models are super sensitive to the learning rate. I would start with 1e-5.

If you want an example:

PS: I'm not trying to promote anything; it's just meant as a pointer. You can do the same thing in Hugging Face.

Thanks, @patrickvonplaten and @s4sarath, for your suggestions.

I spent a lot of time playing with hyperparameters, but nothing seems to work. The loss is still stuck in the 4-6 range.

Tasks: Abstractive text summarization (ATS) and news headline generation (NHG) in Hindi (supervised learning setting).

Models I tried:

Model         ATS  NHG
mbert2mbert   Yes  Yes
mbert2share   Yes  Yes
mbert2rand    Yes  Yes
xlmr2xlmr     No   No
xlmr2share    No   No
xlmr2rand     No   No
muril2muril   No   No
muril2share   No   No
muril2rand    No   No

Yes → the model works
No → the model does not work

Can you please suggest anything further or provide some direction? It would be helpful. Thank you!

I think it will also vary from model to model, @kaushal.
Use linear decay for the learning rate, and play around with 0.0001, 0.001, and 2e-5. Something along these lines has to work.
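
For example, a rough sketch of a linear-decay setup (the model, warmup and step counts here are placeholders, not settings taken from the notebooks):

from torch.optim import AdamW
from transformers import EncoderDecoderModel, get_linear_schedule_with_warmup

# hypothetical shared xlmr2xlmr model, in the style of the notebooks above
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base", tie_encoder_decoder=True
)

optimizer = AdamW(model.parameters(), lr=2e-5)   # also try 1e-4, 1e-3
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,       # placeholder
    num_training_steps=50000,    # placeholder: total optimisation steps
)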

But again, these models are super sensitive to the optimiser settings.
I even mentioned this as a comment in the notebooks above, a long time back, because that is just how they are.

If your task is summarization, use T5-small. Just a recommendation.
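
For example, a minimal t5-small generation sketch (the settings are only illustrative; fine-tuning would use the usual Trainer setup):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# t5 expects a task prefix for summarization
text = "summarize: " + "Your article text goes here ..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))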

Hi @patrickvonplaten,

You said: “I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!”

I'm interested in Bert2Rnd for WMT en-de. Did you finish that tutorial?

I am getting an error while training bert2bert:
CUDA: peer mapping resources exhausted
Any idea why I am getting this? Usually HF takes care of distributed training itself. I am training on 16 GPUs.

Hey, not sure if this is what you are running into, but there used to be (and as far as I know, still is) a hard limit of 8 or fewer GPUs for CUDA peer-to-peer data sharing, unless you use NCCL primitives and some kind of message-passing interface (like MPICH2, etc.). Try it with 8 GPUs and see if the error persists, then try it with 9 and see if it comes back; that's a simple test to see whether this is the issue. Good luck.
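
A quick way to run that test (a sketch; set the variable before anything initialises CUDA):

import os

# make only the first 8 GPUs visible, rerun the training, then repeat with 9
# and see whether the "peer mapping resources exhausted" error comes back
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(8))

import torch
print(torch.cuda.device_count())  # should report 8 on a 16-GPU machine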

I always have to wrap my models in PyTorch's nn.DataParallel() so that I can run roberta2roberta (shared) with a batch size of 32, because the model won't fit on one Titan XP.
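
For example, roughly like this (the roberta2roberta shared model here is just an assumption for illustration):

import torch
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True   # roberta2roberta, shared weights
)
# split each batch across all visible GPUs
model = torch.nn.DataParallel(model).cuda()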

Hi @patrickvonplaten,
In your notebook for training an EncoderDecoder model, you define a compute_metrics function and then pass it to the Trainer.

def compute_metrics(pred):
    # called by the Trainer on every evaluation run; pred is an EvalPrediction
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # decode the generated token ids back to text
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; replace it with the pad
    # token id so the labels can be decoded
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
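
(For context, the function is passed to the trainer roughly like below; this is only a sketch, and the exact class and argument names may differ from your notebook.)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    predict_with_generate=True,      # so pred.predictions are generated token ids
    evaluation_strategy="steps",
    eval_steps=1000,
)

trainer = Seq2SeqTrainer(
    model=model,                      # model, datasets and tokenizer as defined earlier in the notebook
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # called on every evaluation run
)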

My questions are:

  1. What is the purpose of this function, and when is it called?
  2. Which parameter tells the trainer to use all three metrics, or only one of them, during training?
  3. Can I omit this function if I set metric_for_best_model = 'eval_loss'?

Please help me to understand it.
Thank you.