Leveraging pre-trained checkpoints for summarization

@patrickvonplaten - What is the best loss you have achieved for RobertaShared or BertGPT2? In my experiments, the loss is stuck around 6 for a long time. The notebook I used is the same one you shared.

Thanks.

@patrickvonplaten - Hi. I have trained a model as per this tutorial: https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16 .
But it's not working. As I mentioned above, the loss is stuck around 6 and isn't decreasing any further. Any ideas? The model hasn't learned anything.

@patrickvonplaten - One more question. In the example from the roberta-share-cnn link, a decoder_attention_mask is created in the map_to_encoder_decoder_inputs function. My question is: if we pass decoder_attention_mask, it will be used in the self-attention calculation over the decoder inputs. Doesn't that violate the principle that the decoder self-attention mask should be a lower-triangular matrix, since it is supposed to follow causal masking?
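
(For context, a toy sketch of how a padding mask is usually combined with a lower-triangular causal mask in decoder self-attention. This is plain PyTorch, not the transformers internals, and the sequence and padding values are made up.)

import torch

seq_len = 5
# hypothetical padding mask for one decoder sequence: last two positions are padding
padding_mask = torch.tensor([[1, 1, 1, 0, 0]])            # shape (batch, seq_len)

# lower-triangular causal mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))    # shape (seq_len, seq_len)

# combined mask for decoder self-attention: a key position is visible only if it
# is not padding AND not in the future relative to the query position
combined = causal_mask.unsqueeze(0) * padding_mask.unsqueeze(1)  # (batch, seq_len, seq_len)
print(combined)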

Sorry, I have to update those model cards. Will try to do this next week. I got good results with RobertaShared. This is a better RobertaShared notebook: https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing

@s4sarath @patrickvonplaten I followed the tutorials to train an xlmr2xlmr or xlmr2share model from https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing, but the loss is stuck around 5. It works well for mbert2mbert and mbert2share. @s4sarath, how did you resolve it in your case? @patrickvonplaten, any suggestions? (@patrickvonplaten, great work on this line of models.)

I would just try playing around with the learning rate and other hyperparameters (e.g. warmup_steps) until you see a better loss curve.
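
For example, something along these lines (a sketch with made-up values; exact argument names depend on the transformers version):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    learning_rate=1e-5,              # try e.g. 1e-5, 2e-5, 5e-5
    warmup_steps=2000,               # try e.g. 500, 1000, 2000
    per_device_train_batch_size=4,
    predict_with_generate=True,
    evaluation_strategy="steps",
    eval_steps=1000,
    logging_steps=500,
)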

Sure, @patrickvonplaten, let me try. Thank you for the suggestion.

@kaushal I agree with @patrickvonplaten. These models are super sensitive to the learning rate. I would start with 1e-5.

If you want an example:

PS: I'm not trying to promote anything; it's just meant as a pointer. You can do the same thing in Hugging Face.

Thanks, @patrickvonplaten and @s4sarath, for your suggestions.

I spent a lot of time playing with hyperparameters, but nothing seems to work. The loss is still stuck in the 4-6 range.

Tasks: Abstractive text summarization (ATS) and news headline generation (NHG) in Hindi (supervised learning setting).

Models I tried:

Model         ATS  NHG
mbert2mbert   Yes  Yes
mbert2share   Yes  Yes
mbert2rand    Yes  Yes
xlmr2xlmr     No   No
xlmr2share    No   No
xlmr2rand     No   No
muril2muril   No   No
muril2share   No   No
muril2rand    No   No

Yes → the model works
No → the model does not work

Can you please suggest anything further or provide some direction? It would be helpful. Thank you!

I think it will also vary from model to model, @kaushal.
Use linear decay for the learning rate, and play around with 0.0001, 0.001, and 2e-5. Something along these lines has to work.
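
For example, a rough sketch of a linear-decay setup (the model, warmup and step counts here are placeholders, not settings taken from the notebooks):

from torch.optim import AdamW
from transformers import EncoderDecoderModel, get_linear_schedule_with_warmup

# hypothetical shared xlmr2xlmr model, in the style of the notebooks above
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base", tie_encoder_decoder=True
)

optimizer = AdamW(model.parameters(), lr=2e-5)   # also try 1e-4, 1e-3
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,       # placeholder
    num_training_steps=50000,    # placeholder: total optimisation steps
)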

But again, these models are super sensitive to the optimiser settings.
I even mentioned this as a comment in the notebooks above, a long time back, because that is just how they are.

If your task is summarization, use T5-small. Just a recommendation.
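
For example, a minimal t5-small generation sketch (the settings are only illustrative; fine-tuning would use the usual Trainer setup):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# t5 expects a task prefix for summarization
text = "summarize: " + "Your article text goes here ..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))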

Hi @patrickvonplaten,

You said: “I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!”

I'm interested in Bert2Rnd for WMT en-de. Did you finish that tutorial?

I am getting an error while training bert2bert:
CUDA: peer mapping resources exhausted
Any idea why I am getting this? Usually HF takes care of distributed training itself. I am training on 16 GPUs.

Hey, not sure if this is what you are running into, but there used to be (and as far as I know, still is) a hard limit of 8 or fewer GPUs for CUDA peer-to-peer data sharing, unless you use NCCL primitives and some kind of message-passing interface (like MPICH2, etc.). Try it with 8 GPUs and see if the error persists, then try it with 9 and see if it comes back; that's a simple test to see whether this is the issue. Good luck.
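
A quick way to run that test (a sketch; set the variable before anything initialises CUDA):

import os

# make only the first 8 GPUs visible, rerun the training, then repeat with 9
# and see whether the "peer mapping resources exhausted" error comes back
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(8))

import torch
print(torch.cuda.device_count())  # should report 8 on a 16-GPU machine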

I always have to wrap my models in PyTorch's nn.DataParallel() so that I can run roberta2roberta (shared) with a batch size of 32, because the model won't fit on one Titan XP.
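
For example, roughly like this (the roberta2roberta shared model here is just an assumption for illustration):

import torch
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True   # roberta2roberta, shared weights
)
# split each batch across all visible GPUs
model = torch.nn.DataParallel(model).cuda()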

Hi @patrickvonplaten,
In your notebook for training an EncoderDecoder model, you define a compute_metrics function and then pass it to the Trainer.

def compute_metrics(pred):
    # called by the Trainer on every evaluation run; pred is an EvalPrediction
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # decode the generated token ids back to text
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; replace it with the pad
    # token id so the labels can be decoded
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
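
(For context, the function is passed to the trainer roughly like below; this is only a sketch, and the exact class and argument names may differ from your notebook.)

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    predict_with_generate=True,      # so pred.predictions are generated token ids
    evaluation_strategy="steps",
    eval_steps=1000,
)

trainer = Seq2SeqTrainer(
    model=model,                      # model, datasets and tokenizer as defined earlier in the notebook
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # called on every evaluation run
)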

My questions are:

  1. What is the purpose of this function, and when is it called?
  2. Which parameter tells the trainer to use all three metrics, or only one of them, during training?
  3. Can I omit this function if I set metric_for_best_model = 'eval_loss'?

Please help me to understand it.
Thank you.