Leveraging pre-trained checkpoints for summarization

@patrickvonplaten - What is the best loss you have achieved for RobertaShared or BertGPT2? In my experiments, the loss is stuck around 6 for a long time. The notebook I used is the same one you shared.


@patrickvonplaten - Hi. I have trained a model following this tutorial: https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16 .
But it's not working. As I mentioned above, the loss is stuck around 6 and doesn't decrease any further. Any ideas? The model hasn't learned anything.

@patrickvonplaten - One more doubt. In the example given in the roberta-share-cnn link, a decoder_attention_mask is created in the map_to_encoder_decoder_inputs function. My doubt is: if we pass decoder_attention_mask, it will be used in the self-attention computation over the decoder inputs. Doesn't that violate the principle that the decoder self-attention mask should be a lower-triangular matrix, since it is supposed to follow causal masking? Right?
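To make my doubt concrete, here is how I picture the two masks combining: the decoder_attention_mask is just a padding mask, and the lower-triangular causal mask is applied on top of it inside the decoder. A tiny illustrative sketch (not the actual library internals):

import torch

seq_len = 5
# decoder_attention_mask as produced by the tokenizer: 1 = real token, 0 = padding
padding_mask = torch.tensor([1, 1, 1, 0, 0])

# causal mask: query position i may only attend to key positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# combined mask: attend only to keys that are both non-padding and not in the future
combined = causal_mask * padding_mask.unsqueeze(0)
print(combined)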

Sorry, I have to update those model cards. Will try to do this next week. I got good results with RobertaShared. This is a better RobertaShared notebook: https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing


@s4sarath @patrickvonplaten I followed the tutorials to train an xlmr2xlmr or xlmr2share model from https://colab.research.google.com/drive/1vHZHXOCFqOXIvdsF8j4WBRaAOAjAroTi?usp=sharing, but the loss is stuck around 5. It works well for mbert2mbert and mbert2share. @s4sarath how did you resolve this in your case? @patrickvonplaten any suggestions? (@patrickvonplaten great work in this line.)


I would just try to play around with the learning rate and other hyperparameters (e.g. warmup_steps) until you see a better loss curve.
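For concreteness, those knobs live in the training arguments passed to the trainer. A minimal sketch (the values are only a starting point; adjust them to your data and transformers version):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    learning_rate=5e-5,           # try values between 1e-5 and 1e-4
    warmup_steps=2000,            # a longer warmup often stabilises encoder-decoder training
    lr_scheduler_type="linear",   # linear decay after the warmup phase
    per_device_train_batch_size=16,
    fp16=True,
    logging_steps=500,
)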


Sure @patrickvonplaten let me try. Thank you for the suggestion.


@kaushal I agree with @patrickvonplaten. These models are super sensitive to the learning rate. I would start with 1e-5.

If you want an example:

PS: I'm not trying to promote anything, just pointing to something that may help. The same thing can be done in Hugging Face.


Thanks @patrickvonplaten and @s4sarath for your suggestions.

I spent a lot of time playing with hyperparameters, but nothing seems to work. The loss is still stuck in the 4-6 range.

Tasks: Abstractive text summarization (ATS) and news headline generation (NHG) in Hindi (supervised learning setting).

Models I tried:

Model        | ATS | NHG
mBERT2Share  | Yes | Yes
mBERT2Rand   | Yes | Yes
xlmr2xlmr    | No  | No
xlmr2share   | No  | No
xlmr2rand    | No  | No
muril2muril  | No  | No
muril2share  | No  | No
muril2rand   | No  | No

Yes → the model works
No → the model doesn’t work

Can you please suggest anything further or provide some directions? That would be helpful. Thank you!

I think it will also vary by model, @kaushal.
Use linear decay for the learning rate. Play around with 0.0001, 0.001, and 2e-5; something in that range has to work.

But again, these models are super sensitive to the optimiser.
I even mentioned this as a comment in the notebooks above a long time back, because this is how it is.

If your task is summarization, use T5-small. Just a recommendation.
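If you want to try that route, a minimal T5-small sketch for summarization (the checkpoint name is the standard Hub one; the article text and generation settings are placeholders):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small already has a trained decoder, so it usually converges much more
# easily than an encoder-decoder warm-started from encoder-only checkpoints.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "summarize: " + "Some long article text ..."  # T5 expects a task prefix
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))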


Hi @patrickvonplaten ,

You said: “I’m planning on making two short notebooks on Roberta2GPT2 for sentence fusion (DiscoFuse) and a Bert2Rnd for WMT en-de. Hope this is useful!”

I’m interested in Bert2Rnd for WMT en-de. Did you finish the tutorial?


I am getting an error while training bert2bert:
CUDA: peer mapping resources exhausted
Any idea why I'm getting this? Usually HF takes care of distributed training itself. I am training on 16 GPUs.

Hey, not sure if this is what you are running into, but there used to be (and as far as I know, still is) a hard limit of 8 or fewer GPUs for CUDA peer-to-peer data sharing, unless you use NCCL primitives and some kind of message-passing interface (like MPICH2, etc.). Try it with 8 GPUs and see if the error persists, then try it with 9 and see if it comes back; that's a simple test to see whether this is the issue. Good luck.
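One quick way to run that test is to limit which GPUs the process can see before CUDA is initialised. A minimal sketch (put this at the very top of your training script, before importing anything that touches CUDA):

import os

# Expose only the first 8 GPUs to this process, to check whether the
# peer-mapping error disappears below the peer-to-peer limit.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(8))

import torch
print(torch.cuda.device_count())  # should now report 8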

I always have to wrap my model in nn.DataParallel() in PyTorch to run roberta2roberta-shared with a batch size of 32, because that doesn't fit on a single TitanXP.
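For anyone looking for the wrapping itself, here is a minimal sketch (the shared encoder-decoder setup mirrors the notebooks above; checkpoint names and the rest are assumptions):

import torch
from transformers import EncoderDecoderModel

# Build a roberta2roberta model with tied (shared) encoder/decoder weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base", tie_encoder_decoder=True
)

# DataParallel replicates the model on every visible GPU and splits each
# batch across them, so a larger total batch size fits.
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
model = model.to("cuda")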

Hi @patrickvonplaten ,
In your notebook for training an EncoderDecoder model, you defined the function compute_metrics and then passed it to the Trainer.

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode the generated token ids; restore -100 (ignored positions) to the pad id before decoding the labels.
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

My questions are:

  1. What is the purpose of this function, and when will it be called?
  2. Which parameter tells the trainer to use all 3 metrics, or only one of them, during training?
  3. Can I omit this function if I set metric_for_best_model = ‘eval_loss’?

Please help me understand this.
Thank you.