mBART finetuning tips/post-mortem

I have run lots of mBART finetuning experiments and am moving on to Pegasus/Marian, so I wanted to share some general tips. I ran them all on en-ro because I knew what fairseq scores on that pair, since that's the only finetuned checkpoint they released.

  1. Best test BLEU I got from finetuning was 26.42; the fairseq-converted model gets 26.81.
  2. --freeze_embeds does not hurt metrics and saves lots of memory. Always use it (see the sketch after this list).
  3. Got 26.32 on 8x V100 GPUs on master on Aug 21 (f230a640). It took 10h32m before I killed it. Maybe I shouldn't have killed it :slight_smile:
  4. Post-processing steps are documented in romanian_postprocessing.md.
  5. Distillation works well, and slightly better with a teacher. Posted sshleifer/distilmbart-enro.
  6. It's probably still best to use Marian, which scores 27.7/37.4 on the wmt-en-ro test set in 90 seconds, vs. mbart-large-en-ro at 26.8/37.1 in 6 minutes. The distilled 12-6 is roughly 26.1 in 3 minutes; the distilled 12-4 is 25.9 in 2 minutes. I wonder why Marian is so good. I guess pretraining is not yet a convincing benefit in machine translation. There may also be a leak in the Marian data, but the original author Jörg Tiedemann doesn't think so. Still, these metrics made me think I should focus much less on distilling/finetuning mBART and more on supporting training MT from scratch and distilling Marian even smaller.
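
For context on point 2, this is roughly what --freeze_embeds does; a minimal sketch, and the actual code in examples/seq2seq/finetune.py may differ slightly:

```python
# Rough sketch of --freeze_embeds: stop gradients (and optimizer state)
# for the token and positional embeddings of an encoder-decoder model.
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

def freeze_embeds(model):
    """Freeze the shared token embeddings and both positional embeddings."""
    for module in (
        model.model.shared,
        model.model.encoder.embed_positions,
        model.model.decoder.embed_positions,
    ):
        for p in module.parameters():
            p.requires_grad = False

freeze_embeds(model)
# Trainable parameter count drops noticeably, which is where the memory saving comes from.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```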

Unsolved:
7. Using --joined-dictionary in fairseq and trimming the embeddings should make training much faster, but I couldn't get sentencepiece's SetVocabulary to produce the correct restricted vocabulary. The sentencepiece maintainers are ignoring my issue, so I may post something more detailed. I don't know who I'd send that to.
8. Had a good run with --decoder_layerdrop=0.3, but subsequent distillation wasn't any faster/better. Weird.
9. Can get a 30-40% speedup with dynamic batch sizes; I might send a PR for that. It's the default in fairseq (sketch after this list).
10. Getting the decoder_input_ids to look exactly like fairseq (rather than off by 1) doesn’t change metrics at all (afaict).
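
On the dynamic batch size point, the idea is the same as fairseq's --max-tokens: cap the number of (padded) tokens per batch rather than the number of sentences. A minimal sketch of how examples could be grouped; the names here are mine, not whatever the eventual PR would look like:

```python
# Group example indices so each batch holds roughly `max_tokens` padded source tokens.
def make_dynamic_batches(lengths, max_tokens=4096):
    """lengths: source length per example. Returns a list of index lists."""
    # Sort by length so padding waste inside each batch stays small.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch, batch_max = [], [], 0
    for idx in order:
        batch_max = max(batch_max, lengths[idx])
        # Padded size if we added this example to the current batch.
        if batch and batch_max * (len(batch) + 1) > max_tokens:
            batches.append(batch)
            batch, batch_max = [], lengths[idx]
        batch.append(idx)
    if batch:
        batches.append(batch)
    return batches

# Example: short sentences end up in large batches, long ones in small batches.
print(make_dynamic_batches([5, 7, 200, 9, 180, 6], max_tokens=400))
```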

Lessons learned:

  • I spent too much time trying different hparams rather than zooming out and thinking about what I wanted to accomplish in this project. We still got a lot out of it: a less memory-hungry dataset, better command-line args, and support for MT in seq2seq/finetune.py.
  • I should have run the Marian eval earlier, before I put so much time into mBART.
  1. This is weird: even if pre-training doesn't help, it should perform at least as well as Marian, or better, given the larger architecture. Could multilinguality be the reason for this perf drop?

  2. Dynamic batching will be a great addition if it gives a speed-up. I can't fine-tune large models reliably on Colab. I want to get BART/mBART working on TPU with the Seq2Seq trainer; it's very hard to get interesting results with large models without huge compute.

  3. I knew it :wink:


Thanks a lot for your hard work on mBART. :smiley: :innocent: :heart_eyes:


Hey @valhalla and @sshleifer, I am working on the problem of comment generation for YouTube videos using mBART.
E.g.:
Title: Tips for Men Grooming (the input would be the video title)
Comment: बोहोत सही! ("very nice!") I love it, Nice tips (a mixed-language comment)
Can mBART be finetuned for such a task, provided I have a large dataset?

Hi @Parth

We (Sam and I) haven't got any interesting results for mBART yet, so I won't be able to answer concretely; I think you'll need to try it for yourself. Also, this is a mixed-language problem, so some experimentation is needed to see whether such multilingual models work for mixed languages with just fine-tuning.
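
If you do try it, here is a minimal sketch of a single (title → comment) training pair with MBartTokenizer. The language codes are a guess for mixed Hindi/English content (mBART only takes one code per side), and the exact tokenizer API depends on your transformers version:

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

tok = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="hi_IN"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# One (title -> comment) pair; the target mixes Hindi and English,
# but mBART still needs a single target language code (hi_IN here).
batch = tok(
    ["Tips for Men Grooming"],
    text_target=["बोहोत सही! I love it, Nice tips"],
    return_tensors="pt",
    padding=True,
)
loss = model(**batch).loss  # fine-tune by minimizing this over your dataset
```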


Hey @valhalla, can we finetune mBART on Colab for such a task? Is there a sample script for fine-tuning mBART?

Yes! Refer to examples/seq2seq: https://github.com/huggingface/transformers/tree/master/examples/seq2seq

It supports fine-tuning seq2seq models in the library; the README has all the details.
It supports both pytorch-lightning and the native Trainer.
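
Roughly, the script expects a --data_dir of plain-text files with one example per line and aligned line numbers between source and target files (double-check the README for the current details):

```
data_dir/
    train.source    # one input per line (e.g. the video title)
    train.target    # the corresponding output, same line number
    val.source
    val.target
    test.source
    test.target
```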