mBART finetuning tips/post-mortem

I have run lots of mBART finetuning experiments and am moving on to pegasus/marian, so I wanted to share some general tips. I ran them all on en-ro because I knew what fairseq scores on that pair, since that's the only finetuned checkpoint they released.

  1. Best test BLEU I got from finetuning was 26.42; the fairseq-converted model gets 26.81.
  2. --freeze_embeds does not hurt metrics and saves lots of memory. Always use this flag.
  3. Got 26.32 on 8x V100 GPUs on master as of Aug 21 (f230a640). Took 10h32m before I killed it. Maybe I shouldn't have killed it :slight_smile:
  4. post-processing in
  5. Distillation works well, and slightly better with a teacher. Posted sshleifer/distilmbart-enro.
  6. Probably still best to use Marian, which scored 27.7/37.4 on the wmt-en-ro test set in 90 seconds, vs. 26.8/37.1 in 6 minutes for mbart-large-en-ro. The distilled 12-6 is roughly 26.1 in 3 minutes; the distilled 12-4 is 25.9 in 2 minutes. I wonder why Marian is so good. I guess pretraining is not convincingly a benefit in machine translation yet. There may also be a leak in the Marian data, but the original author Jorg Tiedemann doesn't think so. Still, these metrics made me think I should focus much less on distilling/finetuning mBART and more on supporting training MT from scratch and distilling Marian even smaller.

  7. Using --joined-dictionary in fairseq and trimming the embeddings should make training much faster, but I couldn't get sentencepiece's SetVocabulary to produce the correct/restricted vocabulary. The sentencepiece maintainers ignored my issue, so I may post something more detailed. Don't know who I'd send that to.
  8. Had a good run with --decoder_layerdrop=0.3, but subsequent distillation wasn't any faster/better. Weird.
  9. Can get a 30-40% speedup with dynamic batch sizes; might send a PR for that. It's the default in fairseq.
  10. Getting the decoder_input_ids to look exactly like fairseq's (rather than off by one) doesn't change metrics at all (AFAICT).
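On the --freeze_embeds point: the memory win comes from excluding the embedding matrices (huge for mBART's ~250k-token vocabulary) from gradient computation and optimizer state. The usual PyTorch pattern is setting requires_grad = False on those parameters. A dependency-free sketch of that pattern — the Param class here is a stand-in for torch.nn.Parameter and the parameter names are illustrative, not the exact names in transformers:

```python
class Param:
    """Stand-in for torch.nn.Parameter: just tracks a name and requires_grad."""
    def __init__(self, name):
        self.name = name
        self.requires_grad = True

def freeze_embeds(params):
    """Disable gradients for embedding parameters so the optimizer skips them."""
    for p in params:
        if "embed" in p.name or "shared" in p.name:
            p.requires_grad = False
    # Mimics passing only trainable params to the optimizer.
    return [p for p in params if p.requires_grad]

params = [
    Param("model.shared.weight"),                  # token embeddings (the big one)
    Param("model.encoder.layers.0.fc1.weight"),    # a regular transformer weight
    Param("model.decoder.embed_positions.weight"), # positional embeddings
]
trainable = freeze_embeds(params)
```

With real torch modules the loop is the same idea: iterate model.parameters() (or the embedding submodules directly) and flip requires_grad before constructing the optimizer.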
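For context on item 10: decoder_input_ids are built by shifting the labels one position to the right, so the decoder predicts token t from tokens before t; being "off by one" relative to fairseq means disagreeing about what sits in that first slot. A minimal pure-Python sketch of the generic shift (the token IDs below are made up for illustration; mBART's actual convention in transformers differs in which token it wraps to the front):

```python
def shift_tokens_right(labels, decoder_start_token_id):
    """Build decoder inputs by shifting labels one position to the right.

    The model then learns to predict labels[t] from labels[:t]; the first
    decoder input is the start token and the final label token is dropped.
    """
    return [decoder_start_token_id] + labels[:-1]

labels = [250004, 47, 1098, 2]  # e.g. lang-code, two tokens, eos (made-up IDs)
decoder_input_ids = shift_tokens_right(labels, decoder_start_token_id=2)
```

The observation in item 10 is that which exact token occupies position 0 barely moves BLEU, as long as the rest of the shift is consistent.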

Lessons learned:

  • Spent too much time trying different hparams rather than zooming out and thinking about what I wanted to accomplish in this project. We still got a lot out of it: a less memory-hungry dataset, better command line args, and support for MT in seq2seq/
  • Should have run the Marian eval earlier, before putting so much time into mBART.
  1. This is weird; even if pre-training doesn't help, mBART should perform at least as well as Marian given its larger architecture. Could multilinguality be the reason for this perf drop?

  2. Dynamic batching will be a great addition if it gives a speed-up. Can't fine-tune large models reliably on Colab. I want to get BART/mBART working on TPU with the Seq2Seq trainer; it's very hard to get interesting results with large models without huge compute.

  3. I knew it :wink:
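The dynamic batching discussed above (fairseq's default behavior, via --max-tokens) packs examples into batches under a token budget rather than using a fixed batch size, so batches of short sentences hold more examples and padding is minimized. A self-contained sketch of the idea — the budget, the lengths, and the function name are made up for illustration:

```python
def make_dynamic_batches(lengths, max_tokens):
    """Group example indices so batch_size * longest_in_batch <= max_tokens.

    Sorting by length first keeps sequences of similar size together, so
    little compute is wasted on padding; that is where the speedup comes from.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch, longest = [], [], 0
    for i in order:
        longest = max(longest, lengths[i])
        if batch and (len(batch) + 1) * longest > max_tokens:
            batches.append(batch)           # adding i would exceed the budget
            batch, longest = [], lengths[i]
        batch.append(i)
    if batch:
        batches.append(batch)
    return batches

lengths = [5, 3, 40, 8, 37, 6]              # tokens per example
batches = make_dynamic_batches(lengths, max_tokens=80)
```

Here the four short examples share one batch while the two long ones get their own, and every padded batch stays within the 80-token budget.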


Thanks a lot for the hard work on mBART. :smiley: :innocent: :heart_eyes:


Hey @valhalla and @sshleifer, I am working on the problem of comment generation for YouTube videos using mBART.
Title: Tips for Men Grooming (the input would be the video title)
Comment: बोहोत सही! I love it, Nice tips (a mixed-language comment; "बोहोत सही" is Hindi for "very right")
Can mBART be fine-tuned for such a task, provided I have a large dataset?

Hi @Parth

We (Sam and I) haven't got any interesting results for mBART yet, so I won't be able to answer concretely; I think you'll need to try it for yourself. Also, this is a mixed-language problem, so some experimentation is needed to see whether such multilingual models work for mixed languages with just fine-tuning.


Hey @valhalla, can we fine-tune mBART on Colab for such a task? Is there a sample script for fine-tuning mBART?

Yes! Refer to examples/seq2seq.

It supports fine-tuning seq2seq models in the library; the README has all the details.
It supports both pytorch-lightning and the native Trainer.