I have run lots of mbart finetuning experiments and am moving on to pegasus/marian, so I wanted to share some general tips. I ran them all on en-ro because I knew what fairseq scores on that pair, since that's the only finetuned checkpoint they released.
- Best test BLEU I got from finetuning was 26.42. The fairseq-converted model gets 26.81.
- `--freeze_embeds` does not hurt metrics and saves lots of memory. Always use this.
- Got 26.32 on 8x V100 GPUs on master on Aug 21 (`f230a640`). Took 10h32min before I killed it. Maybe I shouldn't have killed it.
- Post-processing is in `romanian_postprocessing.md`.
- Distillation works well (slightly better with a teacher). Posted `sshleifer/distilmbart-enro`.
- Probably still best to use Marian, which scored 27.7/37.4 on the wmt-en-ro test set in 90 seconds, vs. 26.8/37.1 in 6 minutes for mbart-large-en-ro. The distilled 12-6 gets roughly 26.1 in 3 minutes; the distilled 12-4 gets 25.9 in 2 minutes. I wonder why Marian is so good; I guess pretraining is not convincingly a benefit in machine translation yet. There may also be a leak in the Marian data, but the original author Jörg Tiedemann doesn't think so. Still, these metrics made me think I should focus much less on distilling/finetuning mbart and more on supporting training MT from scratch and distilling Marian even smaller.
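The `--freeze_embeds` trick above is easy to replicate outside the finetuning script. A minimal PyTorch sketch (the helper name and the toy model are mine, not the actual flag's implementation):

```python
from torch import nn

def freeze_embeds(model: nn.Module) -> None:
    """Set requires_grad=False on every embedding table so the optimizer
    skips them. For mbart-sized vocabularies this saves a lot of gradient
    and optimizer-state memory."""
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            for param in module.parameters():
                param.requires_grad = False

# tiny stand-in model just to show the effect
model = nn.Sequential(nn.Embedding(1000, 16), nn.Linear(16, 1000))
freeze_embeds(model)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
# only the Linear's weight/bias remain trainable
```

Since the frozen parameters receive no gradients, you can also filter them out of the optimizer's parameter list to skip their Adam moment buffers entirely.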
Unsolved:
1. Using `--joined-dictionary` in fairseq and trimming the embeddings should make training much faster, but I couldn't get sentencepiece's `SetVocabulary` to produce the correct restricted vocabulary. The sentencepiece maintainers have ignored my issue, so I may post something more detailed. Don't know who I'd send that to.
2. Had a good run with `--decoder_layerdrop=0.3`, but subsequent distillation wasn't any faster/better. Weird.
3. Can get a 30-40% speedup with dynamic batch size; might send a PR for that. It's the default in fairseq.
4. Getting the decoder_input_ids to look exactly like fairseq's (rather than off by one) doesn't change metrics at all (AFAICT).
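For context on the dynamic batch size point: instead of a fixed sentence count, fairseq-style batching caps the padded token count per batch, so short sentences pack densely and long ones don't blow up memory. A minimal sketch of the idea (the function name and greedy length-sorted grouping are mine, not fairseq's implementation):

```python
def batch_by_tokens(lengths, max_tokens):
    """Group example indices into batches so that each batch's padded size
    (batch size * longest example in the batch) stays under max_tokens.
    Sorting by length first keeps padding waste low."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, longest = [], [], 0
    for i in order:
        new_longest = max(longest, lengths[i])
        if current and (len(current) + 1) * new_longest > max_tokens:
            batches.append(current)          # close the batch before it overflows
            current, longest = [i], lengths[i]
        else:
            current.append(i)
            longest = new_longest
    if current:
        batches.append(current)
    return batches

batches = batch_by_tokens([5, 3, 7, 2, 9], max_tokens=12)
```

The speedup comes from short examples no longer being padded to the batch's fixed slot count; batches of short sentences simply hold more examples.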
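Item 4 refers to how decoder inputs are built by rotating the target sequence right (the last non-pad token wraps around to position 0). A pure-Python sketch for a single example, with made-up token ids for illustration (the real version operates on batched tensors):

```python
def shift_tokens_right(ids, pad_token_id):
    """Wrap the last non-pad token around to position 0 and shift the
    rest of the sequence one step to the right, leaving padding in place."""
    last = max(i for i, tok in enumerate(ids) if tok != pad_token_id)
    return [ids[last]] + ids[:last] + ids[last + 1:]

# e.g. a padded target sequence; 1 is the pad id here (an assumption)
shifted = shift_tokens_right([64, 65, 2, 250004, 1, 1], pad_token_id=1)
```

Being off by one here just means the decoder sees a slightly different alignment of inputs to targets, which evidently the model absorbs without any metric change.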
Lessons learned:
- Spent too much time trying different hparams rather than zooming out and thinking about what I wanted to accomplish in this project. We still got a lot out of it: a less memory-hungry dataset, better command line args, and support for MT in `seq2seq/finetune.py`.
- Should have run the Marian eval earlier, before I put so much time into mbart.