Pruning a model embedding matrix for memory efficiency

Okay, so I’ve worked everything out except the tokenizer. The model can be pruned and trained to perform quite well. Like I said above, I was getting extremely bad results, but it turns out that was due to my learning rate of 1e-5 being too high. I finally settled on a learning rate of 1e-8, and the model now actually converges. I suspect that adding an lr scheduler with warmup, like the fairseq models use, would resolve this issue, but I’m not sure how to do that with the Seq2SeqTrainer yet.
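As far as I can tell, warmup and the schedule are just configured through the training arguments rather than the trainer itself. A minimal sketch of what I mean (the output directory, hyperparameter values, and the `model` / `train_dataset` / `eval_dataset` / `tokenizer` objects are placeholders, and recent transformers versions may also offer a fairseq-style inverse-sqrt schedule):

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="pruned-mbart",    # placeholder output directory
    learning_rate=1e-5,           # peak lr reached after warmup
    lr_scheduler_type="linear",   # decay schedule after warmup
    warmup_steps=4000,            # linear warmup before decay
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,                  # the pruned model (defined elsewhere)
    args=training_args,
    train_dataset=train_dataset,  # assumed to exist
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
```

I haven’t verified yet whether warmup alone lets me go back to a higher peak learning rate, but that’s the direction I’d try.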

I still don’t know how to create a new tokenizer, but for the time being I’ve just defined a custom tokenizer that inherits from the main MBart50TokenizerFast class and adds three functions: one to build the mapping from the old dictionary to the pruned dictionary, and two to encode and decode using this new dictionary. Roughly, it looks like the sketch below. This may not be the “correct” way (which would be producing a new sentencepiece model), but it works well enough in my opinion. I’m still trying to figure out how to do that properly, but haven’t managed to yet.
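Something like this (the class and method names are simplified, and `kept_ids` is my name for the list of original vocabulary ids that survive pruning, in the order they appear in the pruned embedding matrix):

```python
from transformers import MBart50TokenizerFast

class PrunedMBart50Tokenizer(MBart50TokenizerFast):
    """MBart50 tokenizer that remaps token ids into a pruned vocabulary."""

    def set_pruned_vocab(self, kept_ids):
        # kept_ids: original vocabulary ids kept after pruning, in the
        # order they appear in the pruned embedding matrix.
        self.old_to_new = {old: new for new, old in enumerate(kept_ids)}
        self.new_to_old = {new: old for old, new in self.old_to_new.items()}

    def encode_pruned(self, text, **kwargs):
        # Encode with the full tokenizer, then remap into the pruned id space;
        # anything that was pruned away falls back to <unk> (assumed to be kept).
        unk = self.old_to_new[self.unk_token_id]
        return [self.old_to_new.get(i, unk) for i in super().encode(text, **kwargs)]

    def decode_pruned(self, pruned_ids, **kwargs):
        # Map pruned ids back to the original ids before decoding.
        return super().decode([self.new_to_old[i] for i in pruned_ids], **kwargs)
```

Usage is then just the normal `from_pretrained` plus one extra call, e.g.:

```python
tokenizer = PrunedMBart50Tokenizer.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX", tgt_lang="ro_RO"
)
tokenizer.set_pruned_vocab(kept_ids)  # kept_ids computed when pruning the embeddings
```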

I would like to upload the pruned and fine-tuned model to the Model Hub, but I’m unsure how that can be done without making a new sentencepiece model.