Pegasus fine-tuning: should we always start with pegasus-large?

I’m fine-tuning pegasus on my own data, which is about 15,000 examples.

I am finding, when fine-tuning with pegasus-large, that the memory requirements are so extreme that an Nvidia card with 16GB of memory is needed just to run a batch size of 1! So at this point I am thinking that my training might run better on the CPU, using a machine with a huge amount of RAM (like 512GB), since that seems to allow a much bigger batch size, like 64 or 128.
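For reference, the memory-saving setup I have been reading about looks roughly like this (just a sketch on my part, assuming a recent transformers release with the Seq2SeqTrainingArguments API; the output path and hyperparameters are made up):

```python
from transformers import Seq2SeqTrainingArguments

# Knobs that reduce GPU memory so pegasus-large can fit on a 16GB card (sketch)
training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-large-finetuned",  # made-up output path
    per_device_train_batch_size=1,         # what actually fits in GPU memory
    gradient_accumulation_steps=32,        # effective batch size of 32
    gradient_checkpointing=True,           # trade extra compute for less memory
    fp16=True,                             # half precision on the GPU
    learning_rate=5e-5,
    num_train_epochs=3,
)
```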

My guess is that the memory requirements are so extreme because I am using pegasus-large. I chose it based on my understanding of this page:
Pegasus

All the checkpoints are fine-tuned for summarization, besides pegasus-large, whence the other checkpoints are fine-tuned

My understanding from this is that if we, as newbie users, have some data we want to use with Pegasus, we should do this:

  1. Start with pegasus-large: google/pegasus-large · Hugging Face
  2. Fine-tune it on our own data
  3. Use the pytorch_model.bin output from this fine-tuning process to run inference on our own data (see the sketch after this list).
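
For concreteness, here is a rough sketch of how I picture those three steps with the Seq2SeqTrainer API (the tiny dataset, local paths, and hyperparameters below are placeholders, not my real data or settings):

```python
import torch
from transformers import (
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Step 1: start from the pre-trained (not yet fine-tuned) checkpoint
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Toy stand-in for the 15,000 (document, summary) pairs
docs = ["A long document that should be summarized."]
summaries = ["A short summary."]

class PairDataset(torch.utils.data.Dataset):
    """Tokenized (document, summary) pairs in the format the Trainer expects."""
    def __init__(self, docs, summaries):
        self.enc = tokenizer(docs, truncation=True, padding=True, return_tensors="pt")
        self.labels = tokenizer(summaries, truncation=True, padding=True, return_tensors="pt").input_ids
    def __len__(self):
        return self.labels.shape[0]
    def __getitem__(self, i):
        return {
            "input_ids": self.enc.input_ids[i],
            "attention_mask": self.enc.attention_mask[i],
            "labels": self.labels[i],
        }

# Step 2: fine-tune on our own data
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="my-pegasus",           # made-up local path
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=PairDataset(docs, summaries),
)
trainer.train()
trainer.save_model("my-pegasus")           # writes pytorch_model.bin + config

# Step 3: run inference from the fine-tuned checkpoint
finetuned = PegasusForConditionalGeneration.from_pretrained("my-pegasus")
batch = tokenizer(["another document to summarize"], truncation=True,
                  padding=True, return_tensors="pt")
print(tokenizer.batch_decode(finetuned.generate(**batch), skip_special_tokens=True))
```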

Am I getting something wrong here? Given that I have 15,000 examples, have I made the correct determination that I should fine-tune pegasus-large, and that this will lead to the best results, even though the memory requirements are huge?

I looked for distilled models here: Models - Hugging Face

… But my understanding (possibly wrong?) is that these distilled models are ALREADY fine-tuned, so they would not be appropriate to use, given that I have a lot of my OWN data to fine-tune with.

Thanks!

To answer your second question:

the student models are smaller versions of the fine-tuned Pegasus models, created by choosing alternating layers from the decoder. This method is described in the Pre-trained Summarization Distillation paper. To use these models you should fine-tune them.
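
Roughly, that layer-copying step looks like this (just an illustration with the transformers API, not the actual distillation script; the xsum teacher checkpoint and the half-depth student are only examples):

```python
import copy
from transformers import PegasusForConditionalGeneration

# Teacher: an already fine-tuned Pegasus (google/pegasus-xsum is only an example)
teacher = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# Student: the same architecture, but with half as many decoder layers
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = teacher.config.decoder_layers // 2
student = PegasusForConditionalGeneration(student_config)

# Copy every weight whose name and shape match (encoder, embeddings, etc.);
# the teacher's extra decoder layers are simply ignored here.
student.load_state_dict(teacher.state_dict(), strict=False)

# Then fill the student's decoder with every other teacher decoder layer
kept = range(0, teacher.config.decoder_layers, 2)
for student_idx, teacher_idx in enumerate(kept):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )

student.save_pretrained("pegasus-student")  # then fine-tune this on your own data
```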

Thanks for this. It shows that I am a newbie, but I am still confused about this.

Given that I have 15,000 examples of my own data to work with, I am still unclear whether I should be using pegasus-large, even though it has huge memory requirements, or whether there is a distilled version of Pegasus that I can use for fine-tuning on my own data.

It seems somehow illogical for me to START my fine-tuning from a model that has ALREADY been fine-tuned on another task… especially because my summarization task doesn’t closely match the existing tasks like xsum, cnn, etc.

I have been studying ML for 2 months now, and this basic question has escaped me: "Should I always start with a large model and fine-tune that, or is it general practice to start with a distilled model, which has already been fine-tuned, and essentially add ANOTHER layer of fine-tuning on top of that?"

And, yes, regarding these student models, I have looked at them, like this one:

But again, it looks like these student models have already been fine-tuned; for example, the one above seems to be fine-tuned on xsum.

it looks like these student models have already been fine-tuned,

These are created using the large fine-tuned model.

You could also fine-tune the large model, and then distill that model into a student and fine-tune some more to get a smaller model.