Hardware Requirements | Fine-tuning Pegasus Large

I am using the code here for fine-tuning Pegasus Large. I am doing this on an EC2 instance with 8 GPUs of 12 GB each, so 96 GB in total across all 8 GPUs. However, I still get a CUDA out-of-memory error even with a batch size of 1.
I would like to know whether there are any recommended hardware requirements for fine-tuning Pegasus Large.
I have a dataset with around 1000 articles and their summaries.
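For context, here is a stripped-down sketch of what my training script does (the dataset wrapper and hyperparameters are simplified placeholders, not my exact code; the checkpoint is the standard `google/pegasus-large`):

```python
# Rough sketch of the fine-tuning setup (simplified; placeholders
# for dataset loading and hyperparameters).
import torch
from transformers import (
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


class SummaryDataset(torch.utils.data.Dataset):
    """Pairs tokenized articles with tokenized summaries as labels."""

    def __init__(self, articles, summaries):
        self.inputs = tokenizer(articles, truncation=True, padding=True)
        self.labels = tokenizer(summaries, truncation=True, padding=True)

    def __len__(self):
        return len(self.labels["input_ids"])

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.inputs.items()}
        item["labels"] = torch.tensor(self.labels["input_ids"][idx])
        return item


# articles and summaries are plain lists of strings (~1000 pairs)
train_dataset = SummaryDataset(articles, summaries)

training_args = Seq2SeqTrainingArguments(
    output_dir="./pegasus-large-finetuned",
    per_device_train_batch_size=1,  # already at 1 and still OOM
    num_train_epochs=3,
    save_strategy="epoch",
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```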
Does the Hugging Face Trainer class make use of all GPUs? Are there any settings or parameters we need to set to make this happen?
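My current understanding (please correct me if it's wrong) is that when the script is launched with plain `python`, Trainer wraps the model in `torch.nn.DataParallel` across all visible GPUs, and that `torchrun --nproc_per_node=8` is needed to get one process per GPU with DistributedDataParallel. If that's right, then neither mode pools memory, so each GPU still has to fit the whole model in its own 12 GB, regardless of the 96 GB total. Either way, a quick sanity check that all eight GPUs are actually visible:

```python
import torch

# Should print 8 on this instance if all GPUs are visible to PyTorch
print(torch.cuda.device_count())
```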
I found this issue on Google's original GitHub repo, but I'm not sure whether it applies to the Hugging Face libraries.