CUDA out of memory error while predicting (evaluation)


I have been experimenting with fine-tuning the ‘facebook/bart-large-mnli’ model. Since it is very large, it barely fits on my GPU: I had to reduce the batch size to 1 through TrainingArguments, but training succeeded. However, now that I simply want to produce outputs with that model, I keep running into CUDA out of memory errors. I tried several things, but nothing helped. The only thing I found is that prediction works on a dataset of size 1, but even then the data stays on the GPU afterwards, despite the fact that I run it under torch.inference_mode()/torch.no_grad().

It feels like the problem is in how the model's outputs are handled: they seem to be kept on the GPU even though I explicitly try to avoid that. Any suggestions?
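
For reference, here is a minimal sketch of a prediction loop that keeps only one batch on the GPU at a time and copies each batch's outputs to the CPU before moving on. It assumes plain PyTorch rather than the Trainer; the helper name predict_in_batches and the tiny nn.Linear model are hypothetical stand-ins for the fine-tuned model, not the actual setup described above.

```python
import torch
from torch import nn


def predict_in_batches(model, inputs, batch_size=8, device=None):
    """Run inference batch by batch and keep only CPU copies of the outputs."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()

    cpu_outputs = []
    with torch.inference_mode():  # no autograd graph is recorded
        for start in range(0, inputs.shape[0], batch_size):
            batch = inputs[start:start + batch_size].to(device)
            logits = model(batch)
            # Copy the result to host memory so it does not pile up on the GPU.
            cpu_outputs.append(logits.cpu())
            # Drop the GPU references before the next iteration.
            del batch, logits
    return torch.cat(cpu_outputs)


# Hypothetical stand-in for the fine-tuned model: 16 features, 3 classes.
toy_model = nn.Linear(16, 3)
toy_inputs = torch.randn(20, 16)

predictions = predict_in_batches(toy_model, toy_inputs, batch_size=4)
print(predictions.shape)  # torch.Size([20, 3])
```

When predicting through the Trainer instead, the comparable knobs in TrainingArguments are per_device_eval_batch_size (set separately from the training batch size) and eval_accumulation_steps, which moves accumulated predictions to the CPU every given number of steps. Whether either of these is what goes wrong here is a guess.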

Thanks for the help. This issue has been bugging me for over a week and I can't find a solution.