GPU inference slows down if done in a loop

Hi, I have noticed that inference is very fast when the model runs on a single batch. However, once inference is run in a loop, even on the same input, it slows down significantly.

I have seen the same behaviour with TensorFlow models. Is this expected behaviour, or is it an issue with CUDA or something similar?

Please see this notebook, which reproduces the issue:
https://colab.research.google.com/drive/1gqSzQqFm8HL0OwmJzSRlcRFQ3FOpnvFh?usp=sharing


This is because Python is a slow language. To get good performance, you generally want to avoid looping in Python and instead pass your inputs as a single batch, so that the hardware is fully used.
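To make the loop-versus-batch point concrete, here is a rough sketch. It assumes PyTorch, and the `Linear` model, tensor sizes, and the `sync` helper are made up for illustration; they are not taken from the notebook. It runs the same 64 samples once per sample in a Python loop and once as a single batch, and times both.

```python
import time
import torch

# Hypothetical stand-in model and data; the notebook's model is not reproduced here.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(256, 10).to(device).eval()
inputs = torch.randn(64, 256, device=device)

def sync():
    # On GPU, wait for queued work to finish so the timer actually measures it.
    if device == "cuda":
        torch.cuda.synchronize()

with torch.no_grad():
    # One forward pass per sample, driven by a Python loop.
    sync()
    start = time.perf_counter()
    looped = torch.cat([model(x.unsqueeze(0)) for x in inputs])
    sync()
    loop_time = time.perf_counter() - start

    # One forward pass over the whole batch.
    sync()
    start = time.perf_counter()
    batched = model(inputs)
    sync()
    batch_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s  batch: {batch_time:.4f}s")
```

Both paths should give the same outputs up to floating-point noise; only the number of separate calls into the model differs. The actual timings will depend on your hardware and model, so treat the numbers as indicative only.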
