GPU inference slows down if done in a loop

Hi, I have noticed that inference time is very quick when running the model on a single batch. However, once inference is run in a loop - even on the same input - it slows down significantly.

I have actually seen the same behaviour with TensorFlow models. Is this expected behaviour, or is it an issue with CUDA etc.?

Please see the attached notebook to reproduce the issue.


This is because Python loops are slow: every iteration pays interpreter overhead and launches a separate small computation. To get performance, you generally want to avoid looping over samples in Python and instead pass the inputs as one batch, so a single call can use the full throughput of your hardware.
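To make the point concrete, here is a minimal sketch using NumPy on CPU (the model, weights, and sizes are made up for illustration - the same principle applies to a GPU framework): looping over samples in Python calls the kernel once per sample, while a single batched call does all the work inside optimized native code.

```python
import time
import numpy as np

# Hypothetical "model": one matrix multiply standing in for a forward pass.
rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512))
inputs = rng.standard_normal((256, 512))  # 256 samples of dimension 512

# Per-sample Python loop: interpreter overhead on every iteration.
t0 = time.perf_counter()
loop_out = np.stack([x @ weights for x in inputs])
loop_time = time.perf_counter() - t0

# One batched call: the loop over samples happens in native code.
t0 = time.perf_counter()
batch_out = inputs @ weights
batch_time = time.perf_counter() - t0

# Both paths compute the same result; the batched call is typically faster.
assert np.allclose(loop_out, batch_out)
print(f"loop:  {loop_time:.4f}s")
print(f"batch: {batch_time:.4f}s")
```

On a GPU the gap is usually much larger, because each looped call also pays a kernel-launch cost and leaves most of the device idle.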

1 Like