GPU inference slows down if done in a loop

Hi, I have noticed that inference time is very quick when running the model on a single batch. However, once inference is run in a loop, even on the same input, it slows down significantly.

I have actually seen the same behaviour with TensorFlow models. Is this expected behaviour, or is it an issue with CUDA, etc.?

Please see the notebook below, which reproduces the issue:
https://colab.research.google.com/drive/1gqSzQqFm8HL0OwmJzSRlcRFQ3FOpnvFh?usp=sharing
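Roughly, the pattern looks like this (a minimal sketch, not the actual notebook code; it assumes a small PyTorch model on a CUDA device):

```python
import time
import torch

# Hypothetical model and input -- the real ones are in the notebook.
device = "cuda"  # assumes a GPU is available
model = torch.nn.Linear(512, 512).to(device).eval()
x = torch.randn(32, 512, device=device)

with torch.no_grad():
    # Single forward pass: fast.
    torch.cuda.synchronize()  # CUDA kernels run asynchronously, so sync before timing
    start = time.perf_counter()
    model(x)
    torch.cuda.synchronize()
    print("one call:", time.perf_counter() - start)

    # Same input called repeatedly in a loop: noticeably slower per call.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
    print("per call in loop:", (time.perf_counter() - start) / 100)
```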


This is because Python loops are slow: each iteration adds Python and per-call overhead. To get the full performance of your hardware, you generally want to avoid looping over inputs one at a time in Python and instead pass them to the model as a single batch.
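For example (a minimal sketch assuming a PyTorch model and a CUDA device, not your notebook), compare looping over single samples with one batched forward pass:

```python
import torch

device = "cuda"  # assumes a GPU is available
model = torch.nn.Linear(512, 512).to(device).eval()
samples = [torch.randn(1, 512, device=device) for _ in range(256)]

with torch.no_grad():
    # Python loop: 256 separate forward passes, each paying Python and kernel-launch overhead.
    outputs_loop = [model(s) for s in samples]

    # Batched: one forward pass over all 256 samples at once.
    batch = torch.cat(samples, dim=0)   # shape (256, 512)
    outputs_batched = model(batch)
```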
