Hi!
I have found two discussions about an interesting behavior where the very first batch takes more time than the subsequent ones:
- "GPU memory consumption and inference time is higher for first inference" on Stack Overflow
- https://groups.google.com/g/torch7/c/tuQF7lSNU7Y
I did an experiment:
from transformers import AutoModel, AutoTokenizer
import time

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# create dummy input
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")

for i in range(5):
    for j in range(5):
        start = time.time()
        output = model(**input_ids.to("cuda"))
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)
    print("==============")
and I got the following output:
0 : 0.23949980735778809
1 : 0.005326509475708008
2 : 0.005161285400390625
3 : 0.005240917205810547
4 : 0.005146026611328125
==============
0 : 0.028321504592895508
1 : 0.005590200424194336
2 : 0.005166530609130859
3 : 0.0049932003021240234
4 : 0.004933595657348633
==============
0 : 0.023855924606323242
1 : 0.0057294368743896484
2 : 0.005516767501831055
3 : 0.006219148635864258
4 : 0.005515336990356445
==============
0 : 0.020789623260498047
1 : 0.005784273147583008
2 : 0.0058591365814208984
3 : 0.005324840545654297
4 : 0.005706787109375
==============
0 : 0.02687668800354004
1 : 0.008355855941772461
2 : 0.006246089935302734
3 : 0.006096601486206055
4 : 0.005678415298461914
==============
As you can see, I was able to reproduce this behavior. What is very interesting is that after sleeping, some cache seems to be reset, so the first iteration becomes slow again.
And my question is: is it possible to remove this behavior using Optimum or other acceleration techniques?
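For context, the only workaround I have come up with so far is a manual warm-up pass right before the real requests. This is just a rough sketch (the warm_up helper and its arguments are my own naming, and I haven't verified that it fully removes the gap):

import torch

def warm_up(model, tokenizer, n=3):
    # run a few dummy forward passes so CUDA kernels / caches get initialized
    dummy = tokenizer("warm-up", return_tensors="pt").to("cuda")
    with torch.no_grad():
        for _ in range(n):
            model(**dummy)
    torch.cuda.synchronize()  # make sure the warm-up work has actually finished

But this feels like a hack, and it would not help if the slowdown comes back after an idle period, as in my experiment above.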