Is it possible to make the first batch as fast as the subsequent ones?

Hi!

I have found two discussions about an interesting behavior where the very first batch takes more time than the subsequent ones:

I did an experiment:

from transformers import AutoModel, AutoTokenizer
import time


model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# create a dummy input (a BatchEncoding holding input_ids, attention_mask, ...)
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")


for i in range(5):
    for j in range(5):
        start = time.time()
        output = model(**inputs.to("cuda"))
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)  # let the GPU sit idle before the next inner loop
    print("==============")

and I got the following output:

0 : 0.23949980735778809
1 : 0.005326509475708008
2 : 0.005161285400390625
3 : 0.005240917205810547
4 : 0.005146026611328125
==============
0 : 0.028321504592895508
1 : 0.005590200424194336
2 : 0.005166530609130859
3 : 0.0049932003021240234
4 : 0.004933595657348633
==============
0 : 0.023855924606323242
1 : 0.0057294368743896484
2 : 0.005516767501831055
3 : 0.006219148635864258
4 : 0.005515336990356445
==============
0 : 0.020789623260498047
1 : 0.005784273147583008
2 : 0.0058591365814208984
3 : 0.005324840545654297
4 : 0.005706787109375
==============
0 : 0.02687668800354004
1 : 0.008355855941772461
2 : 0.006246089935302734
3 : 0.006096601486206055
4 : 0.005678415298461914
==============

As you can see, I was able to reproduce this behavior. What is very interesting is that after the sleep some cache seems to get reset: the first iteration after every pause is slow again.
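(One caveat I'm aware of: CUDA execution is asynchronous, so time.time() around the forward call may include launch and transfer overhead rather than pure GPU time. If it matters, this is roughly how I would re-measure with CUDA events; start_evt / end_evt are just my names, and model / inputs come from the script above:)

import torch

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()  # make sure no earlier GPU work is still running
start_evt.record()
output = model(**inputs.to("cuda"))
end_evt.record()
torch.cuda.synchronize()  # block until the forward pass has actually finished
print("forward took", start_evt.elapsed_time(end_evt), "ms")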

And yes, if I run the same loop on the CPU, all batches take approximately the same time to compute.

So my question is: is it possible to remove this first-batch overhead using Optimum or other acceleration techniques?
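For example, would an explicit warmup pass before the timed loop be a reasonable approach? A minimal sketch of what I have in mind (this assumes the overhead comes from one-time CUDA/kernel initialization; warmup is a hypothetical helper of mine, and model / inputs are from my script above):

import torch

def warmup(model, inputs, n=3):
    # a few throwaway forward passes, so that CUDA kernels and
    # library handles are initialized before the real measurements
    with torch.no_grad():
        for _ in range(n):
            model(**inputs.to("cuda"))
    torch.cuda.synchronize()  # wait until the warmup work has really finished

warmup(model, inputs)
# ...timed loop from above goes here; though I suspect this would not
# help against the slowdown that reappears after the 40 s idle sleep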