Overhead caused by moving eos_token_id to GPU memory

Hi, I'm considering using transformers for a real-time service (calling model.generate every second or more often).
But in greedy search, there is a line of code that moves the EOS token to GPU memory, and it takes 170 ms (A100 GPU).
The exact line of code is eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
A cost of 170 ms in every 1 s inference is quite large, and moving the same token to GPU memory on every call seems inefficient.
How can I avoid this?

I'm pretty sure that eos_token_id is an integer here, not a torch tensor. The conversion from an integer to a list and then to a torch tensor via torch.tensor(eos_token_id) is the more likely reason that line takes so long. However, I don't think the overhead should be that significant: the ratio of the time taken by the line you mentioned to the time taken by model.generate() seems too large.

Based on my experiments:

# timer, model and input_ids are defined elsewhere (not shown in this post)
@timer
def test1():
    eos_token_id_tensor = torch.tensor([model.generation_config.eos_token_id]).to(input_ids.device)

@timer
def test2():
    output = model.generate(input_ids.unsqueeze(0), max_new_tokens=1)

test1()
test2()

Output:

Total execution time for test1: 1 ms
Total execution time for test2: 1029 ms

My ratio is 1:1029, in comparison to your 170:1000. I think you should test it out again.

Which GPU do you use? On my A100 GPU, the following code shows 260 ms of latency on the first load() call.

import torch
from datetime import datetime

def load():
    # Time a single small host-to-GPU transfer
    start = datetime.now()
    token = [0]
    torch.tensor(token).to("cuda:0")
    end = datetime.now()
    print((end - start).total_seconds())

for i in range(5):
    load()

The initial performance overhead is a known issue with PyTorch. The result of the first-time setup is cached, which allows the next few similar tensors to be created much faster. Unless you are running model.generate() with max_new_tokens=1, this overhead becomes less noticeable as the number of generated tokens increases.

I can also reproduce your problem on a T4 Google Colab GPU, but technically speaking it isn't a problem if you're generating a lot of tokens, and the significant overhead is no longer there when you run model.generate() a second time.

Yeah, I agree with your comment, but in my case I have to run model.generate() every second or maybe more often, since I'm considering a real-time application with Whisper.
The model is served with PyTorch and the overhead occurs on every run. This is the problem.

Can you try model.generate() on random input_ids for a couple of iterations, and see whether the overhead remains even after the first iteration? I don’t exactly know how you are running your model, but do you also reset the cache/instance whenever you want to get an inference?

Yeah, I tried test code the way you explained, as below (code 1), but the result was the same. In every run, the first tensor movement to the GPU costs 0.8 s.
I modified the transformers code (in generation/utils.py) to measure the latency (code 2).

Code 1.

import pickle

import torch
from transformers import WhisperForConditionalGeneration

# model_folder_path and feature_path are set elsewhere (not shown in this post)
model: WhisperForConditionalGeneration = WhisperForConditionalGeneration.from_pretrained(pretrained_model_name_or_path=model_folder_path).to("cuda:0")

with open(feature_path, 'rb') as f:
    input_features = pickle.load(f)
audio_feature: torch.Tensor = torch.as_tensor(input_features).to("cuda:0")
for _ in range(5):
    model.generate(audio_feature)

Code 2.

eos_token_id_tensor_start = time.perf_counter()
eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
eos_token_id_tensor_end = time.perf_counter()

Could you try synchronizing the GPU while timing the code, using the method mentioned here?

Also, may I ask why you are using pickle.load for your input features? What is the data type of the input features before you convert them via torch.as_tensor?

There's no specific purpose in using pickle.load in my test code; it is just there to load an audio feature extracted from real data instead of a randomly generated tensor.
The type of the data is List[float], with a shape of (1, 80, 3000). Since the request body of my server (FastAPI) has to be List[float] (the request comes from a feature-extraction server), I built the test code with the list type.
BTW, I don't understand what you mentioned above. Do you mean that adding torch.cuda.synchronize() to the transformers code will improve performance?

torch.cuda.synchronize() will not improve performance, but it does ensure that the timing we record is consistent. The function waits for all kernels in all CUDA streams to complete, which is why it may also introduce some overhead of its own.
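
To make the timing suggestion concrete, here is a rough sketch (not code from this thread). CUDA kernels are launched asynchronously, so a host-side timer around a line that forces the host to wait for the GPU can end up absorbing time from work that was queued earlier. The helper below synchronizes before starting the clock and again before stopping it, so each measurement should cover only the work launched by the call itself. The name time_iterations and the fallback for machines without PyTorch or a CUDA device are my own assumptions; in that case the synchronization step is simply skipped.

```python
import time
from typing import Any, Callable, List

try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def _sync() -> None:
    """Wait for queued GPU work, if PyTorch and a CUDA device are available."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def time_iterations(fn: Callable[[], Any], iterations: int = 5) -> List[float]:
    """Return the wall-clock latency in milliseconds of each call to fn."""
    latencies_ms = []
    for _ in range(iterations):
        _sync()  # do not charge earlier queued GPU work to this call
        start = time.perf_counter()
        fn()
        _sync()  # make sure the work launched by fn has actually finished
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

For example, time_iterations(lambda: model.generate(audio_feature)) would return one latency per call, which makes it easy to see whether only the first iteration is slow or every iteration is.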

For code 2 (from the transformers code), maybe you could create the tensor on the correct device (cuda) directly via eos_token_id_tensor = torch.tensor(eos_token_id, device=input_ids.device) if eos_token_id is not None else None? I don't have a GPU with me right now, so I can't test this hypothesis.
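
A small sketch of that idea, combined with caching so the same EOS tensor is not rebuilt on every call. This is an untested assumption on my side, not something transformers does as far as this thread shows: the helper eos_tensor and the value EXAMPLE_EOS_ID are hypothetical, and using it would mean patching the library or writing your own generation loop. I have not measured whether it removes the latency reported here.

```python
import functools
from typing import Tuple

import torch

EXAMPLE_EOS_ID = 2  # hypothetical value, not taken from any real model config


@functools.lru_cache(maxsize=None)
def eos_tensor(eos_token_ids: Tuple[int, ...], device: str) -> torch.Tensor:
    """Build the EOS tensor on `device` once per (ids, device) pair."""
    # device= places the tensor on the target device in a single call;
    # the data still has to cross from host to device the first time.
    return torch.tensor(eos_token_ids, device=device)


# Fall back to the CPU so the sketch also runs without a GPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
first = eos_tensor((EXAMPLE_EOS_ID,), device)
second = eos_tensor((EXAMPLE_EOS_ID,), device)  # served from the cache
print(first is second)
```

The ids are passed as a tuple because lru_cache needs hashable arguments; the device is passed as a string for the same reason.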

I broke down the transformers code and got some findings.

  1. There are 2 tensor operations across CPU and GPU memory, and they are the bottlenecks.
  2. The first one is in the _prepare_attention_mask_for_generation function, which checks whether an int (CPU) is in a tensor (GPU) → (pad_token_id in inputs): 0.8 s on an RTX A2000 (laptop).
  3. The second one is what I mentioned above, moving the EOS token to the GPU (0.8 s).
  4. The strange thing is that the latency in 2 gets lower in subsequent inference iterations, but 3 stays the same.

I guess GPU warm-up helps with 2 and with the encoding process (which takes 1.3 s on the first try and then drops to 12 ms; the exact code is model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)), but for 3 there seems to be no improvement in subsequent runs of the same process.

I noticed an issue similar to yours that has remained open at PyTorch’s repo: model->to(device) costs over a millisecond when doing nothing · Issue #23865 · pytorch/pytorch · GitHub

Thank you for the reply!
As far as I understand, we can't avoid the overhead, am I right?
I guess we have to accept that amount of latency in a real-time inference scenario.

Yep, seems to be the case as of now.