Overhead caused by moving eos_token_id to gpu mem

sangyoonlee · January 26, 2024, 10:52am

Hi, I’m considering using transformers for a real-time service (calling model.generate every second or more often).
But in greedy search, there’s a line of code that moves the EOS token to GPU memory, and it takes 170 ms (A100 GPU).
The exact line of code is eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None in here
A cost of 170 ms in every 1 s inference is quite large, and I think moving the same token to GPU memory on every call is inefficient.
How can I avoid this?

DenseLance · January 28, 2024, 12:36pm

Pretty sure that eos_token_id is an integer here, not a torch tensor. The conversion from an integer to a list and then to a torch tensor via torch.tensor(eos_token_id) is the more likely reason why that line takes so long. However, I think the overhead should not be that significant (i.e. your ratio of the time taken by that line to the time taken by model.generate() looks too large).

Based on my experiments:

@timer
def test1():
    eos_token_id_tensor = torch.tensor([model.generation_config.eos_token_id]).to(input_ids.device)

@timer
def test2():
    output = model.generate(input_ids.unsqueeze(0), max_new_tokens = 1)

test1()
test2()

Output:

Total execution time for test1: 1 ms
Total execution time for test2: 1029 ms

My ratio is 1:1029, in comparison to your 170:1000. I think you should test it out again.

sangyoonlee · January 29, 2024, 6:35am

Which GPU do you use? On my A100 GPU, the following code shows 260 ms of latency on the first load() call.

import torch
from datetime import datetime

def load():
    # Time a single host-to-device copy of a one-element tensor
    start = datetime.now()
    token = [0]
    torch.tensor(token).to("cuda:0")
    end = datetime.now()
    print((end - start).total_seconds())

for i in range(5):
    load()

DenseLance · January 29, 2024, 8:57am

The initial overhead is known PyTorch behavior: CUDA is initialized lazily, so the first CUDA operation in a process pays a one-time setup cost, and similar tensors created afterwards are much faster. Unless you are running model.generate() with max_new_tokens = 1, this overhead becomes less noticeable as the number of generated tokens increases.

I can also reproduce your problem on a T4 Google Colab GPU, but technically speaking it isn’t a problem if you’re generating a lot of tokens, and the significant overhead will no longer be there when you run model.generate() a second time.
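
If the one-time cost is the concern, one option is to pay it at startup, before the first real request. The sketch below is a minimal illustration, not something from the thread: warm_up and first_cuda_touch are hypothetical helper names, and the torch import is done lazily only so that the helper also runs on a machine without PyTorch.

```python
import time


def warm_up(run_once, iterations=3):
    """Call run_once several times and return each call's latency in seconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    return latencies


def first_cuda_touch():
    """Hypothetical warm-up step: a tiny host-to-device copy, as in the thread."""
    try:
        import torch  # lazy import: warm_up() itself does not need PyTorch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.tensor([0]).to("cuda:0")
        torch.cuda.synchronize()  # wait for the copy to actually finish
```

Calling warm_up(first_cuda_touch) once when the server starts and logging the returned list would show whether only the first entry is slow. For a real service, run_once would more usefully be a full model.generate() call on a dummy input.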

sangyoonlee · February 1, 2024, 5:21am

Yeah, I agree with your comment, but in my case I have to run model.generate() every second or maybe more often, since I’m considering a real-time application with Whisper.
The model is served with PyTorch and the overhead occurs on every run. This is the problem.

DenseLance · February 1, 2024, 6:40am

Can you try model.generate() on random input_ids for a couple of iterations, and see whether the overhead remains even after the first iteration? I don’t exactly know how you are running your model, but do you also reset the cache/instance whenever you want to get an inference?

sangyoonlee · February 1, 2024, 9:34am

Yeah, I tried the test code the way you explained, as below (code 1), but the result was the same. Every first tensor movement to the GPU costs 0.8 s.
I modified the transformers code (in generation/utils.py) to measure the latency (code 2).

Code 1.

import pickle
import torch
from transformers import WhisperForConditionalGeneration

# model_folder_path and feature_path are set elsewhere (not shown in the post)
model: WhisperForConditionalGeneration = WhisperForConditionalGeneration.from_pretrained(pretrained_model_name_or_path=model_folder_path).to("cuda:0")

with open(feature_path,'rb') as f:
    input_features = pickle.load(f)
audio_feature: torch.Tensor = torch.as_tensor(input_features).to("cuda:0")
for _ in range(5):
    model.generate(audio_feature)

Code 2.

eos_token_id_tensor_start = time.perf_counter()
eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
eos_token_id_tensor_end = time.perf_counter()

DenseLance · February 1, 2024, 1:25pm

Could you try this method where you synchronize the GPU while timing the speed of the code via what was mentioned here?

Also, may I ask why are you using pickle.load for your input features? What is the data type of the input features before converting it via torch.as_tensor?

sangyoonlee · February 2, 2024, 6:21am

There’s no specific reason for using pickle.load in my test code; I just wanted audio features extracted from real data instead of a randomly generated tensor.
The data type is List[float], with a shape of (1, 80, 3000). Since the request body of my server (FastAPI) has to be List[float] (the request comes from a feature-extraction server), I built the test code with the list type.
BTW, I don’t understand what you mentioned above. Do you mean that adding torch.cuda.synchronize() to the transformers code will improve performance?

DenseLance · February 2, 2024, 7:22am

torch.cuda.synchronize() will not improve performance, but it does ensure that the timing that we have recorded is consistent. The function waits for all kernels in all CUDA streams to complete, which is why it may also introduce overhead during kernel launch.
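
A minimal sketch of timing with synchronization, assuming nothing beyond the standard library and, optionally, PyTorch. The names synced_timer and cuda_sync are hypothetical, and the sync parameter exists only so that the helper can be exercised without a GPU.

```python
import time
from contextlib import contextmanager


def cuda_sync():
    """Block until queued CUDA work is done; no-op without PyTorch or a GPU."""
    try:
        import torch  # lazy import so the timer also works without PyTorch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.synchronize()


@contextmanager
def synced_timer(label, results, sync=cuda_sync):
    """Store the wall-clock duration of the with-block in results[label]."""
    sync()  # finish earlier kernels so they are not billed to this block
    start = time.perf_counter()
    try:
        yield
    finally:
        sync()  # wait for the kernels launched inside the block
        results[label] = time.perf_counter() - start
```

Wrapping the line under test in a with synced_timer("eos_to_gpu", results) block stores its duration in seconds under results["eos_to_gpu"]. Without the two sync() calls, work queued earlier on the GPU can be attributed to whichever line happens to wait for it.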

DenseLance · February 2, 2024, 7:27am

For code 2 (from transformers code), maybe you could convert the tensor to the correct device (cuda) directly via eos_token_id_tensor = torch.tensor(eos_token_id, device = input_ids.device) if eos_token_id is not None else None? I don’t have a GPU with me right now so I can’t test this hypothesis out.
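
Since the EOS id and the device do not change between requests, another option is to build the tensor once and reuse it. This is a sketch under assumptions: make_device_cache, build_eos_tensor and get_eos_tensor are hypothetical names, and because transformers creates this tensor inside its own generation code, using a cached copy would require patching that code or writing a custom decoding loop. Whether it removes the latency reported in this thread is untested here.

```python
def make_device_cache(build):
    """Return a getter that calls build once per (token ids, device) pair."""
    cache = {}

    def get(token_ids, device):
        key = (tuple(token_ids), str(device))
        if key not in cache:
            cache[key] = build(list(token_ids), device)
        return cache[key]

    return get


def build_eos_tensor(token_ids, device):
    """Create the tensor directly on the target device, as suggested above."""
    import torch  # lazy import: the cache helper itself has no PyTorch dependency
    return torch.tensor(token_ids, device=device)


# Built on first use, then reused for every later request on the same device.
get_eos_tensor = make_device_cache(build_eos_tensor)
```

A call such as get_eos_tensor([model.generation_config.eos_token_id], input_ids.device) would then allocate on the first request only; later calls return the same tensor object.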

sangyoonlee · February 2, 2024, 7:31am

I broke down the transformers code and have some findings.

1. There are 2 tensor operations across CPU and GPU memory, and they are the bottlenecks.
2. The first one is in the _prepare_attention_mask_for_generation function, which checks whether an int (CPU) is in a tensor (GPU) → (pad_token_id in inputs): 0.8 s on an RTX A2000 (laptop).
3. The second one is what I mentioned above, moving the EOS token to the GPU (0.8 s).

The strange thing is that the latency of 2 drops in subsequent inference iterations, but 3 stays the same.

I guess GPU warm-up helps with 2 and with the encoding process (which takes 1.3 s on the first try and then drops to 12 ms; the exact code is model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)), but for 3 there seems to be no improvement in subsequent runs.

DenseLance · February 2, 2024, 7:47am

I noticed an issue similar to yours that has remained open at PyTorch’s repo: model->to(device) costs over a millisecond when doing nothing · Issue #23865 · pytorch/pytorch · GitHub

sangyoonlee · February 6, 2024, 8:00am

Thank you for the reply!
As far as I understand, we can’t avoid the overhead, am I right?
I guess we have to accept that amount of latency in a real-time inference scenario.

DenseLance · February 7, 2024, 2:37am

Yep, seems to be the case as of now.
