Is the inference input token length always set to the max length?

I am a little confused. In transformer-based LLM inference, do we always set the input length to the max input length and pad with zeros, or do we dynamically use the actual input length and reduce the computation?

It depends on how you are running inference. When batching, you pad to the length of the longest sequence in the batch. For accelerators, you might pad to the longest sequence rounded up to a multiple of 8. For single-input inference you just pass the sequence without padding. One thing to make sure of: for text generation, pad on the left, since models can have a hard time starting generation after a pad token. But some models use right-side padding, like CodeLlama. The padding side is the one thing I wish I had known earlier. See the sketch below.
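For a concrete picture, here is a minimal sketch using the Hugging Face `transformers` tokenizer API (the model name and prompts are just placeholders for illustration): left padding for generation, dynamic padding to the longest prompt in the batch, and an accelerator-friendly variant that rounds up to a multiple of 8.

```python
from transformers import AutoTokenizer

# Placeholder model; any causal LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

prompts = ["Hello!", "Write a short poem about the sea."]

# Dynamic padding: pad only up to the longest prompt in this batch.
batch = tokenizer(prompts, padding=True, return_tensors="pt")

# Accelerator-friendly variant: round the padded length up to a multiple of 8.
batch_mult8 = tokenizer(prompts, padding=True, pad_to_multiple_of=8,
                        return_tensors="pt")

print(batch["input_ids"].shape)        # (2, longest_prompt_len)
print(batch_mult8["input_ids"].shape)  # (2, longest_prompt_len rounded up to 8)
```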

Thanks for your reply. Let’s say in a chat server, such as GPT or HuggingChat: do they always pad to the max length, or just pass the input without padding? I ask because I want to understand efficient inference down to the hardware level. Is it difficult to design the system to dynamically detect each request’s input token length and pass it through without padding?

By default it passes the sequence without padding. I am not sure about GPT, but in general, if you are running inference on a single sequence, it does not pad. It can be different for devices that compile the model with XLA; in that case you need to pad so that the model does not have to recompile for every new input shape.
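As a rough sketch of the difference (same `transformers` API as above; the fixed length of 128 is just an assumed bucket size, not anything the libraries require): single-sequence eager inference can pass the true length through, while an XLA/compiled setup would pad every request to a fixed shape so the compiled graph can be reused.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

prompt = "What is the capital of France?"

# Single-sequence, eager execution: no padding, the model sees the true length.
dynamic = tokenizer(prompt, return_tensors="pt")

# XLA/compiled setup: pad every request to a fixed length (128 is an assumed
# bucket size) so the input shape, and hence the compiled graph, never changes.
static = tokenizer(prompt, padding="max_length", max_length=128,
                   return_tensors="pt")

print(dynamic["input_ids"].shape)  # (1, actual_prompt_len)
print(static["input_ids"].shape)   # (1, 128)
```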

Thanks again! That makes sense. I’m not familiar with XLA, but if it is similar to cuBLAS etc., then I think there is a trade-off between the compile time and the extra padding computation time, right? Is this true for GPUs? If we pass inputs without padding, do we need to recompile every time the input length is different?

By the way, what do you mean by a single sequence? Does it mean we only feed one question into the model and run inference on it? In other words, for consecutive questions, do we need to set a fixed length?

I am not sure about GPUs: when I use batching I pad to the maximum sequence length, and for single-sequence inference I just leave the input unchanged. For TPUs it is specifically mentioned in the introduction, so that’s how I know.

Thanks! Could I ask why batching should pad to the maximum sequence length?

Because a batch has to be one rectangular tensor, every sequence in it must end up the same length. If we don’t pad, the batch gets truncated to the shortest sequence, which cuts off the rest of every sequence longer than the shortest one. That loses input from those sequences and produces unexpected output. See the sketch below.
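A small sketch of what this looks like with the `transformers` tokenizer when you try to batch unequal-length prompts without padding (the exact error message may vary by version): the batch cannot be stacked into a single tensor, while padding to the longest sequence keeps every token of the longer prompt and marks the pad positions in the attention mask.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

prompts = ["Hi", "Explain the attention mechanism in one paragraph."]

try:
    # Unequal lengths with no padding cannot be stacked into one tensor.
    tokenizer(prompts, return_tensors="pt")
except ValueError as e:
    print("Batching without padding failed:", e)

# Padding to the longest sequence keeps every token of the longer prompt.
batch = tokenizer(prompts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, len_of_longest_prompt)
print(batch["attention_mask"])   # zeros mark the padded positions
```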

Oh, I understand, thanks!

These are the sources I learned from; they might help you.