Is the inference input token length always set to the max length?

I am a little confused. In transformer-based LLM inference, do we always set the input length to the max input length and pad with zeros, or do we dynamically use the actual input length and reduce the computation?

It depends on how you are running inference. When batching, you pad to the length of the longest sequence in the batch. For accelerators, you might pad to the longest sequence rounded up to a multiple of 8. For single-input inference you just pass the sequence without padding. One thing to make sure of: for text generation, pad on the left, since models can have a hard time starting generation after a pad token. But some models use right-side padding, like CodeLlama. The padding side is the one thing I wish I had known earlier. See the sketch below.
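For a concrete picture, here is a minimal sketch using the Hugging Face `transformers` tokenizer API (the model name and prompts are just placeholders for illustration): left padding for generation, dynamic padding to the longest prompt in the batch, and an accelerator-friendly variant that rounds up to a multiple of 8.

```python
from transformers import AutoTokenizer

# Placeholder model; any causal LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

prompts = ["Hello!", "Write a short poem about the sea."]

# Dynamic padding: pad only up to the longest prompt in this batch.
batch = tokenizer(prompts, padding=True, return_tensors="pt")

# Accelerator-friendly variant: round the padded length up to a multiple of 8.
batch_mult8 = tokenizer(prompts, padding=True, pad_to_multiple_of=8,
                        return_tensors="pt")

print(batch["input_ids"].shape)        # (2, longest_prompt_len)
print(batch_mult8["input_ids"].shape)  # (2, longest_prompt_len rounded up to 8)
```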

Thanks for your reply. Let’s say in a chat server, such as GPT or HuggingChat: do they always pad to the max length, or just pass the input without padding? I ask because I want to understand efficient inference down to the hardware level. Is it difficult to design the system to dynamically detect each request’s input token length and pass it through without padding?

By default it passes the sequence without padding. I am not sure about GPT, but in general, if you are running inference on a single sequence, it does not pad. It can be different for devices that compile the model with XLA; in that case you need to pad so that the model does not have to recompile for every new input shape.
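As a rough sketch of the difference (same `transformers` API as above; the fixed length of 128 is just an assumed bucket size, not anything the libraries require): single-sequence eager inference can pass the true length through, while an XLA/compiled setup would pad every request to a fixed shape so the compiled graph can be reused.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

prompt = "What is the capital of France?"

# Single-sequence, eager execution: no padding, the model sees the true length.
dynamic = tokenizer(prompt, return_tensors="pt")

# XLA/compiled setup: pad every request to a fixed length (128 is an assumed
# bucket size) so the input shape, and hence the compiled graph, never changes.
static = tokenizer(prompt, padding="max_length", max_length=128,
                   return_tensors="pt")

print(dynamic["input_ids"].shape)  # (1, actual_prompt_len)
print(static["input_ids"].shape)   # (1, 128)
```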

Thanks again! That makes sense. I’m not familiar with XLA, but if it is similar to cuBLAS etc., then I think there is a trade-off between the compile time and the extra padding computation time, right? Is this true for GPUs? If we pass inputs without padding, do we need to recompile every time the input length is different?

By the way, what do you mean by a single sequence? Does it mean we only feed one question into the model and run inference on it? In other words, for consecutive questions, do we need to set a fixed length?

I am not sure about GPUs: when I use batching I pad to the maximum sequence length, and for single-sequence inference I just leave the input unchanged. For TPUs it is specifically mentioned in the introduction, so that’s how I know.

Thanks! Could I ask why batching should pad to the maximum sequence length?

Because a batch has to be one rectangular tensor, every sequence in it must end up the same length. If we don’t pad, the batch gets truncated to the shortest sequence, which cuts off the rest of every sequence longer than the shortest one. That loses input from those sequences and produces unexpected output. See the sketch below.
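A small sketch of what this looks like with the `transformers` tokenizer when you try to batch unequal-length prompts without padding (the exact error message may vary by version): the batch cannot be stacked into a single tensor, while padding to the longest sequence keeps every token of the longer prompt and marks the pad positions in the attention mask.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

prompts = ["Hi", "Explain the attention mechanism in one paragraph."]

try:
    # Unequal lengths with no padding cannot be stacked into one tensor.
    tokenizer(prompts, return_tensors="pt")
except ValueError as e:
    print("Batching without padding failed:", e)

# Padding to the longest sequence keeps every token of the longer prompt.
batch = tokenizer(prompts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, len_of_longest_prompt)
print(batch["attention_mask"])   # zeros mark the padded positions
```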

Oh, I understand, thanks!

These are the sources I learned from; they might help you.