Hi, I'm using a decoder-only model (e.g. Llama) for auto-regressive generation. The `inputs_embeds` of different instances have different lengths, so I have to pad them to a common length within each batch. How can I use something like `attention_mask` to tell the model the real (non-padded) length of each input, the way it works for `input_ids`? I can't pass `input_ids` directly because my inputs are soft prompts projected from other modalities.
A straightforward workaround is to loop over the mini-batch and feed each instance's non-padded embeddings one by one, but that seems very inefficient. Is there a better way to do it? Many thanks!
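To be concrete, here is a minimal sketch of the batched setup I have in mind (the hidden size, sequence lengths, and random embeddings are placeholders; I'm assuming the mask is interpreted the same way for `inputs_embeds` as for `input_ids`). I left-pad, since that's the usual convention for decoder-only generation:

```python
import torch

# Hypothetical variable-length soft prompts (e.g. projected from another
# modality): each tensor is (seq_len_i, hidden_size).
hidden_size = 8
embeds = [torch.randn(n, hidden_size) for n in (3, 5, 2)]

max_len = max(e.shape[0] for e in embeds)

# Left-pad each sequence with zero vectors so the real inputs sit at the
# end, and build a matching mask: 1 = real position, 0 = padding.
padded, mask = [], []
for e in embeds:
    pad = max_len - e.shape[0]
    padded.append(torch.cat([e.new_zeros(pad, hidden_size), e], dim=0))
    mask.append(torch.cat([torch.zeros(pad, dtype=torch.long),
                           torch.ones(e.shape[0], dtype=torch.long)]))

inputs_embeds = torch.stack(padded)   # (batch, max_len, hidden_size)
attention_mask = torch.stack(mask)    # (batch, max_len)

# The idea would then be to pass both tensors in one call, e.g.:
# out = model.generate(inputs_embeds=inputs_embeds,
#                      attention_mask=attention_mask,
#                      max_new_tokens=20)
```

Is this the right approach, or does padding the embeddings with zeros cause problems even when the mask marks those positions as padding?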