Llama position_ids

I’m trying to understand the Llama position_ids.

If I call the LlamaForCausalLM forward directly, providing input_ids and attention_mask, position_ids will be generated as an arange from 0 to the sequence length, with a batch dimension of 1 regardless of the input batch size:

past_seen_tokens = 0
...
cache_position = torch.arange(
    past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
)
...
if position_ids is None:
    position_ids = cache_position.unsqueeze(0)
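
For concreteness, here is a standalone sketch of that default path with toy values (plain PyTorch, no actual model call; the shapes and the left-padded batch are hypothetical):

import torch

# Toy left-padded batch of 2, sequence length 5, hidden size 8 (only the shapes matter)
inputs_embeds = torch.zeros(2, 5, 8)
past_seen_tokens = 0  # prefill: nothing cached yet

cache_position = torch.arange(
    past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
)
position_ids = cache_position.unsqueeze(0)
print(position_ids)  # tensor([[0, 1, 2, 3, 4]]) -- a single arange row, broadcast over the batch, padding ignored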

On the other hand, if I call the generate function, providing input_ids and attention_mask, it will call prepare_inputs_for_generation. This creates a batched position_ids from the attention_mask, with the value 1 wherever attention_mask == 0:

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
    position_ids = position_ids[:, -input_ids.shape[1] :]
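
To see what this computation produces, here is a standalone sketch with toy values (no model and no real cache; the decode step below just mimics generate appending one generated token, which grows the mask by one column and shrinks input_ids to the new token only):

import torch

# Prefill step: full left-padded prompt, batch of 2 (toy values)
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])

# Hypothetical decode step: the mask has one extra column and input_ids holds only the new token per row
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 1]])
input_ids = torch.ones((2, 1), dtype=torch.long)  # dummy ids, only the shape matters

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = position_ids[:, -input_ids.shape[1]:]  # keep only the new token's position
print(position_ids)
# tensor([[3],
#         [5]])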

Code Links:

The reason the position_ids tensor is computed from the attention_mask in generate is that batch generation requires left padding (or at least is much simpler with it).

So let’s say you are preparing inputs for calling generate through your tokenizer: setting padding_side="left" on the tokenizer ensures the prompts are padded at the beginning (as opposed to the end).
attention_mask will then look something like [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]], which in turn gives [[-1, -1, 0, 1, 2], [0, 1, 2, 3, 4]] for the position_ids.

Thanks @VictorSanh.

My question was more about why the two code paths differ in how they generate position_ids. Both code paths have access to the batched input and the attention_mask, yet they produce different position_ids.

I agree that the version in prepare_inputs_for_generation makes more sense.

Minor nit: the masked_fill_ inserts 1, not -1. Not sure it matters, but your example would be [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]].
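
A quick standalone check with the toy mask from the example above confirms this; the -1s produced by the cumsum exist only until masked_fill_ overwrites them:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

position_ids = attention_mask.long().cumsum(-1) - 1
print(position_ids)  # tensor([[-1, -1, 0, 1, 2], [0, 1, 2, 3, 4]]) -- before masked_fill_

position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)  # tensor([[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]) -- padded slots filled with 1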