Llama position_ids

I’m trying to understand the Llama position_ids.

If I call the LlamaForCausalLM forward directly, providing input_ids and attention_mask, position_ids will be generated as an arange from 0 to the sequence length, with a batch dimension of 1 regardless of the input batch size:

past_seen_tokens = 0
...
cache_position = torch.arange(
    past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
)
...
if position_ids is None:
    position_ids = cache_position.unsqueeze(0)
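
For concreteness, here is a standalone sketch of that default path with toy values (plain PyTorch, no actual model call; the shapes and the left-padded batch are hypothetical):

import torch

# Toy left-padded batch of 2, sequence length 5, hidden size 8 (only the shapes matter)
inputs_embeds = torch.zeros(2, 5, 8)
past_seen_tokens = 0  # prefill: nothing cached yet

cache_position = torch.arange(
    past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
)
position_ids = cache_position.unsqueeze(0)
print(position_ids)  # tensor([[0, 1, 2, 3, 4]]) -- a single arange row, broadcast over the batch, padding ignored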

On the other hand, if I call the generate function, providing input_ids and attention_mask, it will call prepare_inputs_for_generation. This creates a batched position_ids from the attention_mask, with the value 1 wherever attention_mask == 0:

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
    position_ids = position_ids[:, -input_ids.shape[1] :]
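
To see what this computation produces, here is a standalone sketch with toy values (no model and no real cache; the decode step below just mimics generate appending one generated token, which grows the mask by one column and shrinks input_ids to the new token only):

import torch

# Prefill step: full left-padded prompt, batch of 2 (toy values)
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])

# Hypothetical decode step: the mask has one extra column and input_ids holds only the new token per row
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 1]])
input_ids = torch.ones((2, 1), dtype=torch.long)  # dummy ids, only the shape matters

position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = position_ids[:, -input_ids.shape[1]:]  # keep only the new token's position
print(position_ids)
# tensor([[3],
#         [5]])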

Code Links:

The reason the position_ids tensor is computed from the attention_mask in generate is that batch generation requires left padding (or at least is much simpler with it).

So let’s say you are preparing inputs for calling generate through your tokenizer: setting padding_side="left" on the tokenizer ensures the prompts are padded at the beginning (as opposed to the end).
attention_mask will then look something like [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]], which in turn gives [[-1, -1, 0, 1, 2], [0, 1, 2, 3, 4]] for the position_ids.

Thanks @VictorSanh.

My question was more about why the two code paths differ in how they generate position_ids. Both code paths have access to the batched input and the attention_mask, yet they produce different position_ids.

I agree that the version in prepare_inputs_for_generation makes more sense.

Minor nit: the masked_fill_ inserts 1, not -1. Not sure it matters, but your example would be [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]].
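
A quick standalone check with the toy mask from the example above confirms this; the -1s produced by the cumsum exist only until masked_fill_ overwrites them:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

position_ids = attention_mask.long().cumsum(-1) - 1
print(position_ids)  # tensor([[-1, -1, 0, 1, 2], [0, 1, 2, 3, 4]]) -- before masked_fill_

position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)  # tensor([[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]) -- padded slots filled with 1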