Right, language models are trained like that – a training example is fed into the model and each token in the sequence is trying to predict the token that comes next. For example, if the training example is:
<s> The quick brown fox jumps over the lazy dog.</s>
The logits output for the `<s>` token are used to predict the word `The`, the logits for `The` are used to predict the word `quick`, and so on. But unlike generating a sequence at inference time, this all happens with just one pass through the model. Every token in the training sequence is a classification problem to predict what should come next, and each of these classification problems is run in parallel.
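To make the parallel-classification idea concrete, here’s a minimal sketch in PyTorch. Everything in it is made up for illustration – a toy embedding plus linear head standing in for a real decoder, and invented token ids – but it shows how one forward pass gives a next-token prediction at every position, with the targets being the input ids shifted left by one:

```python
import torch
import torch.nn.functional as F

vocab_size = 32

# Stand-in for a real decoder: an embedding plus a linear head is enough to show the shapes.
embed = torch.nn.Embedding(vocab_size, 16)
head = torch.nn.Linear(16, vocab_size)

# Made-up ids for "<s> The quick brown fox jumps over the lazy dog . </s>"
input_ids = torch.tensor([[0, 5, 6, 7, 8, 9, 10, 14, 11, 12, 13, 1]])

logits = head(embed(input_ids))   # one pass -> shape [1, seq_len, vocab_size]

# Position t is a classifier for the token at position t + 1,
# so the targets are just the input ids shifted left by one.
preds = logits[:, :-1, :]         # the last position has nothing left to predict
targets = input_ids[:, 1:]

loss = F.cross_entropy(preds.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                # one loss scores every position in parallel
```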
In contrast, when you do inference, the prompt tokens you feed in are processed with one pass through the model, but each generated token after that point is created one by one, by sampling from the logits produced by the token that came right before it.
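A rough sketch of that loop (sampled decoding, assuming `model` is any callable that maps token ids to logits – e.g. the toy stand-in above):

```python
import torch

def generate(model, input_ids, max_new_tokens=5):
    # The prompt goes through in one pass; after that, each new token is
    # sampled from the logits at the last position, one step at a time.
    for _ in range(max_new_tokens):
        logits = model(input_ids)                          # [1, seq_len, vocab_size]
        probs = logits[:, -1, :].softmax(dim=-1)           # only the last position matters now
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token -> [1, 1]
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```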
So in practice, is that the reason we need the attention mask – to skip generating token 6 (to save computation), since we already have token 6?
The attention mask is actually there to make it so that, inside the model, the hidden states for token 6 only attend to the tokens that came before it, not after. So if at inference time you fed `<s> The quick brown fox jumps` into the model, the attention mask ensures `The` only attends to `<s>`, `quick` only attends to `<s> The`, `fox` only attends to `<s> The quick brown` (not `jumps`), etc. The way attention works is that each token should only see what came before it, and the mask ensures this. The way you can think of it is that if a token is the present, the previous tokens are the past and it attends to those (plus itself). But it’s not allowed to see the future – it’s just trying to predict the future.
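If it helps to see it, here’s a small toy sketch of that causal mask for `<s> The quick brown fox jumps`, with random queries and keys just to show which positions end up with zero attention weight:

```python
import torch

tokens = ["<s>", "The", "quick", "brown", "fox", "jumps"]
seq_len = len(tokens)

# Causal mask: position i may attend to positions <= i (the past plus itself), never the future.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Toy attention scores, only to show where the mask bites.
d = 8
q, k = torch.randn(seq_len, d), torch.randn(seq_len, d)
scores = (q @ k.T) / d**0.5
scores = scores.masked_fill(~mask, float("-inf"))  # masked-out positions get zero weight after softmax
weights = scores.softmax(dim=-1)

# The row for "fox" has non-zero weight on <s> The quick brown (and fox itself), exactly zero on jumps.
print(dict(zip(tokens, weights[tokens.index("fox")].tolist())))
```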
So going back to your example – while we can ignore the logits output for token 5 (because we already have token 6), we can’t actually ignore token 6 itself. It still needs to be fed into the model and processed by having it attend to the tokens that came before it. This will be a very hand-wavy explanation, but in order for the model to predict token 12, it does need an internal “representation” of the full sequence that came before, including token 6 and the role token 6 plays within that sequence.
Technically, not all architectures work this way: a lot of models (namely encoder models) have bidirectional attention, where each token can see both the previous tokens and the subsequent tokens. But the vast majority of language models aren’t like this.
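For contrast, and purely as an illustrative sketch, the bidirectional (encoder-style) case is just the mask without the triangle:

```python
import torch

seq_len = 6

# Decoder-style (causal): each position sees itself and earlier positions only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder-style (bidirectional): every position sees every other position, past and future.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
```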