I am trying to compute the probability of a sequence of input text under a causal LM by running the text through a single forward pass and then mapping each input token to the corresponding output logit position. However, I am unsure how the output logits are aligned with the input tokens. Do the output logits at position 0 correspond to the probability of generating the first token given no context, or to the probability of generating the second token given the first input token as context?
In other words, if I want to map the output logits to the probability of generating the token at index i given all the previous input tokens, should I be looking at the output logits at position i, or at position i - 1?
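For concreteness, here is a minimal sketch of my setup, with both candidate alignments written out. I'm assuming a Hugging Face causal LM here purely for illustration; "gpt2", the sample text, and the variable names are my own stand-ins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just an example model; any causal LM checkpoint would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]  # (1, seq_len)
seq_len = input_ids.size(1)

with torch.no_grad():
    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

log_probs = torch.log_softmax(logits, dim=-1)

# Interpretation A: the logits at position i score the token at position i,
# i.e. logits[0] would be P(first token | no context).
scores_a = log_probs[0, torch.arange(seq_len), input_ids[0]]

# Interpretation B: the logits at position i score the token at position
# i + 1, so the token at index i is scored by the logits at position i - 1.
scores_b = log_probs[0, torch.arange(seq_len - 1), input_ids[0, 1:]]
```

Which of `scores_a` or `scores_b` gives the correct per-token probabilities to sum into a sequence log-probability?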