I was reading the GPT2 source code and related documentation about how past_key_values
can be used to speed up autoregressive generation, e.g., here and here.
This all makes sense when generating the first max_position_embeddings
(1024 for GPT2) tokens, but I think it would run into a problem on the 1025th token and beyond. Once we exceed the maximum context length of the model, we have to shift out old tokens and shift the position ids of the remaining tokens in the context window. Those tokens then receive different position embedding vectors, which makes their previously computed attention keys/values invalid. So it would be incorrect to use the caching mechanism beyond the maximum context length for a model with learned absolute position encodings. The same applies to evaluating log-likelihood/perplexity with a sliding window (which gives a more accurate estimate than non-overlapping windows): every time the window slides, the cached keys/values of the tokens that remain in it no longer match their new positions.
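
To make the concern concrete, here is a toy sketch in plain Python. It is not GPT2 itself: the sizes, the random tables, and the helper names (key_vector, keys_for_window) are all made up for illustration, and it models only a single key projection applied to "token embedding + learned position embedding", with no layer norm and no deeper layers. Under those assumptions it shows that the key cached for a token at one position differs from the key the model would compute for the same token after the window has slid by one.

```python
import random

random.seed(0)

D = 4        # toy hidden size
MAX_POS = 8  # stands in for max_position_embeddings (1024 for GPT2)
VOCAB = 16   # toy vocabulary size


def rand_matrix(rows, cols):
    """Random matrix as a list of rows (placeholder for learned weights)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


tok_emb = rand_matrix(VOCAB, D)    # token embedding table
pos_emb = rand_matrix(MAX_POS, D)  # learned absolute position embedding table
w_k = rand_matrix(D, D)            # key projection of one attention layer


def key_vector(token_id, position):
    """Key for one token: w_k applied to (token embedding + position embedding)."""
    hidden = [t + p for t, p in zip(tok_emb[token_id], pos_emb[position])]
    return [sum(w * x for w, x in zip(row, hidden)) for row in w_k]


def keys_for_window(token_ids):
    """Recompute keys from scratch, assigning positions 0..len-1."""
    return [key_vector(tok, pos) for pos, tok in enumerate(token_ids)]


# One more token than the context window can hold.
tokens = [random.randrange(VOCAB) for _ in range(MAX_POS + 1)]

# Keys that would sit in the cache after processing the first MAX_POS tokens.
cached = keys_for_window(tokens[:MAX_POS])

# The next token arrives: drop the oldest token, so every remaining token
# moves to a new position id and is paired with a different position embedding.
recomputed = keys_for_window(tokens[1:MAX_POS + 1])

# tokens[1] was cached at position 1 but now sits at position 0.
stale = cached[1]
fresh = recomputed[0]
max_diff = max(abs(a - b) for a, b in zip(stale, fresh))
print(f"max |cached - recomputed| for the same token: {max_diff:.4f}")
```

In this toy setup the difference is exactly the key projection applied to the difference between the two position embeddings, so it is nonzero whenever those embeddings differ. In a real multi-layer model the mismatch would presumably be larger, since keys in deeper layers also depend on everything else in the window.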