As I understand it, during inference, when I input “The quick brown fox,” the model predicts the next word after “The,” then after “The quick,” and so on. Why does it predict tokens that are already in the input? Why doesn’t it start predicting directly after the entire input, i.e. after “The quick brown fox”? If the model predicts a word like “tree” after “The quick brown,” do we continue with “The quick brown tree”? If not, why do we spend computational resources on these predictions?
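To make my confusion concrete, here is how I picture greedy decoding, using a toy stand-in for a causal LM (the fake `forward` function, the vocabulary size, and the shapes are all my own assumptions, not a real model). In this picture, one forward pass produces logits at *every* position, yet only the last row is ever used to pick the next token:

```python
import numpy as np

VOCAB = 10  # toy vocabulary size (my assumption)

def forward(tokens):
    """Toy stand-in for a causal LM: one forward pass returns logits
    for EVERY position -- row i is a prediction for the token that
    follows tokens[:i+1]. Here the 'prediction' is just token+1 so
    the example is deterministic."""
    return np.stack([np.roll(np.eye(VOCAB)[t], 1) for t in tokens])

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        logits = forward(tokens)                 # shape (len(tokens), VOCAB)
        next_token = int(np.argmax(logits[-1]))  # only the LAST row is used
        tokens.append(next_token)                # earlier rows are ignored
    return tokens

print(generate([3, 1, 4], 2))  # → [3, 1, 4, 5, 6]
```

So in this sketch the model appends only `argmax(logits[-1])` and the predictions at earlier positions (like “tree” after “The quick brown”) never change the input. Is that the right picture, and if so, why are those earlier rows computed at all?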
I’m really struggling with this question, and your help would be greatly appreciated!