How are the prompt and the answer handled during training?


I am really confused about how the model trains on a prompt and its answer. Does the model use the whole prompt as context and try to predict each token of the answer, sliding the context window forward? Or does training also involve predicting each token of the prompt itself?
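To make the two options concrete, here is a minimal sketch (no ML framework, made-up token IDs) of how the training labels would differ. The `-100` ignore marker is the convention used by PyTorch's cross-entropy loss and Hugging Face Transformers; everything else here is illustrative.

```python
# Made-up token IDs for a tokenized prompt and answer.
prompt = [101, 7592, 2129]   # e.g. tokens of the prompt
answer = [102, 2204, 999]    # e.g. tokens of the answer

sequence = prompt + answer

# Option A: plain next-token prediction over the WHOLE sequence
# (as in pretraining): every position's target is the next token,
# prompt tokens included. Labels are the sequence shifted by one.
full_labels = sequence[1:]

# Option B: loss only on the answer (prompt masking, common in
# instruction fine-tuning): positions whose target is a prompt
# token get the ignore marker, so they contribute no loss.
IGNORE = -100
masked_labels = [IGNORE] * (len(prompt) - 1) + sequence[len(prompt):]

print(full_labels)    # targets for every position
print(masked_labels)  # same targets, prompt positions masked out
```

Note that in either case there is no literal sliding window during the forward pass: a causal attention mask lets the transformer compute the prediction at every position in parallel, and the choice above only decides which positions contribute to the loss.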