Problems with understanding instruction fine-tuning

I’m trying to read up on instruction fine-tuning, but I think I have a big misunderstanding.

As I understand it, instruction datasets typically have 3 components: (a) the instruction, (b) the output/response, and (c) an optional input. Now, according to this paper: “Based on the collected IT dataset, a pretrained model can be directly fine-tuned in a fully-supervised manner, where given the instruction and the input, the model is trained by predicting each token in the output sequentially.” This makes sense to me, i.e., the response/output is the ground truth the model is expected to predict.
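For concreteness, a single record in such a dataset might look like this (a made-up example of my own, in the Alpaca-style three-field format):

```python
# Made-up example of one instruction-tuning record with the three components.
record = {
    "instruction": "Translate the sentence to French.",   # (a) the instruction
    "input": "The weather is nice today.",                # (c) optional input
    "output": "Il fait beau aujourd'hui.",                # (b) the response (ground truth)
}

# The paper's description: given (a) + (c), the model is trained to predict
# the tokens of (b) one at a time.
assert set(record) == {"instruction", "input", "output"}
```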

However, when I check many tutorials (e.g., this tutorial notebook), it seems that instructions, inputs, and outputs are all combined into a single text sample, and I can’t tell from the notebook how the training then works. What, then, is the ground truth for the supervised training? Or is this now treated as a next-word-prediction task?
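To make my confusion concrete, here is how I imagine the combined sample being built, and the two ways I could see the labels being constructed. The template and names (`PROMPT_TEMPLATE`, `IGNORE_INDEX`, `build_example`) are my own guesses, not taken from the notebook:

```python
# Hypothetical sketch: packing an instruction-tuning record into ONE token
# sequence, and two possible choices of "ground truth" labels.
# All names/templates here are my own assumptions, not from the notebook.

IGNORE_INDEX = -100  # value commonly ignored by cross-entropy losses

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_example(instruction, inp, output, tokenize):
    """Concatenate prompt + output into one sequence and build labels."""
    prompt_ids = tokenize(PROMPT_TEMPLATE.format(instruction=instruction, input=inp))
    output_ids = tokenize(output)
    input_ids = prompt_ids + output_ids

    # Option 1: plain next-word prediction over the WHOLE sequence.
    # Labels are a copy of input_ids; the one-position shift is applied
    # inside the loss computation.
    labels_lm = list(input_ids)

    # Option 2: mask the prompt so only the response tokens contribute
    # to the loss -- i.e., the output is the only "ground truth".
    labels_masked = [IGNORE_INDEX] * len(prompt_ids) + list(output_ids)

    return input_ids, labels_lm, labels_masked

# Toy whitespace "tokenizer" just to keep the sketch self-contained.
toy_tokenize = lambda s: s.split()

ids, lm, masked = build_example("Add the numbers.", "2 3", "5", toy_tokenize)
assert lm == ids                              # option 1: everything is a target
assert masked[-1] == "5"                      # option 2: only the response...
assert all(x == IGNORE_INDEX for x in masked[:-1])  # ...the prompt is ignored
```

So my question boils down to: does the tutorial use option 1 (pure next-word prediction over the concatenated text) or option 2 (loss only on the response)?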

What am I missing? Or are these indeed two different approaches to instruction tuning? Sorry if these are stupid questions!