Supervised Fine-tuning Trainer - where is the 'supervised' part?


I have recently been reviewing the Supervised Fine-tuning Trainer page, and in the Quickstart section they mention (supervised) fine-tuning on the imdb dataset (the 'text' field), which contains movie reviews. As the model they use AutoModelForCausalLM.from_pretrained("facebook/opt-350m"). In this case, what exactly does it mean to fine-tune this model in a supervised fashion? As far as I know, causal language modeling is based on next-token prediction, so when fine-tuning with the imdb dataset, do we simply continue the base model's training, predicting the next word from the imdb 'text' field input?
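If I understand correctly, the "supervision" here is just the next token itself: the labels are the input ids shifted by one position, so each prefix of the review is trained to predict the token that follows it. A minimal pure-Python sketch of how those (context, target) pairs arise from a tokenized review (the token ids below are made up for illustration, not from a real tokenizer):

```python
# Sketch: in causal LM fine-tuning, the "labels" are the input ids shifted by one,
# so every prefix of the sequence is supervised to predict the next token.
def next_token_pairs(input_ids):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    pairs = []
    for i in range(1, len(input_ids)):
        context = input_ids[:i]   # tokens seen so far
        target = input_ids[i]     # the token the model must predict
        pairs.append((context, target))
    return pairs

# Illustrative ids, e.g. "<s> this movie was great </s>"
ids = [101, 2023, 3185, 2001, 2307, 102]
for context, target in next_token_pairs(ids):
    print(context, "->", target)
```

In practice the trainer computes all of these predictions in one forward pass with a cross-entropy loss over the shifted labels, but the training signal is exactly these pairs.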

The same question applies to the 'Format your input prompts' section (i.e. instruction tuning) on the same page. In this case, when using a typical autoregressive, decoder-based model, do we continue its training by providing the properly formatted 'instruction-response' text (text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"), where the model's role is again to predict the next word/token from the text provided? So is the difference between the first and the second case simply the format of the text input?
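For reference, the formatting function that section describes can be sketched in pure Python like this; the field names 'question' and 'answer' come from the quoted example, and the function simply maps a batch of examples to the formatted strings that are then tokenized and trained on with the same next-token objective:

```python
def formatting_prompts_func(example):
    """Turn a batch of question/answer pairs into
    '### Question: ...\n ### Answer: ...' training strings."""
    output_texts = []
    for i in range(len(example["question"])):
        text = f"### Question: {example['question'][i]}\n ### Answer: {example['answer'][i]}"
        output_texts.append(text)
    return output_texts

batch = {"question": ["What is 2+2?"], "answer": ["4"]}
print(formatting_prompts_func(batch)[0])
```

So as far as I can tell, both cases optimize the same next-token loss; the formatting function only changes what text the model sees.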