The notebook creates examples in which the input and the labels contain the same text. What is the purpose of such a model? Is it training on some autoencoder task? I would have thought a more interesting challenge would be: given an input sample of text, have the label be the continuation of that text.
As mentioned in the notebooks, the task is causal language modeling, i.e. predicting the next word. They also say explicitly:
First note that we duplicate the inputs for our labels. This is because the model of the Transformers library applies the shift to the right, so we don’t need to do it manually.
That is why you see the same labels as the inputs.
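To make the shifting concrete, here is a minimal sketch in plain PyTorch of roughly what happens inside the model when `labels` is just a copy of `input_ids` (this mirrors the idea, not the library's exact code):

```python
import torch
import torch.nn.functional as F

# Toy batch: the labels are simply a copy of the input ids.
input_ids = torch.tensor([[5, 17, 42, 8]])   # (batch, seq_len)
labels = input_ids.clone()
logits = torch.randn(1, 4, 100)              # pretend model output, (batch, seq_len, vocab)

# The shift to the right: logits at position t are scored against
# the label at position t + 1, i.e. "predict the next token".
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(loss)
```

So even though the inputs and labels are identical tensors, the objective the model actually optimizes is next-token prediction, which is exactly the "predict the continuation" task you describe.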
I am not sure what you mean by “switch the attention set”. If you mean the attention mask: yes, the model applies a causal attention mask to hide future tokens (otherwise you would see a perplexity close to 1 at the end of training).
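For completeness, the hiding of future tokens is done with a causal (lower-triangular) mask on the attention scores, along these lines (a simplified sketch, not the library's actual implementation):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # raw attention scores for one head

# Position t is only allowed to attend to positions <= t.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

attn = scores.softmax(dim=-1)            # future positions end up with weight 0
print(attn)
```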