Newbie Understanding GPT2 loss

I am stuck trying to understand the GPT2 loss. I want to give the model labels that contain exactly the target it will generate, so that I can see the loss come out as zero.

I have an input text:
input_text = "Welcome to New York"
The model currently predicts the next word as "City".
If I give the label as input_text, the loss will never be zero. How do I simulate giving the label as "Welcome to New York City", so that the internal neural net (irrespective of the model) gives a loss of zero, or near to that?

To explain what I mean, here is the snippet.

Note - I have read in the forum and in the documentation that the labels can be the same as the input text, that the model shifts the labels internally, and that the loss is not calculated for the last token. But then the loss should still become zero, which it does not.

"Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids…"
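
From reading around, I believe that shifting looks roughly like this inside the model (just a sketch of the idea with dummy tensors, not the exact transformers code):

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 1, 5, 50257          # dummy shapes for illustration
logits = torch.randn(batch, seq_len, vocab)  # what the model would return
labels = torch.randint(0, vocab, (batch, seq_len))

# position i of the logits predicts token i+1, so the last logit position
# and the first label are dropped before the cross-entropy is taken
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = CrossEntropyLoss()(shift_logits.view(-1, vocab), shift_labels.view(-1))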

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name, model_max_length=1024, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token # == <|endoftext|> = 50256
model = GPT2LMHeadModel.from_pretrained(model_name)

batch_size = 5  # used as max_length in the tokenizer calls below
input_text  = "<|endoftext|> Welcome to New York"
target_text = "Welcome to New York City"

# encode the inputs
encoding = tokenizer(input_text, padding=True, max_length=batch_size, truncation=True, return_tensors="pt")
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets
target_encoding = tokenizer(target_text, padding=True, max_length=batch_size, truncation=True, return_tensors="pt")
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100 so it's ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100  # in our case there is no padding
print(f"input_ids={input_ids}")
print(f"attention_mask={attention_mask}") # all ones
print(f"labels ={labels}")
# forward pass
outputs = model(input_ids=input_ids, labels=labels)
print(f"Model Loss {outputs.loss}")
# Test the model to check what it predicts next
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=1)
answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(f"Result '{answer}'")

Output

input_ids=tensor([[50256, 19134,   284,   968,  1971]]) # not sure what the eos token (50256) at the start does to the model
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]]) # 2254 = City, which is what the model should predict
Model Loss 8.248174667358398
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result '<|endoftext|> Welcome to New York City'
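
To see what that loss of ~8.25 is made of, I think the per-token losses can be recomputed from the logits like this (a sketch reusing model, input_ids and labels from the snippet above; reduction="none" keeps one value per predicted position):

import torch
from torch.nn import CrossEntropyLoss

with torch.no_grad():
    logits = model(input_ids=input_ids).logits   # [1, 5, vocab]

shift_logits = logits[..., :-1, :]               # predictions for positions 1..4
shift_labels = labels[..., 1:]                   # labels for positions 1..4
per_token = CrossEntropyLoss(reduction="none")(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(per_token)         # loss per predicted token
print(per_token.mean())  # should match outputs.loss (no padding here)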

When I try the standard setup that is used everywhere,

input_text  = "Welcome to New York"
target_text = input_text

I get a loss of about 3.26

input_ids=tensor([[14618,   284,   968,  1971]]) # 1971 = York
attention_mask=tensor([[1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971]])
Model Loss 3.2614505290985107
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'

Is it that

outputs = model(input_ids=input_ids, labels=labels)

is generating more than one token?
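
To check that, I think one can print the model's top guess after every prefix of the sequence (a sketch reusing model, tokenizer and input_ids from the second run above):

import torch

with torch.no_grad():
    logits = model(input_ids=input_ids).logits   # [1, 4, vocab]

top_ids = logits.argmax(dim=-1)[0]               # greedy guess after each prefix
for pos, tok_id in enumerate(top_ids):
    prefix = tokenizer.decode(input_ids[0, : pos + 1])
    guess = tokenizer.decode([int(tok_id)])
    print(f"after {prefix!r} the model predicts {guess!r}")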

Answered here - pytorch - Huggingface GPT2 loss understanding - Stack Overflow