I am using a pre-trained DialoGPT model (which is based on GPT-2) to generate text from input sentences.
If my understanding is correct, the masked (causal) self-attention mechanism should make it irrelevant whether I feed in the whole sentence or only a prefix of it starting from the beginning: the output for the prefix should match the corresponding positions of the full-sentence output, just truncated.
However, while the outputs are almost identical, many elements of the output logit tensors differ by values on the order of 1e-8. With `output_attentions=True`, the attention scores are slightly different too. I generate the output like this:
```python
sentence = "I wonder what the model will do with this input sentence."
enc_sentence = tokenizer.encode(sentence, return_tensors="pt")
chatbot_input_ids = chatbot_model.prepare_inputs_for_generation(enc_sentence)
chatbot_output = chatbot_model(**chatbot_input_ids)
logits = chatbot_output.logits
```
When I use `enc_sentence[:, :5]` as input instead of `enc_sentence`, the resulting logits differ from `logits[:, :5, :]` of the full-sentence run. What could be the reason for this?
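To show the invariance I expect without depending on the DialoGPT weights, here is a self-contained sketch using a toy single-head causal self-attention in NumPy (identity projections, random inputs; all names here are my own and not from the actual model). Masked positions get exactly zero attention weight, so the prefix rows of the output should agree with the full run up to floating-point noise:

```python
import numpy as np

def causal_self_attention(x):
    """Toy single-head causal self-attention (no learned projections)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (n, n) raw attention scores
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)      # position i sees only j <= i
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)                      # exp(-inf) == 0 exactly
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (n, d) attended outputs

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 16))

out_full = causal_self_attention(x)        # run on the full "sentence"
out_prefix = causal_self_attention(x[:5])  # run on the first 5 positions only

# Mathematically identical; any residual difference comes only from
# floating-point reduction order in the differently shaped matmuls.
print(np.max(np.abs(out_full[:5] - out_prefix)))
```

In this toy setting the difference is at or near machine precision, which is consistent with the ~1e-8 deviations I see from the real model in float32.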