How to pad correctly for text generation with GPT-Neo

In order to generate text sequences with GPT-Neo, I first load all the relevant components for sequence generation with GPTNeoForCausalLM.

from transformers import AutoTokenizer, GPTNeoForCausalLM
import torch
from torch.nn import functional as F


tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
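
As a side note, this checkpoint does not come with a padding token configured, which is why the second approach below has to assign one; a quick check:

print(tokenizer.pad_token)     # None by default for this tokenizer
print(tokenizer.eos_token_id)  # 50256, the <|endoftext|> token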

There are two ways I can build the input_ids and attention_mask.

  1. I take the standard approach without padding:
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
  2. I use padding instead:
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token, so reuse <|endoftext|>
tokenizer.padding_side = 'left'            # pad on the left for decoder-only generation
tokenizer.truncation_side = 'left'
no_items_for_history = 30                  # total (padded) prompt length

inputs = tokenizer.encode_plus("Hello, my dog is cute", max_length=no_items_for_history, padding='max_length', truncation=True, return_tensors="pt")
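
To double-check that left padding actually took effect, I print the encoding (24 pad tokens in front of the 6 real tokens):

print(inputs['input_ids'].shape)        # torch.Size([1, 30])
print(inputs['attention_mask'][0, :5])  # tensor([0, 0, 0, 0, 0]) -- zeros on the left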

Then, for both approaches, I loop iteratively to generate the sequence one token at a time.

input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']


for i in range(10):
    if i == 0:
        # First step: run the full (padded) prompt through the model.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=inputs["input_ids"])
    else:
        # Subsequent steps: feed only the newly sampled token plus the cached past_key_values.
        outputs = model(input_ids=new_input_ids, attention_mask=attention_mask, past_key_values=past_key_values)
    loss = outputs.loss                # only set in the first iteration; not used below
    logits = outputs.logits[:, -1, :]  # logits for the next-token position

    # Turn the logits into probabilities and sample from the top 5 candidates.
    probs = F.softmax(logits, dim=-1)
    topk_values, topk_indices = torch.topk(probs, 5)
    inputs_in_topk = torch.multinomial(topk_values, num_samples=1, replacement=True)
    new_input_ids = torch.gather(topk_indices, 1, inputs_in_topk)

    # Cache the key/value states, extend the attention mask, and append the new token.
    past_key_values = outputs.past_key_values
    attention_mask = torch.concat(
        (attention_mask, torch.ones((1, 1), dtype=attention_mask.dtype, device=attention_mask.device)), dim=1
    )
    input_ids = torch.concat((input_ids, new_input_ids), dim=1)


print(tokenizer.decode(input_ids.tolist()[0], skip_special_tokens=True))
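
For comparison, roughly the same top-k sampling can also be done with the built-in generate helper (top_k=5 and max_new_tokens=10 are meant to mirror the manual loop above, although the exact sampling details may differ):

generated = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=True,                       # sample instead of greedy decoding
    top_k=5,                              # restrict sampling to the 5 most likely tokens
    max_new_tokens=10,                    # same number of generated tokens as the loop
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))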

Here is the problem:

The starting input_ids and attention_mask for the first approach look like:

input_ids = tensor([[15496,    11,   616,  3290,   318, 13779]])
attention_mask = tensor([[1, 1, 1, 1, 1, 1]])

The output looks very sensible:

Hello, my dog is cute! This post is about dogs and cats

However, for the second approach, the starting input_ids and attention_mask look like:

input_ids = tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 15496,    11,   616,  3290,   318, 13779]])
attention_mask = tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])

and it always generates nonsense like:

Hello, my dog is cute pet is my pet pet pet is my dog is

Question: Do you know how to make it work with padding, i.e., the second approach?
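
For context, my current suspicion is that the position ids are derived from the sequence length, so the 24 pad tokens shift the positions of the real tokens once I manage the cache myself. A minimal, untested sketch of the kind of change I have in mind (explicitly deriving position_ids from the attention mask and passing them to the model):

# untested sketch: build position ids that skip over the left padding
position_ids = inputs['attention_mask'].long().cumsum(-1) - 1
position_ids.masked_fill_(inputs['attention_mask'] == 0, 1)  # dummy value for pad positions

outputs = model(input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                position_ids=position_ids)
# in the following steps, only the position of the newly sampled token would be passed,
# e.g. position_ids[:, -1:] + step, together with past_key_values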