How to pad correctly for text generation with GPT-Neo

In order to generate text sequences with GPT-Neo, I first load all the relevant components for sequence generation with GPTNeoForCausalLM.

from transformers import AutoTokenizer, GPTNeoForCausalLM
import torch
from torch.nn import functional as F


tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
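
As a side note, this checkpoint does not come with a padding token configured, which is why the second approach below has to assign one; a quick check:

print(tokenizer.pad_token)     # None by default for this tokenizer
print(tokenizer.eos_token_id)  # 50256, the <|endoftext|> token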

There are two ways I can build the input_ids and attention_mask.

  1. I take the standard approach without padding:
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
  2. I use padding instead:
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token, so reuse <|endoftext|>
tokenizer.padding_side = 'left'            # pad on the left for decoder-only generation
tokenizer.truncation_side = 'left'
no_items_for_history = 30                  # total (padded) prompt length

inputs = tokenizer.encode_plus("Hello, my dog is cute", max_length=no_items_for_history, padding='max_length', truncation=True, return_tensors="pt")
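
To double-check that left padding actually took effect, I print the encoding (24 pad tokens in front of the 6 real tokens):

print(inputs['input_ids'].shape)        # torch.Size([1, 30])
print(inputs['attention_mask'][0, :5])  # tensor([0, 0, 0, 0, 0]) -- zeros on the left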

Then, for both approaches, I loop iteratively to generate the sequence one token at a time.

input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']


for i in range(10):
    if i == 0:
        # First step: run the full (padded) prompt through the model.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=inputs["input_ids"])
    else:
        # Subsequent steps: feed only the newly sampled token plus the cached past_key_values.
        outputs = model(input_ids=new_input_ids, attention_mask=attention_mask, past_key_values=past_key_values)
    loss = outputs.loss                # only set in the first iteration; not used below
    logits = outputs.logits[:, -1, :]  # logits for the next-token position

    # Turn the logits into probabilities and sample from the top 5 candidates.
    probs = F.softmax(logits, dim=-1)
    topk_values, topk_indices = torch.topk(probs, 5)
    inputs_in_topk = torch.multinomial(topk_values, num_samples=1, replacement=True)
    new_input_ids = torch.gather(topk_indices, 1, inputs_in_topk)

    # Cache the key/value states, extend the attention mask, and append the new token.
    past_key_values = outputs.past_key_values
    attention_mask = torch.concat(
        (attention_mask, torch.ones((1, 1), dtype=attention_mask.dtype, device=attention_mask.device)), dim=1
    )
    input_ids = torch.concat((input_ids, new_input_ids), dim=1)


print(tokenizer.decode(input_ids.tolist()[0], skip_special_tokens=True))
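
For comparison, roughly the same top-k sampling can also be done with the built-in generate helper (top_k=5 and max_new_tokens=10 are meant to mirror the manual loop above, although the exact sampling details may differ):

generated = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=True,                       # sample instead of greedy decoding
    top_k=5,                              # restrict sampling to the 5 most likely tokens
    max_new_tokens=10,                    # same number of generated tokens as the loop
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))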

Here is the problem:

The starting input_ids and attention_mask for the first approach look like:

input_ids = tensor([[15496,    11,   616,  3290,   318, 13779]])
attention_mask = tensor([[1, 1, 1, 1, 1, 1]])

The output looks very sensible:

Hello, my dog is cute! This post is about dogs and cats

However, for the second approach, the starting input_ids and attention_mask look like:

input_ids = tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 15496,    11,   616,  3290,   318, 13779]])
attention_mask = tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])

and it always generates nonsense like:

Hello, my dog is cute pet is my pet pet pet is my dog is

Question: Do you know how to make it work with padding, i.e., the second approach?
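
For context, my current suspicion is that the position ids are derived from the sequence length, so the 24 pad tokens shift the positions of the real tokens once I manage the cache myself. A minimal, untested sketch of the kind of change I have in mind (explicitly deriving position_ids from the attention mask and passing them to the model):

# untested sketch: build position ids that skip over the left padding
position_ids = inputs['attention_mask'].long().cumsum(-1) - 1
position_ids.masked_fill_(inputs['attention_mask'] == 0, 1)  # dummy value for pad positions

outputs = model(input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                position_ids=position_ids)
# in the following steps, only the position of the newly sampled token would be passed,
# e.g. position_ids[:, -1:] + step, together with past_key_values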