I want to fine-tune GPT2 for text generation with batch input, and I use the following code to format the batch input:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(r'E:\pythonWork\models\gpt2')
max_length = 8
datas = [
    "The dog.",
    "The cute dog.",
]
model_input = tokenizer(datas)
print('original input:\n', model_input)

# prepare for batch input
# I add a bos token at the start and an eos token at the end, and pad on the right so the sentences have the
# same length. bos_token_id = eos_token_id = 50256, and there is no dedicated pad token, so I also use 50256 as the pad token.
labels_list = []
for i in range(len(datas)):
    input_ids = [tokenizer.bos_token_id] + model_input['input_ids'][i] + [tokenizer.eos_token_id]  # add bos and eos tokens
    input_ids = input_ids + max(0, max_length - len(input_ids)) * [tokenizer.eos_token_id]  # add padding tokens
    attention_mask = [1] + model_input['attention_mask'][i] + [1]  # attend to the bos and eos tokens
    attention_mask = attention_mask + max(0, max_length - len(attention_mask)) * [0]  # don't attend to padding tokens
    labels = [tokenizer.bos_token_id] + model_input['input_ids'][i] + [tokenizer.eos_token_id]  # take loss on bos and eos
    labels = labels + max(0, max_length - len(labels)) * [-100]  # padding doesn't take loss
    model_input['input_ids'][i] = input_ids
    model_input['attention_mask'][i] = attention_mask
    labels_list.append(labels)
model_input['labels'] = labels_list
print('batch input:\n', model_input)
```
Terminal output:
```
original input:
{'input_ids': [[464, 3290, 13], [464, 13779, 3290, 13]],
'attention_mask': [[1, 1, 1], [1, 1, 1, 1]]}
batch input:
{'input_ids': [[50256, 464, 3290, 13, 50256, 50256, 50256, 50256], [50256, 464, 13779, 3290, 13, 50256, 50256, 50256]],
'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0]],
'labels': [[50256, 464, 3290, 13, 50256, -100, -100, -100], [50256, 464, 13779, 3290, 13, 50256, -100, -100]]}
```
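For completeness, this is roughly how I plan to feed the batch to the model for one training step (a minimal sketch; the optimizer choice and learning rate are placeholders and not part of the code above):

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(r'E:\pythonWork\models\gpt2')
optimizer = AdamW(model.parameters(), lr=5e-5)

# convert the padded lists into tensors of shape (batch_size, max_length)
batch = {k: torch.tensor(v) for k, v in model_input.items()}

model.train()
outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])  # the model shifts the labels internally
loss = outputs.loss                      # positions with label -100 are ignored by the loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```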
My questions:
1. Is the method I use to format the batch input correct?
2. Why can't the GPT2 tokenizer format batch input automatically, the way the BERT tokenizer does? (A sketch of the kind of one-call padding I mean is shown after this list.)
3. In this pre-training [demo](https://huggingface.co/learn/nlp-course/en/chapter7/6?fw=pt#preparing-the-dataset),
I found that it doesn't add bos and eos tokens, and only adds pad tokens at the end of the sequence.
So I think that at pre-training time you only need to add pad tokens to keep the sequence lengths consistent,
but when it comes to fine-tuning, an additional eos token needs to be added, and the eos token needs to take loss because the model has to learn when to stop generating.
Am I right?
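Regarding question 2, this is the kind of one-call batching I mean (a minimal sketch; it assumes reusing the eos token as the pad token is acceptable, and it still doesn't add the bos/eos tokens for me):

```python
# reuse eos as pad so the tokenizer is willing to pad the batch
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(datas,
                  padding='max_length',   # pad every sentence to max_length
                  max_length=max_length,
                  return_tensors='pt')
print(batch['input_ids'])       # padded with 50256 on the right
print(batch['attention_mask'])  # 0 at the padded positions
```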