I want to fine-tune GPT2 for text generation with batch input, and I use the following code to format the batch input:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(r'E:\pythonWork\models\gpt2')
max_length = 8
datas = [
    "The dog.",
    "The cute dog.",
]
model_input = tokenizer(datas)
print('original input:\n', model_input)

# prepare for batch input
# I add a bos token at the start and an eos token at the end, and pad on the right so the sentences have the
# same length. bos_token_id = eos_token_id = 50256, and there is no dedicated pad token, so I also use 50256 as the pad token.
labels_list = []
for i in range(len(datas)):
    input_ids = [tokenizer.bos_token_id] + model_input['input_ids'][i] + [tokenizer.eos_token_id]  # add bos and eos tokens
    input_ids = input_ids + max(0, max_length - len(input_ids)) * [tokenizer.eos_token_id]  # add padding tokens
    attention_mask = [1] + model_input['attention_mask'][i] + [1]  # attend to the bos and eos tokens
    attention_mask = attention_mask + max(0, max_length - len(attention_mask)) * [0]  # don't attend to padding tokens
    labels = [tokenizer.bos_token_id] + model_input['input_ids'][i] + [tokenizer.eos_token_id]  # take loss on bos and eos
    labels = labels + max(0, max_length - len(labels)) * [-100]  # padding doesn't take loss
    model_input['input_ids'][i] = input_ids
    model_input['attention_mask'][i] = attention_mask
    labels_list.append(labels)
model_input['labels'] = labels_list
print('batch input:\n', model_input)
```
Terminal output:
```
original input:
{'input_ids': [[464, 3290, 13], [464, 13779, 3290, 13]],
'attention_mask': [[1, 1, 1], [1, 1, 1, 1]]}
batch input:
{'input_ids': [[50256, 464, 3290, 13, 50256, 50256, 50256, 50256], [50256, 464, 13779, 3290, 13, 50256, 50256, 50256]],
'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0]],
'labels': [[50256, 464, 3290, 13, 50256, -100, -100, -100], [50256, 464, 13779, 3290, 13, 50256, -100, -100]]}
```
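For completeness, this is roughly how I plan to feed the batch to the model for one training step (a minimal sketch; the optimizer choice and learning rate are placeholders and not part of the code above):

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(r'E:\pythonWork\models\gpt2')
optimizer = AdamW(model.parameters(), lr=5e-5)

# convert the padded lists into tensors of shape (batch_size, max_length)
batch = {k: torch.tensor(v) for k, v in model_input.items()}

model.train()
outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])  # the model shifts the labels internally
loss = outputs.loss                      # positions with label -100 are ignored by the loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```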
My questions:
1. Is the method I use to format the batch input correct?
2. Why can't the GPT2 tokenizer format batch input automatically, the way the BERT tokenizer does? (A sketch of the kind of one-call padding I mean is shown after this list.)
3. In this pre-training [demo](https://huggingface.co/learn/nlp-course/en/chapter7/6?fw=pt#preparing-the-dataset),
I found that it doesn't add bos and eos tokens, and only adds pad tokens at the end of the sequence.
So I think that at pre-training time you only need to add pad tokens to keep the sequence lengths consistent,
but when it comes to fine-tuning, an additional eos token needs to be added, and the eos token needs to take loss because the model has to learn when to stop generating.
Am I right?
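Regarding question 2, this is the kind of one-call batching I mean (a minimal sketch; it assumes reusing the eos token as the pad token is acceptable, and it still doesn't add the bos/eos tokens for me):

```python
# reuse eos as pad so the tokenizer is willing to pad the batch
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(datas,
                  padding='max_length',   # pad every sentence to max_length
                  max_length=max_length,
                  return_tensors='pt')
print(batch['input_ids'])       # padded with 50256 on the right
print(batch['attention_mask'])  # 0 at the padded positions
```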