Generating [PAD] tokens during GPT2 inference

tessan · August 22, 2022, 4:01pm

I tried to fine-tune a DialoGPT model on a new dataset for conversational purposes. I preprocessed the data as follows:

# my dataset:
print(ds_train)
print(ds_train[0]['text'])

Output

Dataset({ features: ['text'], num_rows: 48423 })

[S1]:Yup, you’re right. Please May I know where is the event conducted and I need the complete address?
[S2]:Please note down the complete address of the event happening. It’s at Cornerstone Craft Beer & Live Music, 2367 Shattuck Avenue. Your reservation is successful and have a great time there!
[S1]:Thanks much for the information you’ve given. Please can you help me to find some intermediate priced restaurant that provides Ethiopian kind of food.
[S2]:Yup! There is an Ethiopian Restaurant named Addis Restaurant providing excellent and authentic traditional Ethiopian cuisine located in Berkeley. Do you wish to reserve a table here?
[S1]:At what number they are reachable?
[S2]:


def tokenize_function(examples):
    # Pad every example to a fixed length of 246 tokens
    return tokenizer(examples["text"], padding="max_length", add_special_tokens=True, max_length=246)

tokenized_ds_train = ds_train.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

# Labels are a straight copy of input_ids, padding included
tokenized_ds_train = tokenized_ds_train.add_column("labels", tokenized_ds_train[:]['input_ids'])

train_set = model.prepare_tf_dataset(
    tokenized_ds_train,
    shuffle=True,
    batch_size=1,
)
# Pull one batch to inspect what the model is trained on
sample = next(train_set.as_numpy_iterator())

print(tokenized_ds_train)
print(train_set)
print(sample)

Output

Dataset({ features: ['input_ids', 'attention_mask', 'labels'], num_rows: 48423 })

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None)}, TensorSpec(shape=(1, 246), dtype=tf.int64, name=None))>

({‘attention_mask’: array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), ‘input_ids’: array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 
50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]])}, array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]]))

But after a long training run, I get this kind of output:

a_dialog = ('[S1]: Does money buy happiness?\n' +
            '[S2]: Not entirely, money can contribute to be happy but do not have to be the source of your happiness.\n' +
            '[S1]: So, how to be happy?\n'+
            '[S2]:')
a_tokenized_dialog = tokenizer.encode(a_dialog, return_tensors="tf")
outputs = model.generate(a_tokenized_dialog, max_length=250)
print(tokenizer.decode(outputs[0]))

Output

[S1]: Does money buy happiness?
[S2]: Not entirely, money can contribute to be happy but do not have to be the source of your happiness.
[S1]: So, how to be happy?
[S2]:I want to be happy by eating food and by watching movies online.
[S1]:What type of food you want?
[S2]:[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]

I would like to know why the [PAD] token continues to be generated and how I can solve this issue.
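
One thing worth checking, judging from the sample batch above: the labels are an exact copy of input_ids, so every padded position (token id 50257) is also a training target, and each example ends with "[S2]:" followed directly by padding. Under that assumption, the model may simply be learning that [PAD] follows "[S2]:". A minimal sketch of a possible fix is to set the label to -100 wherever attention_mask is 0, since the loss in transformers ignores that value. The helper names build_labels and add_labels are hypothetical, not part of any library.

```python
IGNORE_INDEX = -100  # label value that the transformers loss functions skip


def build_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


def add_labels(examples):
    """Batched map function (hypothetical helper): adds a masked 'labels' column."""
    examples["labels"] = build_labels(examples["input_ids"], examples["attention_mask"])
    return examples


# Tiny stand-in batch: three real tokens followed by two [PAD] tokens (id 50257)
batch = {
    "input_ids": [[58, 4125, 220, 50257, 50257]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
print(add_labels(batch)["labels"])  # [[58, 4125, 220, -100, -100]]
```

In the pipeline above, this would presumably replace the add_column step with something like tokenized_ds_train.map(add_labels, batched=True). It does not by itself teach the model when to stop; for that, the training text would likely also need a real reply and an end-of-sequence token after each final "[S2]:".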
