I tried to fine-tune a DialoGPT model on a new dataset for conversational purposes. I preprocessed the data as follows:
#my dataset :
print(ds_train)
print(ds_train[0]['text'])
Output
Dataset({ features: ['text'], num_rows: 48423 })
[S1]:Yup, you’re right. Please May I know where is the event conducted and I need the complete address?
[S2]:Please note down the complete address of the event happening. It’s at Cornerstone Craft Beer & Live Music, 2367 Shattuck Avenue. Your reservation is successful and have a great time there!
[S1]:Thanks much for the information you’ve given. Please can you help me to find some intermediate priced restaurant that provides Ethiopian kind of food.
[S2]:Yup! There is an Ethiopian Restaurant named Addis Restaurant providing excellent and authentic traditional Ethiopian cuisine located in Berkeley. Do you wish to reserve a table here?
[S1]:At what number they are reachable?
[S2]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', add_special_tokens=True, max_length=246)

tokenized_ds_train = ds_train.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
tokenized_ds_train = tokenized_ds_train.add_column("labels", tokenized_ds_train[:]['input_ids'])
train_set = model.prepare_tf_dataset(
    tokenized_ds_train,
    shuffle=True,
    batch_size=1,
)
sample = train_set.as_numpy_iterator()
sample = sample.next()
print(tokenized_ds_train)
print(train_set)
print(sample)
Output
Dataset({ features: ['input_ids', 'attention_mask', 'labels'], num_rows: 48423 })
<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(1, 246), dtype=tf.int64, name=None)}, TensorSpec(shape=(1, 246), dtype=tf.int64, name=None))>
({‘attention_mask’: array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), ‘input_ids’: array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 
50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]])}, array([[ 58, 4125, 3110, 352, 5974, 314, 765, 284, 711, 440, 9190, 440, 14918, 440, 3825, 319, 616, 3359, 13, 198, 58, 4125, 3110, 362, 5974, 921, 765, 284, 3350, 262, 3496, 440, 9190, 440, 14918, 440, 3825, 4291, 262, 3195, 11, 826, 30, 198, 58, 4125, 3110, 352, 5974, 1320, 318, 826, 13, 1867, 2099, 286, 3496, 318, 340, 30, 198, 58, 4125, 3110, 362, 5974, 632, 318, 5610, 739, 262, 12136, 6536, 290, 534, 3496, 468, 2067, 13, 198, 58, 4125, 3110, 352, 5974, 20558, 617, 1637, 329, 502, 13, 198, 58, 4125, 3110, 362, 5974, 220, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]]))
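One thing visible in the sample above is that the labels are an exact copy of input_ids, so the padded positions (id 50257) are also training targets. That may be part of why the model learns to emit the pad token. A minimal sketch of masking those positions is shown here; it assumes the loss ignores the label value -100, which is the convention in transformers, and the helper names `mask_padding_labels` and `add_labels` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (transformers convention)


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


def add_labels(batch):
    # Hypothetical batched map function: builds a "labels" column from the
    # tokenizer output instead of copying input_ids verbatim.
    batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
    return batch


# Tiny stand-in batch: three real tokens followed by two pad tokens (id 50257).
example = {
    "input_ids": [[58, 4125, 3110, 50257, 50257]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
print(add_labels(example)["labels"])  # [[58, 4125, 3110, -100, -100]]
```

Under these assumptions, something like `tokenized_ds_train.map(add_labels, batched=True)` could take the place of the `add_column` call. Note that this masks every padded position, so the model would also get no signal about where a reply should end unless an end-of-sequence token is part of the text.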
After a long training run, I get this kind of output:
a_dialog = ('[S1]: Does money buy happiness?\n' +
'[S2]: Not entirely, money can contribute to be happy but do not have to be the source of your happiness.\n' +
'[S1]: So, how to be happy?\n'+
'[S2]:')
a_tokenized_dialog = tokenizer.encode(a_dialog, return_tensors="tf")
outputs = model.generate(a_tokenized_dialog, max_length=250)
print(tokenizer.decode(outputs[0]))
Output
[S1]: Does money buy happiness?
[S2]: Not entirely, money can contribute to be happy but do not have to be the source of your happiness.
[S1]: So, how to be happy?
[S2]:I want to be happy by eating food and by watching movies online.
[S1]:What type of food you want?
[S2]:[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]
I would like to know why the [PAD] token continues to be generated and how I can solve this issue.
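As a stopgap on the generation side, the output could be cut at the first pad token before decoding. This is only a sketch: `truncate_at_pad` is a hypothetical helper, and the pad id 50257 is taken from the input_ids sample shown earlier. It hides the symptom rather than explaining why the model produces [PAD].

```python
def truncate_at_pad(token_ids, pad_token_id):
    """Return token_ids up to (not including) the first pad_token_id."""
    ids = list(token_ids)
    if pad_token_id in ids:
        return ids[: ids.index(pad_token_id)]
    return ids


PAD_ID = 50257  # assumption: the pad id seen in the tokenized sample

# Stand-in for a generated sequence: real tokens followed by repeated padding.
generated = [58, 4125, 3110, 362, 5974, 220, PAD_ID, PAD_ID, PAD_ID]
print(truncate_at_pad(generated, PAD_ID))  # [58, 4125, 3110, 362, 5974, 220]
```

With the code above, this would be applied to something like `outputs[0].numpy().tolist()` before calling `tokenizer.decode`. Passing `skip_special_tokens=True` to `tokenizer.decode` may have a similar effect if [PAD] is registered as a special token.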