I’m trying to fine-tune BART for paraphrasing. It’s my first time fine-tuning a model, so I think I’m doing something wrong…
The input data is just pairs of sentences. I managed to get training running, at least apparently. Here’s the code:
from transformers import BartForConditionalGeneration, AdamW
from tqdm import tqdm

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
model.train()

optimizer = AdamW(model.parameters(), lr=1e-5)

for i_epoch in tqdm(range(3), desc="Epoch"):
    print("Epoch #", i_epoch)
    for i_batch, data in enumerate(tqdm(train_dataloader, desc="Training batches")):
        # encoder side: data['train_ids'] / data['train_att']
        # decoder side and labels: data['val_ids'] / data['val_att']
        outputs = model(data['train_ids'],
                        attention_mask=data['train_att'],
                        decoder_input_ids=data['val_ids'],
                        decoder_attention_mask=data['val_att'],
                        labels=data['val_ids'])
        loss = outputs[0]
        if i_batch % 10 == 0:
            print("Batch", i_batch, " loss =", loss)
        loss.backward()
        optimizer.step()
I guess the first question is whether this looks correct? The data coming out of the dataloader is populated with the input_ids and attention_mask results from tokenizer() on the training and validation sets. I wanted to write the training loop myself instead of using Trainer, for learning purposes.
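For context, the data preparation is roughly along these lines. This is a simplified sketch with toy sentence pairs: the ParaphraseDataset class, the choice of BartTokenizer for 'facebook/bart-base', and the padding, max length, and batch size are just placeholders, but the dictionary keys match what the training loop above expects:

from torch.utils.data import Dataset, DataLoader
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# Toy sentence pairs, just to illustrate the shape of the data
source_sentences = ["Yesterday they went to the park.", "He finished the report on time."]
target_sentences = ["They visited the park yesterday.", "He got the report done on schedule."]

class ParaphraseDataset(Dataset):
    def __init__(self, sources, targets):
        # tokenizer() gives input_ids and attention_mask for each side of the pair
        self.src = tokenizer(sources, padding='max_length', truncation=True,
                             max_length=64, return_tensors='pt')
        self.tgt = tokenizer(targets, padding='max_length', truncation=True,
                             max_length=64, return_tensors='pt')

    def __len__(self):
        return self.src['input_ids'].shape[0]

    def __getitem__(self, idx):
        return {'train_ids': self.src['input_ids'][idx],
                'train_att': self.src['attention_mask'][idx],
                'val_ids': self.tgt['input_ids'][idx],
                'val_att': self.tgt['attention_mask'][idx]}

train_dataloader = DataLoader(ParaphraseDataset(source_sentences, target_sentences),
                              batch_size=2, shuffle=True)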
Looking at the loss alone, this kinda seems to work? Here’s the output:
Epoch # 0
Training batches: 100% 90/90 [3:55:25<00:00, 156.96s/it]
Batch 0 loss = tensor(14.6109, grad_fn=<NllLossBackward>)
Batch 10 loss = tensor(11.6407, grad_fn=<NllLossBackward>)
Batch 20 loss = tensor(10.6421, grad_fn=<NllLossBackward>)
Batch 30 loss = tensor(9.4968, grad_fn=<NllLossBackward>)
Batch 40 loss = tensor(8.4232, grad_fn=<NllLossBackward>)
Batch 50 loss = tensor(6.9087, grad_fn=<NllLossBackward>)
Batch 60 loss = tensor(5.8986, grad_fn=<NllLossBackward>)
Batch 70 loss = tensor(5.3631, grad_fn=<NllLossBackward>)
Batch 80 loss = tensor(4.9015, grad_fn=<NllLossBackward>)
Epoch # 1
Training batches: 100% 90/90 [1:57:47<00:00, 78.53s/it]
Batch 0 loss = tensor(4.5883, grad_fn=<NllLossBackward>)
Batch 10 loss = tensor(4.1895, grad_fn=<NllLossBackward>)
Batch 20 loss = tensor(3.7548, grad_fn=<NllLossBackward>)
Batch 30 loss = tensor(3.4811, grad_fn=<NllLossBackward>)
Batch 40 loss = tensor(3.2216, grad_fn=<NllLossBackward>)
Batch 50 loss = tensor(2.9044, grad_fn=<NllLossBackward>)
Batch 60 loss = tensor(2.3631, grad_fn=<NllLossBackward>)
Batch 70 loss = tensor(2.1639, grad_fn=<NllLossBackward>)
Batch 80 loss = tensor(1.8803, grad_fn=<NllLossBackward>)
Epoch # 2
Training batches: 100% 90/90 [1:57:34<00:00, 78.39s/it]
Batch 0 loss = tensor(1.6901, grad_fn=<NllLossBackward>)
Batch 10 loss = tensor(1.7226, grad_fn=<NllLossBackward>)
Batch 20 loss = tensor(1.2894, grad_fn=<NllLossBackward>)
Batch 30 loss = tensor(0.9937, grad_fn=<NllLossBackward>)
Batch 40 loss = tensor(0.9841, grad_fn=<NllLossBackward>)
Batch 50 loss = tensor(0.9459, grad_fn=<NllLossBackward>)
Batch 60 loss = tensor(0.6868, grad_fn=<NllLossBackward>)
Batch 70 loss = tensor(0.6640, grad_fn=<NllLossBackward>)
Batch 80 loss = tensor(0.5731, grad_fn=<NllLossBackward>)
The loss goes down consistently over the course of training, which I took as a basic sign that something is happening.
But my generation code produces repetitive gibberish instead! Here’s the generation code:
test_input = "Yesterday they went to the park, and today they will go to the store."
test_inputs = tokenizer([test_input], return_tensors='pt')
summary_ids = model.generate(test_inputs['input_ids'],
                             num_beams=12,
                             temperature=1.0,
                             num_return_sequences=10,
                             repetition_penalty=1.0,
                             do_sample=False)
for res in [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]:
    print(res)
The output is this:
For for for for for for for for for for for for for for for for for
Yesterday,,,,,,,,,,,,,,,,
The the the the the the the the the the the the the the the the the
The thethe the the the the the the the the the the the the the the
On on on on on on on on on on on on on on on on on
For for for for for for for for for for forfor for for for for for
The thethethe the the the the the the the the the the the the the
Yesterday Yesterday Yesterday Yesterday Yesterday,,,,,,,,,,,,
For for for for for for for for for for for for For for for for for
The the the the the the the the the the store store store store store store store
(For comparison, before fine-tuning, the output of this same code is just the input sentence.)
Granted, this was fine-tuned on a small toy dataset of 10k sentence pairs, mainly to test that the code runs. But seeing the output degenerate into repeated tokens like this makes me hesitant to try a bigger dataset. Surely 10k pairs is enough that the output should at least not be complete junk? So now I’m wondering whether I did something simple wrong, or whether it’s really a data problem and I just need to train on a much larger dataset (1M+ examples?) to see results.