What can cause model.generate (BART) output to be gibberish after fine-tuning?

I’m trying to fine-tune BART for paraphrasing. It’s my first time fine-tuning a model, so I think I’m doing something wrong…

The input data is just pairs of sentences. I managed to get the training working, at least apparently. Here’s the code:

from transformers import BartForConditionalGeneration, AdamW
from tqdm import tqdm

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
model.train()
optimizer = AdamW(model.parameters(), lr=1e-5)

for i_epoch in tqdm(range(3), desc="Epoch"):
  print("Epoch #", i_epoch)
  for i_batch, data in enumerate(tqdm(train_dataloader, desc="Training batches")):
    outputs = model(data['train_ids'],
                    attention_mask=data['train_att'],
                    decoder_input_ids=data['val_ids'],
                    decoder_attention_mask=data['val_att'],
                    labels=data['val_ids']
                    )
    loss = outputs[0]
    if i_batch % 10 == 0:
      print("Batch", i_batch, " loss =", loss)
    loss.backward()
    optimizer.step()

I guess the first question is whether this looks correct. The batches from the dataloader are populated with the input_ids and attention_mask returned by tokenizer() on the training and validation sets. I wanted to write the training loop myself instead of using Trainer, for learning purposes.
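
For context, here’s roughly how the dataloader is built (a simplified sketch of my actual code; the sentences and batch size below are just placeholders):

from torch.utils.data import DataLoader
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# toy stand-ins for the real 10k sentence pairs (source sentence -> paraphrase)
train_sentences = ["Yesterday they went to the park."]
val_sentences = ["They visited the park yesterday."]

train_enc = tokenizer(train_sentences, padding=True, truncation=True, return_tensors='pt')
val_enc = tokenizer(val_sentences, padding=True, truncation=True, return_tensors='pt')

# each item carries the keys the training loop above expects
dataset = [
    {'train_ids': train_enc['input_ids'][i],
     'train_att': train_enc['attention_mask'][i],
     'val_ids': val_enc['input_ids'][i],
     'val_att': val_enc['attention_mask'][i]}
    for i in range(len(train_sentences))
]
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)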

Looking at the loss alone, this kinda seems to work? Here’s the output:

Epoch # 0
Training batches: 100%
90/90 [3:55:25<00:00, 156.96s/it]
Batch 0  loss = tensor(14.6109, grad_fn=<NllLossBackward>)
Batch 10  loss = tensor(11.6407, grad_fn=<NllLossBackward>)
Batch 20  loss = tensor(10.6421, grad_fn=<NllLossBackward>)
Batch 30  loss = tensor(9.4968, grad_fn=<NllLossBackward>)
Batch 40  loss = tensor(8.4232, grad_fn=<NllLossBackward>)
Batch 50  loss = tensor(6.9087, grad_fn=<NllLossBackward>)
Batch 60  loss = tensor(5.8986, grad_fn=<NllLossBackward>)
Batch 70  loss = tensor(5.3631, grad_fn=<NllLossBackward>)
Batch 80  loss = tensor(4.9015, grad_fn=<NllLossBackward>)

Epoch # 1
Training batches: 100%
90/90 [1:57:47<00:00, 78.53s/it]
Batch 0  loss = tensor(4.5883, grad_fn=<NllLossBackward>)
Batch 10  loss = tensor(4.1895, grad_fn=<NllLossBackward>)
Batch 20  loss = tensor(3.7548, grad_fn=<NllLossBackward>)
Batch 30  loss = tensor(3.4811, grad_fn=<NllLossBackward>)
Batch 40  loss = tensor(3.2216, grad_fn=<NllLossBackward>)
Batch 50  loss = tensor(2.9044, grad_fn=<NllLossBackward>)
Batch 60  loss = tensor(2.3631, grad_fn=<NllLossBackward>)
Batch 70  loss = tensor(2.1639, grad_fn=<NllLossBackward>)
Batch 80  loss = tensor(1.8803, grad_fn=<NllLossBackward>)

Epoch # 2
Training batches: 100%
90/90 [1:57:34<00:00, 78.39s/it]
Batch 0  loss = tensor(1.6901, grad_fn=<NllLossBackward>)
Batch 10  loss = tensor(1.7226, grad_fn=<NllLossBackward>)
Batch 20  loss = tensor(1.2894, grad_fn=<NllLossBackward>)
Batch 30  loss = tensor(0.9937, grad_fn=<NllLossBackward>)
Batch 40  loss = tensor(0.9841, grad_fn=<NllLossBackward>)
Batch 50  loss = tensor(0.9459, grad_fn=<NllLossBackward>)
Batch 60  loss = tensor(0.6868, grad_fn=<NllLossBackward>)
Batch 70  loss = tensor(0.6640, grad_fn=<NllLossBackward>)
Batch 80  loss = tensor(0.5731, grad_fn=<NllLossBackward>)

The loss is consistently going down over the course of the training, which I thought should be a basic sign that something is happening.

But my generation code produces repetitive gibberish instead! Here’s the generation code:

test_input = "Yesterday they went to the park, and today they will go to the store."
test_inputs = tokenizer([test_input], return_tensors='pt')
summary_ids = model.generate(
    test_inputs['input_ids'],
    num_beams=12,
    temperature=1.0,
    num_return_sequences=10,
    repetition_penalty=1.0,
    do_sample=False
  )
for res in [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]:
  print(res)

The output is this:

For for for for for for for for for for for for for for for for for
Yesterday,,,,,,,,,,,,,,,,
The the the the the the the the the the the the the the the the the
The thethe the the the the the the the the the the the the the the
On on on on on on on on on on on on on on on on on
For for for for for for for for for for forfor for for for for for
The thethethe the the the the the the the the the the the the the
Yesterday Yesterday Yesterday Yesterday Yesterday,,,,,,,,,,,,
For for for for for for for for for for for for For for for for for
The the the the the the the the the the store store store store store store store

(In comparison, before fine-tuning, the output for this code is the same as the input)

Granted, this is fine-tuned on a small toy dataset of 10k sentence pairs, mainly to test that the code runs. But seeing the output collapse into repeated tokens like this makes me hesitant to try a bigger dataset. Surely 10k examples is enough data that the output should at least not be complete junk? So I’m wondering whether I did something simple wrong, or whether it’s actually a data problem and I just need to train on a fuller dataset (1m+ examples?) to see results.

Do you need to zero your gradients for BART?

(I’ve not used BART, but when training BERT I need to call model.zero_grad() before passing each batch of data to the model.)

Does your data look similar to the data BART was originally trained on? If it is totally different, your model could get worse before it gets better. What are you hoping it will learn from your new data?


Hi @rgwatwormhill, gradients need to be zeroed for every PyTorch model, otherwise they accumulate across batches.
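
In your loop that means calling optimizer.zero_grad() (or model.zero_grad()) once per batch, e.g. (a sketch keeping your variable names and forward call; the decoder inputs themselves are a separate issue, see below):

for i_batch, data in enumerate(train_dataloader):
    optimizer.zero_grad()   # clear gradients accumulated from the previous batch
    outputs = model(data['train_ids'],
                    attention_mask=data['train_att'],
                    decoder_input_ids=data['val_ids'],
                    decoder_attention_mask=data['val_att'],
                    labels=data['val_ids'])
    loss = outputs[0]
    loss.backward()
    optimizer.step()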

Hi @hyura,
For BART or any other seq2seq model, the decoder_input_ids need to be shifted right, i.e. the decoder sequence needs to start with decoder_start_token_id, which is usually the bos, pad, or eos token. For BART, it’s eos.

This means the decoder first takes the decoder_start_token_id and produces the first token of the labels. If the sequence isn’t shifted, the decoder is just trained to copy whatever token it receives at each step, which could be the reason for this weird generation.
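
As a small illustration of the shift (a sketch using the shift_tokens_right helper from modeling_bart mentioned below; exact tokens depend on the tokenizer):

from transformers import BartTokenizer
from transformers.modeling_bart import shift_tokens_right

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

labels = tokenizer(["a short target sentence"], return_tensors="pt")["input_ids"]
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)

# labels:            <s>  a   short  target  sentence  </s>
# decoder_input_ids: </s> <s> a      short   target    sentence   (eos wrapped to the front)
# so at each step the decoder sees the previous target token and learns to predict the next one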

If you are on the master version, there are a few helpers for preparing the data:

  • Use the prepare_seq2seq_batch method; this will return input_ids, attention_mask and labels.
  • Use modeling_bart.shift_tokens_right to prepare the decoder_input_ids.
  • Set pad tokens in the labels to -100 so they’ll be ignored by the cross-entropy loss.

from transformers.modeling_bart import shift_tokens_right

input_text = "Some input text"
output_text = "Paraphrase Text"

# enc will contain input_ids, attention_mask and labels
enc = tokenizer.prepare_seq2seq_batch(src_texts=input_text, tgt_texts=output_text, return_tensors="pt")

# shift the labels right to build the decoder inputs (eos is wrapped to position 0)
decoder_input_ids = shift_tokens_right(enc["labels"], tokenizer.pad_token_id)

# replace pad tokens in the labels with -100 so the cross-entropy loss ignores them
labels = enc["labels"]
labels[labels == tokenizer.pad_token_id] = -100

Hope this helps.
cc @sshleifer


Thanks very much, I’ll give that a try! Just to be clear, this is only done for training, right? Since model.generate() accepts different parameters (no labels), I assume it handles the shifting automatically?