T5: Tips for finetuning on crossword clues (clue => answer)

As a baseline for a research project, I am trying to finetune T5 on a large set of crossword clues (130,000 clues), where the source is a clue and the target is its answer.

  • I am using T5ForConditionalGeneration and the finetune.py script (examples/seq2seq). I started with T5-small.
  • My source/target files have one example per line (<Clue>\n in the source file, <Answer>\n in the target file).
  • I started with from_pretrained("t5-small") for both the model and the tokenizer (see the sketch after this list).
  • I didn’t add any tokens to the vocabulary.
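For concreteness, my understanding is that the setup above is roughly equivalent to the minimal sketch below (the clue/answer strings are made up, and this is hand-rolled rather than my actual finetune.py invocation):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One clue/answer pair, formatted the same way as one line of the
# source/target files (no extra tokens, no punctuation).
clue = "Feline that purrs"   # illustrative clue
answer = "CAT"               # illustrative answer

inputs = tokenizer(clue, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids

# Teacher-forced forward pass; the default loss is token-level
# cross-entropy over the target sequence.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
print(float(loss))
```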

The initial run gave me only gibberish (long strings of entirely non-English output), so I am trying an even simpler task: can T5 learn to select the first word of the input sentence? That is, I’ve modified the inputs and outputs to be something like
source: This is a clue with some normal language
target: This

where, again, each entry is on its own line.
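For reference, the toy source/target files are generated by something like the sketch below (file names are placeholders for my actual paths):

```python
# Build the toy "predict the first word" task from a file of raw clues.
with open("clues.txt") as raw, \
        open("train.source", "w") as src, \
        open("train.target", "w") as tgt:
    for line in raw:
        clue = line.strip()
        if not clue:
            continue
        src.write(clue + "\n")             # full sentence as the source
        tgt.write(clue.split()[0] + "\n")  # first word as the target
```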

I observed (under the same training regime as above, with T5-small) that, after 300 epochs, the model gives outputs that look like
<first word> <long string of gibberish>

I wonder if anyone has some ideas:

  • Is there any issue with having only one-word targets? I.e., should I be using a different loss function than the default? At epoch 2, my loss was already down to 0.001, and ROUGE scores were around 1.5. (See the batch-preparation sketch after this list.)
  • I did not change the task name. The finetune.py script seems to default to the summarization task name. Maybe I should remove or change the task name?
  • How much would moving from T5-small to T5-base or T5-large change the results?
  • I did not add separator tokens, but I don’t think they are required, given that the examples (e.g. finetune_bart_tiny) do not add separator tokens.
  • My inputs and outputs generally do not have punctuation (i.e. the clues don’t end in a period, and neither do the answers). I wonder whether adding punctuation would help?
  • I read through https://discuss.huggingface.co/t/t5-finetuning-tips/684, but I’m not sure whether those tweaks will change the results here. I’m mostly just slightly adapting the provided finetune_t5.sh script.
  • What’s a reasonable number of epochs of finetuning (using a 60% split of 130,000, so roughly 80k training examples) before I should expect the model to learn to output the first word?
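On the loss and task-prefix questions above: if I were preparing a batch by hand instead of going through finetune.py, my understanding is it would look roughly like the sketch below. The prefix string and max lengths are assumptions on my part; -100 is the label index that the default cross-entropy loss ignores, which matters for short one-word targets that are mostly padding.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Hypothetical task prefix in place of the default summarization prefix.
prefix = "crossword clue: "
clues = ["Feline that purrs", "Opposite of day"]   # illustrative examples
answers = ["CAT", "NIGHT"]

enc = tokenizer([prefix + c for c in clues],
                padding=True, truncation=True, max_length=64,
                return_tensors="pt")
labels = tokenizer(answers,
                   padding=True, truncation=True, max_length=8,
                   return_tensors="pt").input_ids

# Mask out padding in the labels so padded positions don't contribute
# to the loss; with one-word targets most label positions are padding,
# so without this the loss can look tiny while the model learns little.
labels[labels == tokenizer.pad_token_id] = -100
```

(Whether finetune.py already does this masking internally is part of what I’m unsure about.)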

I also have an implementation question:
Is there a way to get the finetune.py script to print validation results at every epoch so that I can see how the model is learning (qualitatively) over time?
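As a workaround, I was planning to periodically load the latest saved checkpoint and generate on a few validation lines, roughly like this sketch (the checkpoint path, file name, and generation arguments are placeholders), but it would be nicer if the script could log this per epoch directly:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

ckpt_dir = "output_dir/best_tfmr"   # placeholder: wherever the script saved a checkpoint
tokenizer = T5Tokenizer.from_pretrained(ckpt_dir)
model = T5ForConditionalGeneration.from_pretrained(ckpt_dir).eval()

with open("val.source") as f:       # placeholder validation file
    clues = [line.strip() for line in f][:5]

batch = tokenizer(clues, padding=True, return_tensors="pt")
generated = model.generate(batch.input_ids,
                           attention_mask=batch.attention_mask,
                           max_length=8,
                           num_beams=4,
                           early_stopping=True)
for clue, ids in zip(clues, generated):
    print(clue, "=>", tokenizer.decode(ids, skip_special_tokens=True))
```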

I filed this bug for the gibberish outputs I am observing.