Fine-tuning mT5 for Finnish QG

I’m trying to fine-tune mT5-base for Finnish QG using a machine-translated Finnish version of SQuAD1.1 and the Finnish partition of the TyDi QA GoldP data (~74k examples in total).

I’m using a highlighting technique where the input has this format:

"Luo kysymys: <c1>, <c2>, ... [HL], <ci>,<cj>, [HL], ..., <cn>" (“Luo kysymys” = “Generate a question”)

, where <cx>s are tokens in the passage and the tokens inside the [HL] is the answer to the question.
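
For concreteness, here’s a minimal sketch of how such an input can be built (the function and field names are just illustrative; the real preprocessing is in the script linked below):

```python
def build_qg_input(context: str, answer_start: int, answer_text: str) -> str:
    """Wrap the answer span in [HL] markers and prepend the Finnish task prefix."""
    answer_end = answer_start + len(answer_text)
    return (
        "Luo kysymys: "
        + context[:answer_start]
        + "[HL] " + context[answer_start:answer_end] + " [HL]"
        + context[answer_end:]
    )

# e.g. for a SQuAD-style example:
print(build_qg_input("Saimaan kanava on Suomen suurin kanava.", 0, "Saimaan kanava"))
# -> Luo kysymys: [HL] Saimaan kanava [HL] on Suomen suurin kanava.
```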

However, the results so far haven’t been very promising. The outputs were quite strange, especially during the first 5 epochs: almost every output started with <extra_id_0>, as if the first word or two of the output had been replaced by the tag. I read that fine-tuning mT5 for more epochs might make those tags disappear, so I fine-tuned for 25 epochs in total. The <extra_id_0> did disappear from the outputs, but they’re still of quite bad quality.
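
As far as I understand, mT5 (unlike the original T5) is pretrained purely on the span-corruption objective with no supervised tasks mixed in, so an insufficiently fine-tuned checkpoint tends to fall back to emitting sentinel tokens such as <extra_id_0>. A related thing I’m not 100% sure my script handles: if [HL] isn’t registered as a special token, the SentencePiece tokenizer will split it into subword pieces. Something like this should rule that out:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

# Register [HL] as a single atomic token and grow the embedding matrix
# to match, so the highlight markers aren't split into subword pieces.
tokenizer.add_special_tokens({"additional_special_tokens": ["[HL]"]})
model.resize_token_embeddings(len(tokenizer))
```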

Here are some examples:

predicted: vuoden 1929 elokuvan ensimmäinen voittaja jaettiin? (≈ “the first winner of the 1929 film was awarded?”)
reference: Minä vuonna jaettiin ensimmäiset Oscarit? (“In what year were the first Oscars awarded?”)

predicted: Suomen syvin? (“Finland’s deepest?”)
reference: Mikä on Suomen suurin kanava? (“What is Finland’s largest canal?”)

One thing I find very odd is that almost none of the generated questions start with an interrogative word (e.g. “Mikä” = “what”, “Ketkä” = “who” (pl.), “Milloin” = “when”), almost as if the beginnings had been chopped off. The first word is rarely even capitalized. The data isn’t very high quality, since most of it is machine-translated, but I’ve already used it with some success to fine-tune BERT for QG using the BERT-HLSQG method. So I don’t think the data is the problem.
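
To rule out preprocessing or decoding as the cause of the missing question words, I’ve been thinking of sanity checks along these lines (reusing the tokenizer and model from the snippet above; the texts are made-up examples):

```python
source_text = "Luo kysymys: [HL] Saimaan kanava [HL] on Suomen suurin kanava."
target_text = "Mikä on Suomen suurin kanava?"

# 1) Decode the tokenized target back to text: if the interrogative word
#    is already missing here, the problem is in preprocessing, not the model.
label_ids = tokenizer(target_text, truncation=True, max_length=64).input_ids
print(tokenizer.decode(label_ids))  # should start with "Mikä ..."

# 2) Generate with explicit settings so truncation or greedy decoding
#    can't silently mangle the output.
inputs = tokenizer(source_text, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```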

I wonder if there’s something wrong with the fine-tuning itself instead. Here’s the code I’m using for that:
Script for fine-tuning a multilingual T5 model (mT5-base) for Finnish QG · GitHub
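
In rough terms it’s a standard Seq2SeqTrainer pipeline; a simplified sketch of that kind of setup (hyperparameters here are illustrative rather than my exact values, and train_dataset / eval_dataset stand in for the tokenized data):

```python
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# `model` and `tokenizer` as loaded in the earlier snippet.
args = Seq2SeqTrainingArguments(
    output_dir="mt5-base-finnish-qg",
    per_device_train_batch_size=8,
    learning_rate=1e-4,   # constant LRs around 1e-4..1e-3 are reportedly common for (m)T5
    num_train_epochs=25,
    predict_with_generate=True,
    fp16=False,           # fp16 is reportedly unstable with mT5, so kept off
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized inputs + labels (pad ids replaced with -100)
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```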

I’m using Python 3.8.6, transformers 4.8.1, and torch 1.9.0+cu111, and I’m training on two V100 GPUs.

Does anyone have ideas about what I could do to improve the model’s performance?

@valhalla I hope you don’t mind me tagging you, but you seem to be an expert in QG.

I’d also appreciate suggestions on other models I could try for QG. So far I’ve tried Finnish GPT-2 and BERT with moderate success.