GPT2 for QA Pair Generation

I was wondering if it were possible to somehow train GPT2 to generate question-answer pairs in a particular domain?

I’ve tried this with seq2seq models. I have worked on QA pair generation (separately) using T5 with decent results. You can find it here.

One way we can do this with GPT-2 is to prepare our input like this.
Say our context is “42 is the answer to life, the universe and everything”, the answer is “42”, and the target question is “What is the answer to life, universe and everything ?”

Then
input text: context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42

and prepare the attention mask such that there is no attention from the question: ... part, so the model won’t look into future tokens, and calculate the loss only on the question: ... part. At inference time we feed only the context part and ask the model to generate the question.

This is just one way I can think of off the top of my head. Feel free to correct me if this is wrong.
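
The input format described above can be sketched as two small string helpers. This is only an illustration: the function names are hypothetical, and ending the inference prompt with `question:` is an assumption about how to cue the model, not something the post specifies.

```python
def build_training_text(context, question, answer):
    """Join the three fields into one training string in the format shown above."""
    return f"context: {context} question: {question} answer: {answer}"


def build_inference_prompt(context):
    """Prompt with only the context; the trailing 'question:' cue is an assumption."""
    return f"context: {context} question:"


training_text = build_training_text(
    "42 is the answer to life, the universe and everything.",
    "What is the answer to life, universe and everything ?",
    "42",
)
prompt = build_inference_prompt("42 is the answer to life, the universe and everything.")
```

At generation time the model would be asked to continue `prompt`, and whatever it produces after `question:` is taken as the generated question.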

@valhalla Thanks for your response. That’s an interesting approach! Does that still require humans to create training “context” strings for gpt2?

@valhalla If I understand this correctly:

  1. The input text will look like context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42

  2. Mask out the question part so the new text will look like
    context: 42 is the answer to life, the universe and everything. <BIG MASK> answer: 42

  3. That is what gets fed as input text into the GPT2 model

Does this mean I define the labels into the model as the text that is masked?

By mask, I meant attention_mask. The attention_mask should be zero on the text you want to predict, so the model won’t peek into the future.
So if you want to generate the question and answer, then the question and answer tokens should have 0 in the attention mask.

Ah yes, sorry for my misunderstanding. So we mask out the parts we want to predict by setting the attention_mask of those tokens to 0.

With these tokens masked in attention_mask, do we then pass it and the input string to GPT2 and train it with the language model head with no labels?

You’ll still need to pass labels for training.

Training will be the same as training any GPT-2 model; the only difference is the attention_mask.

If I only wanted to generate questions, would I set the attention_mask for those tokens to 0 and use their text as the labels? Something like:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default; padding=True needs one

def my_data_collator(text_str):
    encoded_results = tokenizer(text_str, padding=True, truncation=True, return_tensors='pt',
                                return_attention_mask=True)
    # function to set attention mask to 0 on tokens in the question:... part of text_str
    encoded_results['attention_mask'] = set_my_attention_mask(encoded_results)
    # function to return list of token ids for question:... part of text_str
    label_ids = get_my_label_str(encoded_results['input_ids'])

    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['past'] = None
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = label_ids
    batch['use_cache'] = True
    return batch

text_str = 'context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42'

And batch would get passed to a GPT2LMHeadModel?

This seems correct. One more thing to add: you can calculate the loss only on the question: ... part.

To do this, set labels to -100 for the tokens before the question: part, so cross entropy will ignore them.

Also, you won’t need to explicitly set some arguments (position_ids, head_mask, etc.) to None.
They are None by default, so it’s okay if you don’t pass them. That will make the code cleaner.
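
A minimal sketch of the -100 label trick, using plain Python lists so it runs without a tokenizer. The token ids and the helper name `build_inputs_and_labels` are hypothetical stand-ins; in practice the ids would come from the tokenizer and the lists would be turned into tensors. It assumes everything after the context is the part to be scored.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_inputs_and_labels(prompt_ids, target_ids):
    """Concatenate prompt and target ids; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    # Same length as input_ids: -100 over the prompt, real ids over the target.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Hypothetical token ids standing in for real tokenizer output.
context_ids = [101, 102, 103]   # "context: ..." part
question_ids = [201, 202]       # "question: ..." part
input_ids, labels = build_inputs_and_labels(context_ids, question_ids)
```

The labels stay aligned with `input_ids` position by position, because `GPT2LMHeadModel` shifts them internally before computing the loss.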

@valhalla If we set the context labels to -100, this will make the model ignore the context while training. In other words, the generation of the questions won’t be context-based. Am I right?
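
One way to reason about this question is a toy example, which is not from the thread. The "model" below is a made-up causal function, not GPT-2; it only illustrates that a -100 label removes a position from the loss, while the token at that position stays in `input_ids` and still shapes later predictions.

```python
import math

IGNORE_INDEX = -100


def toy_next_token_probs(prefix, vocab_size):
    """Made-up causal 'model': the favoured next token depends on the whole prefix."""
    favoured = sum(prefix) % vocab_size
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[favoured] = 0.9
    return probs


def masked_loss(input_ids, labels, vocab_size):
    """Mean negative log-likelihood over positions whose label is not -100."""
    losses = []
    for t in range(1, len(input_ids)):
        target = labels[t]
        if target == IGNORE_INDEX:
            continue  # position not scored, but its token remains in later prefixes
        probs = toy_next_token_probs(input_ids[:t], vocab_size)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


labels = [IGNORE_INDEX, IGNORE_INDEX, 3, 1]   # loss only on the last two tokens
loss_a = masked_loss([1, 2, 3, 1], labels, vocab_size=5)  # context tokens 1, 2
loss_b = masked_loss([2, 2, 3, 1], labels, vocab_size=5)  # context tokens 2, 2
print(loss_a, loss_b)
```

The two losses differ even though the context labels are -100 in both cases, because the predictions at the scored positions are computed from the full prefix, context included.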