GPT2 for QA Pair Generation

I was wondering if it were possible to somehow train GPT2 to generate question-answer pairs in a particular domain?

1 Like

I’ve tried this with seq2seq models. I have worked on QA pair generation (separately) using T5, with decent results. You can find it here.

One way we can do this with GPT-2 is to prepare our input like this.
Our context is 42 is the answer to life, the universe and everything, the answer is 42, and the target question is What is the answer to life, universe and everything ?

Then
input text: context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42

and prepare the attention mask so that there is no attention on the question: ... part; that way the model won’t look at future tokens, and we calculate the loss only on the question: ... part. At inference time we feed only the context part and ask the model to generate the question.
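To make that concrete, here is a toy sketch of the input ids, attention mask, and labels for this scheme (the token ids are made up; a real script would get them from the GPT-2 tokenizer):

```python
# Toy token ids standing in for a real GPT-2 tokenization.
context_ids = [101, 102, 103, 104]   # "context: 42 is the answer ..."
question_ids = [201, 202, 203]       # "question: What is the answer ..."
answer_ids = [301]                   # "answer: 42"

input_ids = context_ids + question_ids + answer_ids

# 0 on the question tokens, so the model cannot attend to the text
# it is being trained to generate.
attention_mask = (
    [1] * len(context_ids)
    + [0] * len(question_ids)
    + [1] * len(answer_ids)
)

# -100 everywhere except the question part, so cross entropy
# computes the loss only on the question tokens.
labels = (
    [-100] * len(context_ids)
    + question_ids
    + [-100] * len(answer_ids)
)
```

The question tokens are still present in input_ids, but they are hidden by the attention mask, and they are the only positions where labels is not -100.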

This is just one way I can think of off the top of my head. Feel free to correct me if this is wrong.

1 Like

@valhalla Thanks for your response. That’s an interesting approach! Does that still require humans to create training “context” strings for gpt2?

@valhalla If I understand this correctly:

  1. The input text will look like context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42

  2. Mask out the question part so the new text will look like
    context: 42 is the answer to life, the universe and everything. <BIG MASK> answer: 42

  3. That is what gets fed as input text into the GPT2 model

Does this mean I define the labels into the model as the text that is masked?

By mask, I meant attention_mask. The attention_mask should be zero on the text you want to predict, so the model won’t peek into the future.
So if you want to generate both the question and the answer, then the question and answer tokens should have 0
in the attention mask.
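For example, if the input is context tokens followed by question and answer tokens, a sketch of that mask (using torch, with made-up lengths) would be:

```python
import torch

# Made-up lengths: 4 context tokens, 3 question tokens, 1 answer token.
n_ctx, n_q, n_a = 4, 3, 1

# Attention everywhere, then zero out the question and answer
# positions, i.e. the text we want the model to predict.
attention_mask = torch.ones(1, n_ctx + n_q + n_a, dtype=torch.long)
attention_mask[0, n_ctx:] = 0
```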

Ah yes, sorry for my misunderstanding. So we mask out the parts we want to predict by setting the attention_mask of those tokens to 0.

With these tokens masked in attention_mask, do we then pass it and the input string to GPT2 and train it with the language model head with no labels?

You’ll still need to pass labels for training.

Training will be the same as training any GPT-2 model; the only difference is the attention_mask
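A minimal forward/backward sketch of that training step, using a tiny randomly initialized GPT-2 so nothing needs downloading (for real training you would use GPT2LMHeadModel.from_pretrained('gpt2') instead):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model just to illustrate the call signature;
# swap in the pretrained 'gpt2' weights for actual training.
config = GPT2Config(vocab_size=500, n_positions=64, n_embd=64,
                    n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)

input_ids = torch.tensor([[101, 102, 103, 201, 202, 203]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0, 0]])         # hide the question part
labels = torch.tensor([[-100, -100, -100, 201, 202, 203]])  # loss only on it

out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
out.loss.backward()  # the rest is a standard optimizer step
```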

1 Like

If I only wanted to generate questions, would I set the attention_mask for those tokens to 0 and use their text as the labels? Something like:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
def my_data_collator(text_str):
    encoded_results = tokenizer(text_str, padding=True, truncation=True, return_tensors='pt',
                                     return_attention_mask=True)
    encoded_results['attention_mask'] = set_my_attention_mask(encoded_results) # function to set attention mask to 0 on tokens in the question:... part of text_str
    label_ids = get_my_label_str(encoded_results['input_ids']) #function to return list of token ids for question:... part of text_str

    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['past'] = None
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = label_ids
    batch['use_cache'] = True
    return batch

text_str = 'context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42'

And batch would get passed to a GPT2LMHeadModel?

1 Like

This seems correct. One more thing to add: you can calculate the loss only on the question: ... part.

To do this, set labels to -100 for tokens before the question: part, so cross entropy will ignore them.

Also, you won’t need to explicitly set some arguments (position_ids, head_mask, etc.) to None.
They are None by default, so it’s okay if you don’t pass them. It will make the code cleaner.
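Putting both suggestions together, the collator could shrink to something like this (a sketch that assumes the text has already been split into context and question token ids; the index bookkeeping for finding the question: part is application-specific):

```python
import torch

def make_batch(context_ids, question_ids):
    """Build one training example: predict the question from the context.

    context_ids / question_ids are lists of token ids (a real collator
    would get them from the GPT-2 tokenizer).
    """
    input_ids = torch.tensor([context_ids + question_ids])

    # 0 on the question tokens so the model cannot attend to them.
    attention_mask = torch.tensor(
        [[1] * len(context_ids) + [0] * len(question_ids)]
    )

    # -100 on the context tokens so cross entropy ignores them.
    labels = torch.tensor(
        [[-100] * len(context_ids) + question_ids]
    )

    # position_ids, head_mask, etc. default to None, so we omit them.
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels}
```

A GPT2LMHeadModel can then be called directly as model(**make_batch(context_ids, question_ids)).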

3 Likes