I was wondering if it would be possible to somehow train GPT-2 to generate question-answer pairs in a particular domain?
I’ve tried this with seq2seq models. I have worked on qa pair generation (separately) using T5 with decent results. You can find it here.
One way we can do this with GPT-2 is to prepare our input like this. Say our context is 42 is the answer to life, the universe and everything, the answer is 42, and the target question is What is the answer to life, universe and everything ?. Then
input text: context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42
and prepare the attention mask such that there will be no attention from the question: ... part, so the model won’t look into future tokens, and calculate loss only on the question: ... part. At inference time we will feed only the context part and ask the model to generate the question.
This is just one way I can think of off the top of my head. Feel free to correct me if this is wrong.
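A rough sketch of what I mean, with the example strings hard-coded just for illustration (not exact code, just the idea):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

context = "42 is the answer to life, the universe and everything"
question = "What is the answer to life, universe and everything ?"
answer = "42"

# everything is concatenated into a single training string
text = f"context: {context}. question: {question} answer: {answer}"

# tokenizing the context prefix separately tells us where the question part
# starts, which is where the attention mask / loss treatment changes
prefix_len = len(tokenizer(f"context: {context}.")["input_ids"])
encoding = tokenizer(text, return_tensors="pt")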
@valhalla Thanks for your response. That’s an interesting approach! Does that still require humans to create training “context” strings for gpt2?
@valhalla If I understand this correctly:
- The input text will look like context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42
- Mask out the question part so the new text will look like context: 42 is the answer to life, the universe and everything. <BIG MASK> answer: 42
- That is what gets fed as input text into the GPT2 model
Does this mean I define the labels passed into the model as the text that is masked?
By mask, I meant the attention_mask: the attention_mask should be zero on the text you want to predict, so the model won’t peek into the future. So if you want to generate the question and answer, then the question and answer tokens should have 0 in the attention mask.
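For example, something like this (just a sketch; here I compute the context length by tokenizing the prefix separately, but there are other ways to do it):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

context_part = "context: 42 is the answer to life, the universe and everything."
text = context_part + " question: What is the answer to life, universe and everything ? answer: 42"

encoding = tokenizer(text, return_tensors="pt")
prefix_len = len(tokenizer(context_part)["input_ids"])

attention_mask = encoding["attention_mask"].clone()
attention_mask[0, prefix_len:] = 0  # zero attention on the question/answer tokens we want to predict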
Ah yes, sorry for my misunderstanding. So we mask out the parts we want to predict by setting the attention_mask of those tokens to 0. With these tokens masked in the attention_mask, do we then pass it and the input string to GPT2 and train it with the language model head with no labels?
You’ll still need to pass labels for training. Training will be the same as training any GPT-2 model; the only difference is the attention_mask.
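Roughly like this (a sketch; with labels passed, the first element of the output is the loss, and newer transformers versions also expose it as outputs.loss):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

text = ("context: 42 is the answer to life, the universe and everything. "
        "question: What is the answer to life, universe and everything ? answer: 42")
encoding = tokenizer(text, return_tensors="pt")

# labels are shifted inside the model, so passing input_ids as labels works;
# the attention_mask here would be the modified one described above
outputs = model(input_ids=encoding["input_ids"],
                attention_mask=encoding["attention_mask"],
                labels=encoding["input_ids"])
loss = outputs[0]
loss.backward()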
If I only wanted to generate questions, would I set the attention_mask for those tokens to 0 and use their text as the labels? Something like:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS for padding

def my_data_collator(text_str):
    encoded_results = tokenizer(text_str, padding=True, truncation=True, return_tensors='pt',
                                return_attention_mask=True)
    encoded_results['attention_mask'] = set_my_attention_mask(encoded_results)  # function to set attention mask to 0 on tokens in the question:... part of text_str
    label_ids = get_my_label_str(encoded_results['input_ids'])  # function to return list of token ids for question:... part of text_str

    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['past'] = None
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = label_ids
    batch['use_cache'] = True
    return batch

text_str = 'context: 42 is the answer to life, the universe and everything. question: What is the answer to life, universe and everything ? answer: 42'
And batch would get passed to a GPT2LMHeadModel?
This seems correct. One more thing to add: you can calculate loss only on the question: ... part. To do this, set labels to -100 for the tokens before the question: part, so cross entropy will ignore them.
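For example (a sketch, again computing the context length by tokenizing the prefix separately):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

context_part = "context: 42 is the answer to life, the universe and everything."
text = context_part + " question: What is the answer to life, universe and everything ? answer: 42"

encoding = tokenizer(text, return_tensors="pt")
prefix_len = len(tokenizer(context_part)["input_ids"])

labels = encoding["input_ids"].clone()
labels[0, :prefix_len] = -100  # -100 is ignored by cross entropy, so no loss on the context tokens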
Also, you won’t need to explicitly set some arguments (position_ids, head_mask etc.) to None. They are None by default, so it’s okay if you don’t pass them. It will make the code cleaner.
@valhalla if we set the context labels to -100, this will make the model ignore the context while training. In other words, the generation of the questions won’t be context-based. Am I right?