This seems to suggest that the context (the question) is what should be set to -100, and not the part that has to be generated (the answer?).
So my question is: which component should be set to -100 when doing language modeling? The tokens we want to predict, or the tokens that are there for extra information (the "context", the "question" for which the model needs to generate an answer)?
Hi @Kwiebes1995!
Let me try to clear some things up from that post. I think the title is a bit misleading, since it says QA pairs, but ultimately I was interested in question generation. Let's assume for this discussion that we are working on question generation, i.e. I want GPT2 to generate a relevant question based on a context and an answer.
I carried out the finetuning on this task as follows:
Create a finetuning set in the following format:
text_str = 'context: 42 is the answer to life, the universe and everything. answer: 42. question: What is the answer to life, universe and everything ?'
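To build such a set from raw (context, answer, question) triples, a small helper like this could be used (the function name is just illustrative, not from my original code):

```python
# Illustrative helper that builds one finetuning example
# in the 'context: ... answer: ... question: ...' format shown above
def make_example(context: str, answer: str, question: str) -> str:
    return f'context: {context} answer: {answer}. question: {question}'

text_str = make_example(
    '42 is the answer to life, the universe and everything.',
    '42',
    'What is the answer to life, universe and everything ?',
)
```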
After encoding an example with a tokenizer, set the attention mask to 0 for everything from the 'question: What is the ...' text onwards, since this is the text we want to predict.
We will want to calculate the loss only on the 'question: What is the ...' text. To do this, we set the label value for everything that comes before it to -100. This ensures that cross entropy ignores that part of the example.
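To see why -100 in particular works: PyTorch's cross entropy uses ignore_index=-100 by default, so positions labeled -100 contribute nothing to the loss. A minimal sketch with random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)                   # 5 token positions, vocab size 10
labels = torch.tensor([-100, -100, 3, 7, 2])  # first two positions ignored

# ignore_index defaults to -100, so masked positions drop out of the mean
loss_all = F.cross_entropy(logits, labels)
loss_tail = F.cross_entropy(logits[2:], labels[2:])
assert torch.isclose(loss_all, loss_tail)
```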
Here is an explicit piece of code that should help with what has been described:
```python
from typing import List

import torch
from transformers import GPT2Tokenizer


def qgen_data_collator(text_list: List[str]) -> dict:
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    # Token id for ' question' (with leading space), used to find the split point
    q_id = tokenizer(' question', return_tensors='pt')['input_ids'][0][0]

    encoded_results = tokenizer(text_list, padding=True, truncation=True,
                                return_tensors='pt', return_attention_mask=True)

    # One (row, column) index per occurrence of q_id; assumes ' question'
    # appears exactly once per example
    q_idxs = (encoded_results['input_ids'] == q_id).nonzero()

    # Zero the attention mask from ' question' onwards (the text to predict)
    for idx, attn_mask in enumerate(encoded_results['attention_mask']):
        attn_mask[q_idxs[idx][1]:] = 0

    # Set the labels before ' question' to -100 so cross entropy ignores them
    tmp_labels = []
    for idx, input_id in enumerate(encoded_results['input_ids']):
        label = input_id.detach().clone()
        label[:q_idxs[idx][1]] = -100
        tmp_labels.append(label)

    batch = {}
    batch['input_ids'] = encoded_results['input_ids']
    batch['attention_mask'] = encoded_results['attention_mask']
    batch['labels'] = torch.stack(tmp_labels)
    return batch
```
This worked with Transformers 3.0.2. To summarize: the attention_mask for the text you want to predict gets set to 0, and the labels for the text that is not being predicted get set to -100.
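The same masking logic can be checked on a toy batch without loading a tokenizer. Here 42 is a stand-in for the real ' question' token id, and the tensors are made up for illustration:

```python
import torch

q_id = 42  # stand-in for the ' question' token id
input_ids = torch.tensor([[10, 11, 42, 12, 13],
                          [20, 42, 21, 22, 23]])
attention_mask = torch.ones_like(input_ids)
labels = input_ids.detach().clone()

# One (row, column) index per occurrence of q_id, as in the collator
q_idxs = (input_ids == q_id).nonzero()
for idx in range(input_ids.size(0)):
    pos = q_idxs[idx][1]
    attention_mask[idx, pos:] = 0   # text to predict: masked out
    labels[idx, :pos] = -100        # context: ignored by the loss

# labels[0] is now [-100, -100, 42, 12, 13]
# attention_mask[0] is now [1, 1, 0, 0, 0]
```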