Labels in language modeling: which tokens to set to -100?

I am confused about how we should use “labels” when doing non-masked language modeling tasks (for instance, the labels in OpenAIGPTDoubleHeadsModel).

I found this example of how to use OpenAI GPT for ROCStories,

and there it seems that the tokens in the continuation part are set to -100, and not the context tokens (i.e., the other inputs). I also found this discussion:
https://discuss.huggingface.co/t/gpt2-for-qa-pair-generation/759

which seems to suggest the opposite: that the context (the question) is what should be set to -100, while the part to be generated (the answer) should not.

So my question is: which part should be set to -100 when doing language modeling, the tokens that we want to predict, or the tokens that are only there as extra information (the “context”, or the “question” for which the model needs to generate an answer)?

Hi @Kwiebes1995!
Let me try to clear some things up from that post. The title is a bit misleading since it says QA pairs, but ultimately I was interested in question generation. Let’s assume for this discussion that we are working on question generation, i.e. I want GPT2 to generate a relevant question based on a context and an answer.

I carried out the finetuning on this task as follows:

  • Create a finetuning set in the following format:
text_str = 'context: 42 is the answer to life, the universe and everything. answer: 42. question: What is the answer to life, universe and everything ?'
  • After encoding an example with the tokenizer, set the attention mask to 0 for everything from the “question: What is the…” text onward, since this is the text we want to predict.
  • We want to calculate the loss only on the “question: What is the…” text. To do this we need to set the label value for everything that comes before that text to -100; this ensures that cross entropy ignores that part of the example.

Here is an explicit piece of code that should help with what has been described:

from typing import List

import torch
from transformers import GPT2Tokenizer


def qgen_data_collator(text_list: List[str]) -> dict:
    """Collate raw example strings into a batch for question-generation finetuning."""
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    # Token id of ' question', used to locate where the question text starts.
    q_id = tokenizer(' question', return_tensors='pt')['input_ids'][0][0]

    encoded_results = tokenizer(text_list, padding=True, truncation=True,
                                return_tensors='pt', return_attention_mask=True)

    # Position of the ' question' token in each example
    # (assumes it occurs exactly once per example).
    q_idxs = (encoded_results['input_ids'] == q_id).nonzero()

    # Zero out the attention mask from the question onward (the text to predict).
    for idx, attn_mask in enumerate(encoded_results['attention_mask']):
        attn_mask[q_idxs[idx][1]:] = 0

    # Labels: copy the input ids and set everything before the question to -100
    # so cross entropy ignores the context/answer part.
    tmp_labels = []
    for idx, input_id in enumerate(encoded_results['input_ids']):
        label = input_id.detach().clone()
        label[:q_idxs[idx][1]] = -100
        tmp_labels.append(label)

    batch = {}
    batch['input_ids'] = torch.stack(list(encoded_results['input_ids']))
    batch['attention_mask'] = torch.stack(list(encoded_results['attention_mask']))
    batch['labels'] = torch.stack(tmp_labels)
    return batch
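
For reference, here is a quick sanity check of the collator; the example strings and the printed shapes are purely illustrative:

# Illustrative usage of qgen_data_collator.
sample_texts = [
    'context: 42 is the answer to life, the universe and everything. answer: 42. '
    'question: What is the answer to life, universe and everything ?',
    'context: Paris is the capital of France. answer: Paris. '
    'question: What is the capital of France ?',
]

batch = qgen_data_collator(sample_texts)
print(batch['input_ids'].shape)   # (2, sequence_length)
print(batch['labels'][0][:10])    # -100 for the context/answer part of the example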

This worked with Transformers 3.0.2. To summarize: the attention_mask for the text you want to predict gets set to 0, and the labels value for the text that is not being predicted gets set to -100.
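
The -100 value works because PyTorch’s cross-entropy loss skips any target equal to its ignore_index, which defaults to -100. Here is a tiny toy snippet (unrelated to the collator above) that illustrates this:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 positions, vocabulary of 10
targets = torch.tensor([-100, -100, 3, 7])  # first two positions are masked out

# Only the last two positions contribute; ignore_index defaults to -100.
loss = F.cross_entropy(logits, targets)
print(loss)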

Let me know if that clears things up.
