Labels in language modeling: which tokens to set to -100?

I am confused about how we should use “labels” when doing non-masked language modeling tasks (for instance, the labels in OpenAIGPTDoubleHeadsModel).

I found this example of how to use OpenAI GPT for ROCStories,

and there it seems that the tokens in the continuation part are set to -100, and not the context tokens (i.e., the other inputs). I also found this discussion:
https://discuss.huggingface.co/t/gpt2-for-qa-pair-generation/759

which seems to suggest the opposite: that the context (the question) is what should be set to -100, while the part to be generated (the answer) should not.

So my question is: which part should be set to -100 when doing language modeling, the tokens that we want to predict, or the tokens that are only there as extra information (the “context”, or the “question” for which the model needs to generate an answer)?

Hi @Kwiebes1995!
Let me try to clear some things up from that post. The title is a bit misleading since it says QA pairs, but ultimately I was interested in question generation. Let’s assume for this discussion that we are working on question generation, i.e. I want GPT2 to generate a relevant question based on a context and an answer.

I carried out the finetuning on this task as follows:

  • Create a finetuning set in the following format:
text_str = 'context: 42 is the answer to life, the universe and everything. answer: 42. question: What is the answer to life, universe and everything ?'
  • After encoding an example with the tokenizer, set the attention mask to 0 for everything from the “question: What is the…” text onward, since this is the text we want to predict.
  • We want to calculate the loss only on the “question: What is the…” text. To do this we need to set the label value for everything that comes before that text to -100; this ensures that cross entropy ignores that part of the example.

Here is an explicit piece of code that should help with what has been described:

from typing import List

import torch
from transformers import GPT2Tokenizer


def qgen_data_collator(text_list: List[str]) -> dict:
    """Collate raw example strings into a batch for question-generation finetuning."""
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    # Token id of ' question', used to locate where the question text starts.
    q_id = tokenizer(' question', return_tensors='pt')['input_ids'][0][0]

    encoded_results = tokenizer(text_list, padding=True, truncation=True,
                                return_tensors='pt', return_attention_mask=True)

    # Position of the ' question' token in each example
    # (assumes it occurs exactly once per example).
    q_idxs = (encoded_results['input_ids'] == q_id).nonzero()

    # Zero out the attention mask from the question onward (the text to predict).
    for idx, attn_mask in enumerate(encoded_results['attention_mask']):
        attn_mask[q_idxs[idx][1]:] = 0

    # Labels: copy the input ids and set everything before the question to -100
    # so cross entropy ignores the context/answer part.
    tmp_labels = []
    for idx, input_id in enumerate(encoded_results['input_ids']):
        label = input_id.detach().clone()
        label[:q_idxs[idx][1]] = -100
        tmp_labels.append(label)

    batch = {}
    batch['input_ids'] = torch.stack(list(encoded_results['input_ids']))
    batch['attention_mask'] = torch.stack(list(encoded_results['attention_mask']))
    batch['labels'] = torch.stack(tmp_labels)
    return batch
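
For reference, here is a quick sanity check of the collator; the example strings and the printed shapes are purely illustrative:

# Illustrative usage of qgen_data_collator.
sample_texts = [
    'context: 42 is the answer to life, the universe and everything. answer: 42. '
    'question: What is the answer to life, universe and everything ?',
    'context: Paris is the capital of France. answer: Paris. '
    'question: What is the capital of France ?',
]

batch = qgen_data_collator(sample_texts)
print(batch['input_ids'].shape)   # (2, sequence_length)
print(batch['labels'][0][:10])    # -100 for the context/answer part of the example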

This worked with Transformers 3.0.2. To summarize: the attention_mask for the text you want to predict gets set to 0, and the labels value for the text that is not being predicted gets set to -100.
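
The -100 value works because PyTorch’s cross-entropy loss skips any target equal to its ignore_index, which defaults to -100. Here is a tiny toy snippet (unrelated to the collator above) that illustrates this:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 positions, vocabulary of 10
targets = torch.tensor([-100, -100, 3, 7])  # first two positions are masked out

# Only the last two positions contribute; ignore_index defaults to -100.
loss = F.cross_entropy(logits, targets)
print(loss)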

Let me know if that clears things up.
