What is the context as per run_clm?

I am using a dataset from the datasets library for a text-generation task. For each example in the dataset, I want to provide sentence1 and label as context, and let the network generate sentence2, which should then be used for the loss calculation. In the run_clm script, I'm not able to find where this distinction is made, i.e. what is being used as context.

Based on my understanding, attention_mask and token_type_ids control what the network sees and what it doesn't. So from this, I want label + sentence1 to have 0s and sentence2 to have 1s in the attention_mask.

def tokenize_function(examples):
    tokenized_input = tokenizer(LABELS[examples['label']], examples['sentence1'])
    # not sure how to manage examples['sentence2']
    return tokenized_input

I want the output to be [CLS] [SEP] label [SEP] sentence 1 [SEP] sentence 2 (to be generated) [SEP]

Can anyone clarify ?

If I understand your question correctly, you want to calculate the loss using only the generated text.
This is not handled by the run_clm script.

To handle this, you can prepare the labels so that every token has the value -100 except the ones you want to include in the loss; tokens with the value -100 are ignored by the CrossEntropy loss function.
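A minimal sketch of that masking in plain Python (the token ids here are made up for illustration; `mask_context` is a hypothetical helper, not part of any script):

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def mask_context(input_ids, context_len):
    """Copy input_ids into labels, masking the first context_len tokens."""
    labels = list(input_ids)
    labels[:context_len] = [IGNORE_INDEX] * context_len
    return labels

# e.g. a 3-token context followed by a 2-token target
labels = mask_context([5, 9, 985, 17, 23], context_len=3)
# labels == [-100, -100, -100, 17, 23]
```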

Thanks @valhalla for your response.

You mean like this ? Is that where the -100 goes ?

# inside group_texts function
# 985 is the id of ']'; we take its second occurrence as the context boundary
marker = [i for i, n in enumerate(examples["input_ids"]) if n == 985][1]
result["labels"] = result["input_ids"].copy()
result["labels"][:marker + 1] = [-100] * (marker + 1)

I’m a bit confused about labels here. In the script I see result["labels"] = result["input_ids"].copy(), which just means we take everything as the label and then perform CLM on both sentences. Also, what about attention_mask and token_type_ids? I’m not clear on where to specify the context, and where to specify the label (i.e. sentence2).

Can you please clarify how to go about it ?

Based on your suggestion, this is what I did.

def group_texts(examples):
    # Find the start of the second sentence (985 is the id of ']';
    # we take its second occurrence).
    second_sentence_start_pos = [i for i, n in enumerate(examples['input_ids']) if n == 985][1]
    token_type_ids = examples["token_type_ids"].copy()
    labels = examples["input_ids"].copy()

    token_type_ids[:second_sentence_start_pos+1] = [0] * (second_sentence_start_pos + 1)
    token_type_ids[second_sentence_start_pos+1:] = [1] * (len(examples['input_ids']) - second_sentence_start_pos - 1)
    labels[:second_sentence_start_pos+1] = [-100] * (second_sentence_start_pos + 1)

    examples["token_type_ids"] = token_type_ids
    examples["labels"] = labels
    return examples

Is this what you meant ? Is this the right way to deal with attention_masks and token_type_ids ?

Since GPT-2 is an auto-regressive decoder, we need to feed both the context and target as the input text.

In the script I see result["labels"] = result["input_ids"].copy(), which just means we are taking everything as the label and then performing CLM on both sentences

Since it’s a CLM, during training it predicts the next token for every token in the input.
So we pass labels=input_ids, and the labels are shifted to the right inside the model.

i.e. if input_ids = [BOS] tok1 tok2 tok3 ...
then the effective targets are tok1 tok2 tok3 ...

and all the tokens in the labels are considered when calculating the loss.
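The internal shift described above can be sketched in plain Python (ignoring batching and logits; `clm_loss_pairs` is an illustrative helper, not an actual model method). Each position predicts the token one step to its right, and pairs whose label is -100 are dropped from the loss:

```python
def clm_loss_pairs(input_ids, labels):
    """Pair each position with its target the way a CLM does:
    the token at position i predicts labels[i + 1]."""
    shifted_inputs = input_ids[:-1]  # the last token has nothing to predict
    shifted_labels = labels[1:]      # the first token is never a target
    # only pairs whose label is not -100 contribute to the loss
    return [(inp, lab) for inp, lab in zip(shifted_inputs, shifted_labels)
            if lab != -100]

# labels == input_ids, as run_clm does:
ids = ["[BOS]", "tok1", "tok2", "tok3"]
pairs = clm_loss_pairs(ids, ids)
# [('[BOS]', 'tok1'), ('tok1', 'tok2'), ('tok2', 'tok3')]
```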

Now, if we want to use a CLM for a task that requires context and target (let’s say a summarization task), and we only want the model to generate the target (summary), then:

During training:

we need to concatenate both the context (source_text) and the target (summary) into one single input text,
i.e. input = [BOS] source_text [SEP] summary

and prepare labels such that tokens corresponding to the context (source text) will have -100 as the value. This way we ensure that the loss is only calculated for summary tokens so the model will (hopefully) only learn to generate summary (or target).

During inference,
we would feed the source text with the separator and ask the model to generate the next tokens, which in this case will be the summary, i.e.
input = [BOS] source_text [SEP]

Also, the attention_mask here is the same as the attention_mask in every other model: it specifies which tokens in the input should be attended to and which should be ignored.
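In practice the mask only becomes non-trivial with padding. A small sketch (pure Python; `pad_batch` and the pad id 0 are illustrative assumptions, not a library function):

```python
PAD_ID = 0  # hypothetical pad token id; use your tokenizer's actual pad id

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to equal length; the mask is 1 on real
    tokens and 0 on padding."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        input_ids.append(list(s) + [pad_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[7, 8, 9], [7, 8]])
# mask == [[1, 1, 1], [1, 1, 0]]
```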

For reference, you could look at this conv ai example, which uses this approach for a conversational agent with GPT-2.

Hope this helps

Thanks for the response. I am using CTRL for my task, and not much is available on how to fine-tune it properly. I have opened a separate thread but got no response.
So the attention_mask stays intact because we require the model to attend to the context. As you say, we control the learning entirely through the -100 entries in labels, so our attention_mask is basically a tensor filled with 1s. As for token_type_ids, I am setting the parts that need to be generated to 1 and the rest to 0, as per the BERT documentation.
Now for labels, I’ve made everything -100 except the parts that need to be generated, and the shifting happens automatically inside the model. Having done all this, I’m still not able to get good generations.
For CTRL, I don’t think we use a SEP token; I didn’t see it being used in the original repo. Also, the tokenizer (BPE-based) works quite differently, i.e. it does not add special tokens as one would expect.
So my sentences were like [Custom control Code] [Sentence1] [Sentence2].

The documentation says: “The PyTorch models can take the past as input, which is the previously computed key/value attention pairs.” But this is only applicable during inference, right?

Thanks @valhalla, I think the most important part was dealing with the labels. I was performing the shifting manually. The generations have improved now.