Cross Entropy Weighted

Hi all,

I am using this Notebook created by @valhalla to fine-tune a T5 model on my own classification task. I would like to apply class weighting in my loss function, since my data is highly imbalanced. This is what I have tried so far:

  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):  # in lightning, forward defines the prediction/inference actions
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    logits = outputs[1]

    ##### IMBALANCE LEARNING
    class_weights = torch.FloatTensor(self.hparams.class_weights).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct(logits, lm_labels)
    return loss

But it doesn’t work. I am passing class_weights as a list of two elements (one per class) through the hyperparameters.

I think I don’t fully understand how the loss is computed using the logits and the labels.
I would appreciate any help, since I am pretty stuck.

Best,
Marcos

Which weights did you assign and what do you mean by “it does not work”? Do you get an error? If so, post the full error trace.

Hi @BramVanroy,

Thanks for your quick reply. As I said, I am trying to implement a binary classification task, but the data is imbalanced. So the weights that I used were self.hparams.class_weights = [1, 7.48] (in this list form).

The error is the following:

ValueError: Expected target size (8, 32128), got torch.Size([8, 2])

The logits tensor has shape (8, 2, 32128) and the labels tensor has shape (8, 2). However, the code worked perfectly when I used the loss returned by the model, like this:

  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):  # in lightning, forward defines the prediction/inference actions
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    loss = outputs[0]
    return loss

The problem is that, trained this way, the model learns to always predict the majority class. That’s why I tried to apply a separate CrossEntropyLoss with class weights:

    ##### IMBALANCE LEARNING
    class_weights = torch.FloatTensor(self.hparams.class_weights).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct(logits, lm_labels)
    return loss

I hope I made it clearer now.

Marcos

Hi @BramVanroy,

IMPORTANT UPDATE:

I have been trying different things. First of all, I have checked that the loss returned by the model is the same as the one from CrossEntropyLoss without class_weights:

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    print("Loss 1", outputs[0])
    logits = outputs[1]
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    loss = loss_fct( logits.view(-1, logits.size(-1)), lm_labels.view(-1))
    print("Loss 2", loss)
    return loss

This works perfectly. However, when I introduce class_weights like this:

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    print("Loss 1", outputs[0])
    logits = outputs[1]
    class_weights = torch.FloatTensor([1, 7.8]).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct( logits.view(-1, logits.size(-1)), lm_labels.view(-1))
    print("Loss 2", loss)
    return loss

I get the following error:

RuntimeError: weight tensor should be defined either for all 32128 classes or no classes but got weight tensor of shape: [2]

Why is this happening, given that it is a binary classification problem?

Best,
Marcos
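A note on why the weight tensor is rejected: CrossEntropyLoss takes the number of classes from the class dimension of the logits, which here is the vocabulary size (32128), not the number of task labels. The sketch below reproduces this with a toy vocabulary of 10 entries; the sizes and variable names are assumptions made for illustration only.

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10              # toy stand-in for T5's 32128 (illustrative assumption)
batch_size, seq_len = 4, 2

logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

flat_logits = logits.view(-1, vocab_size)   # (8, 10): one row per target token
flat_labels = labels.view(-1)               # (8,): one class index per target token

# A weight vector with one entry per vocabulary item is accepted.
ok_loss = CrossEntropyLoss(weight=torch.ones(vocab_size))(flat_logits, flat_labels)

# A weight vector with one entry per task label (2 entries) is rejected,
# because the loss sees vocab_size classes, not 2.
try:
    CrossEntropyLoss(weight=torch.tensor([1.0, 7.48]))(flat_logits, flat_labels)
    two_weight_error = None
except RuntimeError as err:
    two_weight_error = str(err)

print(two_weight_error)
```

With all weights equal to one, the weighted loss is the same as the unweighted one, so a vocabulary-sized weight vector changes nothing until individual entries are modified.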

What’s the shape of logits?

The logits have shape (8, 2, 32128) and the labels have shape (8, 2). After applying view, the final shapes are (16, 32128) for the logits and (16,) for the labels:

logits.view(-1, logits.size(-1)), lm_labels.view(-1)

Any thoughts?
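For reference, the shapes produced by the flattening can be checked on random tensors. This is a minimal sketch with a toy vocabulary size (an assumption, chosen to keep it light); it also shows that moving the vocabulary dimension to position 1 gives the same loss without flattening, which is the layout CrossEntropyLoss expects for inputs with more than two dimensions.

```python
import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, vocab_size = 8, 2, 50   # toy vocabulary size (assumption)
logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# Flatten batch and sequence dimensions together.
flat_logits = logits.view(-1, logits.size(-1))
flat_labels = labels.view(-1)
print(flat_logits.shape, flat_labels.shape)  # torch.Size([16, 50]) torch.Size([16])

loss_flat = CrossEntropyLoss()(flat_logits, flat_labels)

# Same result without flattening: put the class (vocabulary) dimension at index 1.
loss_perm = CrossEntropyLoss()(logits.permute(0, 2, 1), labels)
```

Passing the unflattened (8, 2, 32128) logits directly makes the loss treat dimension 1 (size 2) as the class dimension, which is consistent with the "Expected target size (8, 32128)" error quoted earlier in the thread.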

What kind of model are you using? The criterion expects a tensor of probabilities (x batch size). So it seems like you do not have a final classifier layer at the top or that you pass the arguments incorrectly.

EDIT: I was wrong. We are talking about T5 which sees every problem as text-to-text.

Hi @BramVanroy,

I am using T5ForConditionalGeneration. The batch size is 8, and the vocabulary size is 32128.

What do you mean by a final layer? Like a softmax? I assumed that CrossEntropyLoss did that internally.

FYI, when I predict the probs for the test examples I use the following code:

logits = logits.squeeze(1)
selected_logits = logits[:, [12153, 2024]]
print(selected_logits)
probs = torch.softmax(selected_logits, dim=-1)

Do you mean something like this?
Marcos
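A small sketch of that prediction step on random tensors, assuming the logits come from a single decoding step with shape (batch, 1, vocab). Note that torch.nn.Softmax is a module class, so calling it directly on a tensor constructs a layer rather than returning probabilities; the functional torch.softmax with an explicit dim does the normalisation. The token ids are the ones quoted in the thread; the random logits are a stand-in for real model output.

```python
import torch

vocab_size = 32128
label_token_ids = [12153, 2024]            # token ids quoted in the thread

# Random stand-in for the decoder logits of the first generated position.
logits = torch.randn(8, 1, vocab_size)

first_step = logits.squeeze(1)                      # (8, 32128)
selected_logits = first_step[:, label_token_ids]    # (8, 2): only the two label tokens
probs = torch.softmax(selected_logits, dim=-1)      # each row sums to 1

print(probs.shape)
```

Restricting the softmax to the two label tokens renormalises over those tokens only, so the result is a probability over the two labels rather than over the full vocabulary.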

I was wrong in my earlier statements. I was not taking into account that this is about T5 which formulates every problem as a text-to-text problem where the output labels are indeed “text” as taken from the vocab.

I am not sure how you can use weighted cross entropy loss here because the labels are not necessarily just one token (which would be easy). I’ll let @valhalla take this one.

But please do not “topic hijack” other topics (T5 user defined loss function - #14 by peggy).

Hi @BramVanroy,

Yes, that’s the option… It does not work as a normal BERT model for example.
I think I managed to solve it. I created a class weight tensor with the same size as the vocabulary and filled it with zeros, except at the positions that encode the labels (the words ‘incorrect’ and ‘correct’ in this case):

class_weights = np.zeros(logits.shape[-1])
class_weights[12153] = 7.48
class_weights[2024] = 1
class_weights_t = torch.from_numpy(class_weights).float().cuda()
loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights_t)
loss = loss_fct(logits.view(-1, logits.size(-1)), lm_labels.view(-1))

Sorry for the “topic hijack”. I am new to this kind of forum, and I just wanted an answer to my problem, hahaha.

Thanks,
Marcos

Yes, that is the option I mentioned, but it does not work well for longer labels. If each label corresponds to exactly one token, then you can do this. But if you want your code to work dynamically (including with longer labels that are tokenised into multiple tokens), then this won’t work.

By the way, using torch.zeros(logits.shape[-1], device="cuda") will probably be faster.
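A minimal sketch of building the vocabulary-sized weight tensor directly on the device. The helper name make_vocab_weights and the end-of-sequence id are assumptions for illustration; the two label token ids and the 7.48 weight are taken from the thread. One thing to watch: with zeros everywhere else, any target token other than the two label tokens (for example an end-of-sequence token, which the (8, 2) label shape suggests is present) contributes nothing to the loss. Filling with ones instead keeps those positions in the objective.

```python
import torch
from torch.nn import CrossEntropyLoss

device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size = 32128
minority_id, majority_id = 12153, 2024   # label token ids quoted in the thread
eos_id = 1                               # assumption: id of the end-of-sequence token

def make_vocab_weights(fill):
    """Vocabulary-sized weight vector; `fill` is the weight of every non-label token."""
    w = torch.full((vocab_size,), fill, device=device)
    w[minority_id] = 7.48
    w[majority_id] = 1.0
    return w

zero_fill = make_vocab_weights(0.0)   # non-label targets are ignored by the loss
one_fill = make_vocab_weights(1.0)    # non-label targets keep a weight of 1

# Toy batch: one sample per class, each target followed by the assumed EOS token.
labels = torch.tensor([[minority_id, eos_id], [majority_id, eos_id]], device=device)
logits = torch.randn(2, 2, vocab_size, device=device)
flat_logits, flat_labels = logits.view(-1, vocab_size), labels.view(-1)

loss_zero = CrossEntropyLoss(weight=zero_fill)(flat_logits, flat_labels)
loss_one = CrossEntropyLoss(weight=one_fill)(flat_logits, flat_labels)
```

Which fill value is appropriate depends on whether the model should still be trained on the non-label target positions; that choice is not settled in this thread.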

I am curious to see whether @valhalla knows how to deal with this issue in a better, general way.

Yes, you are right. In this case I was lucky, since each label is a single word.
Thanks for the cuda trick :wink:
Yeah, a more general solution would be fantastic. Anyway, thanks for your kind attention.


I’ve implemented a custom loss function for the general case that weights each individual loss value per token based on the class of the sample which generated the specific individual loss value.

The example below is for a simple Yes/No QA task. The model is a GPT2LMHeadModel, but I don’t see why it wouldn’t work for any text output. I’m curious to see if others have thoughts on this. Hopefully it can help someone else!

    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to("cuda") for k, v in batch.items()}
        no_mask = batch["No"].float() # mask of shape (batch_size,), 1.0 for "No" samples; cast to float because (1 - mask) is not supported for bool tensors
        del batch["No"]
        with torch.cuda.amp.autocast():
            outputs = model(**batch)

            # Weighted loss
            labels = batch["labels"]
            logits = outputs["logits"]
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = torch.nn.CrossEntropyLoss(reduction='none') # get individual loss values
            raw_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            raw_loss = raw_loss.view(shift_labels.shape) # reshape to (batch_size, seq_length)
            weights = (1 - no_mask) * yes_weight + no_mask * no_weight # no weight is scaled by relative tokenized length
            weights = weights.repeat(raw_loss.shape[-1], 1).T # creates weights of shape (batch_size, seq_length) where the weight of a given sample is repeated along seq_length
            weighted_loss = raw_loss * weights
            loss_mask = shift_labels != ignore_index
            loss_masked = torch.masked_select(weighted_loss, loss_mask)
            weights_masked = torch.masked_select(weights, loss_mask)
            loss = loss_masked.sum() / weights_masked.sum() # mean reduce loss with weights as in CELoss

        loss.backward()

(credit for the neat loss reduction lines goes to this post on the PyTorch forum)
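The same idea can be packaged as a standalone function and checked on random tensors. This is a sketch under stated assumptions, not the code from the post above: the function name sample_weighted_lm_loss is made up for illustration, the logits are assumed to be already aligned with the labels (for a causal LM such as GPT-2, shift them first as in the loop above), and the toy sizes and weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def sample_weighted_lm_loss(logits, labels, sample_weights, ignore_index=-100):
    """Cross entropy where every token loss is scaled by the weight of its sample.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len), sample_weights: (batch,)
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",                 # keep one loss value per token
    ).view(labels.shape)
    # Broadcast each sample's weight along its sequence.
    weights = sample_weights.unsqueeze(1).expand_as(per_token)
    keep = labels != ignore_index         # drop padded / ignored positions
    # Weighted mean: normalise by the sum of the weights actually used.
    return (per_token * weights)[keep].sum() / weights[keep].sum()

torch.manual_seed(0)
logits = torch.randn(3, 4, 11)            # toy batch: 3 samples, 4 tokens, vocab of 11
labels = torch.randint(0, 11, (3, 4))
labels[0, 2:] = -100                      # pretend the first sample is padded

is_no = torch.tensor([True, False, False])                     # per-sample class mask
sample_weights = torch.where(is_no, torch.tensor(7.48), torch.tensor(1.0))

loss = sample_weighted_lm_loss(logits, labels, sample_weights)
uniform = sample_weighted_lm_loss(logits, labels, torch.ones(3))
```

With uniform weights the function reduces to the ordinary mean cross entropy over the non-ignored tokens, which is a quick sanity check before plugging in real class weights. Because the weight is attached to the sample rather than to a vocabulary entry, labels that tokenise into several tokens are handled the same way as single-token labels.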