Cross Entropy Weighted

spartan97 · February 9, 2021, 3:06pm

Hi all,

I am using this notebook created by @valhalla to fine-tune a T5 model on my own classification task. I would like to apply some kind of class weighting in my loss function, since I am dealing with highly imbalanced data. I have tried this so far:

  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):  # in lightning, forward defines the prediction/inference actions
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    logits = outputs[1]

    ##### IMBALANCE LEARNING
    class_weights = torch.FloatTensor(self.hparams.class_weights).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct(logits, lm_labels)
    return loss

But it doesn’t work. I am passing class_weights as a hyperparameter: a list of two elements, one per class.

I think I don’t fully understand how the loss is computed using the logits and the labels.
I would appreciate any help, since I am pretty stuck.

Best,
Marcos

BramVanroy · February 9, 2021, 5:01pm

Which weights did you assign and what do you mean by “it does not work”? Do you get an error? If so, post the full error trace.

spartan97 · February 9, 2021, 5:08pm

Hi @BramVanroy,

Thanks for your quick reply. As I said, I am trying to implement a binary classification task, but the data is imbalanced. So the weights that I used were self.hparams.class_weights = [1, 7.48] (in this list form).

The error is the following:

ValueError: Expected target size (8, 32128), got torch.Size([8, 2])

This happens because the logits tensor has shape (8, 2, 32128) and the labels tensor has shape (8, 2). However, the code worked perfectly like this:

  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):  # in lightning, forward defines the prediction/inference actions
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    loss = outputs[0]
    return loss

The problem is that, trained this way, the model learns to always return the majority class. That’s why I tried to implement another CrossEntropyLoss using the weights:

    ##### IMBALANCE LEARNING
    class_weights = torch.FloatTensor(self.hparams.class_weights).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct(logits, lm_labels)
    return loss

I hope I made it clearer now.

Marcos

spartan97 · February 9, 2021, 8:17pm

Hi @BramVanroy,

IMPORTANT UPDATE:

I have been trying different things. First of all, I have checked that the loss produced by the model is the same as the CrossEntropyLoss one without class_weights:

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    print("Loss 1", outputs[0])
    logits = outputs[1]
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    loss = loss_fct( logits.view(-1, logits.size(-1)), lm_labels.view(-1))
    print("Loss 2", loss)
    return loss

This works perfectly. However, when I introduce class_weights like this:

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )
    print("Loss 1", outputs[0])
    logits = outputs[1]
    class_weights = torch.FloatTensor([1, 7.8]).cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights)
    loss = loss_fct( logits.view(-1, logits.size(-1)), lm_labels.view(-1))
    print("Loss 2", loss)
    return loss

I get the following error:

RuntimeError: weight tensor should be defined either for all 32128 classes or no classes but got weight tensor of shape: [2]

Why is this happening, given that it is a binary classification problem?

Best,
Marcos

BramVanroy · February 9, 2021, 10:06pm

What’s the shape of logits?

spartan97 · February 9, 2021, 10:09pm

The logits have shape (8, 2, 32128), and the labels (8, 2). After applying the view and size functions, the final shapes are (16, 32128) for the logits and (16,) for the labels:

logits.view(-1, logits.size(-1)), lm_labels.view(-1)

Any thoughts?
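
For readers following along, here is a minimal sketch that reproduces both errors reported above. It assumes random tensors with the shapes quoted in the thread rather than real T5 outputs. The point it illustrates is that CrossEntropyLoss reads dimension 1 of its input as the class dimension, so after flattening, the number of classes is the vocabulary size (32128), not the two task labels.

```python
import torch
from torch.nn import CrossEntropyLoss

# Random stand-ins with the shapes quoted in the thread (not real model outputs).
batch_size, seq_len, vocab_size = 8, 2, 32128
logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# 1) Unflattened: dim 1 (size 2) is read as the class dimension, so the
#    target is expected to have shape (8, 32128) and the call fails.
try:
    CrossEntropyLoss()(logits, labels)
    shape_error = None
except (RuntimeError, ValueError) as err:
    shape_error = str(err)

# 2) Flattened: one row per target token, one column per vocabulary entry.
flat_logits = logits.view(-1, logits.size(-1))  # (16, 32128)
flat_labels = labels.view(-1)                   # (16,)
loss = CrossEntropyLoss(ignore_index=-100)(flat_logits, flat_labels)

# 3) A two-element weight fails: `weight` needs one entry per class,
#    and here the classes are the 32128 vocabulary tokens.
try:
    CrossEntropyLoss(weight=torch.tensor([1.0, 7.48]))(flat_logits, flat_labels)
    weight_error = None
except (RuntimeError, ValueError) as err:
    weight_error = str(err)

print(shape_error)
print(weight_error)
```

The exact wording and exception type of the messages may differ between PyTorch versions.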

BramVanroy · February 10, 2021, 11:48am

What kind of model are you using? The criterion expects a tensor of probabilities (x batch size). So it seems like you do not have a final classifier layer at the top or that you pass the arguments incorrectly.

EDIT: I was wrong. We are talking about T5 which sees every problem as text-to-text.

spartan97 · February 10, 2021, 12:10pm

Hi @BramVanroy,

I am using T5ForConditionalGeneration. The batch size is 8, and the vocabulary size is 32128.

What do you mean by a final layer? Like a softmax? I assumed that CrossEntropyLoss did that internally.

FYI, when I predict the probs for the test examples I use the following code:

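logits = logits.squeeze(1)
# Keep only the logits of the two label tokens
selected_logits = logits[:, [12153, 2024]]
print(selected_logits)
# torch.nn.Softmax(x) would only construct a module; apply softmax over the last dimension instead
probs = torch.softmax(selected_logits, dim=-1)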

Do you mean something like this?
Marcos

BramVanroy · February 10, 2021, 2:24pm

I was wrong in my earlier statements. I was not taking into account that this is about T5 which formulates every problem as a text-to-text problem where the output labels are indeed “text” as taken from the vocab.

I am not sure how you can use weighted cross entropy loss here because the labels are not necessarily just one token (which would be easy). I’ll let @valhalla take this one.

But please do not “topic hijack” other topics (T5 user defined loss function - #14 by peggy).

spartan97 · February 10, 2021, 4:08pm

Hi @BramVanroy,

Yes, that’s the option. It does not work like a normal BERT model, for example.
I think I managed to solve it. I created a class weight tensor with the same size as the vocabulary, and I filled it with zeros except for the positions that encode the labels (the words ‘incorrect’ and ‘correct’ in this case):

    class_weights = np.zeros(logits.shape[-1])
    class_weights[12153] = 7.48
    class_weights[2024] = 1
    class_weights_t = torch.from_numpy(class_weights).float().cuda()
    loss_fct = CrossEntropyLoss(ignore_index=-100, weight=class_weights_t)
    loss = loss_fct(logits.view(-1, logits.size(-1)), lm_labels.view(-1))

Sorry for the “topic hijack”. I am new to this kind of forum, and I just wanted a response to my problem hahaha

Thanks,
Marcos
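
A minimal sketch of the vocabulary-sized weight approach described above, assuming random logits in place of real model outputs. The token ids 12153 and 2024 come from this thread; the helper name build_vocab_weights and the use of id 1 as T5's end-of-sequence token are assumptions made for illustration. One thing the sketch makes visible: with zeros everywhere else, any other target token (such as the end-of-sequence token) contributes nothing to the loss, so the result is a weighted mean over the label tokens only.

```python
import torch
from torch.nn import CrossEntropyLoss


def build_vocab_weights(vocab_size, token_weights, default=0.0):
    """Hypothetical helper: one weight per vocabulary entry, `default` elsewhere."""
    weights = torch.full((vocab_size,), float(default))
    for token_id, weight in token_weights.items():
        weights[token_id] = weight
    return weights


vocab_size = 32128
# Token ids taken from the thread: 12153 gets the larger weight, 2024 the smaller.
weights = build_vocab_weights(vocab_size, {12153: 7.48, 2024: 1.0})

# Random stand-in for model logits of shape (batch, target_len, vocab).
logits = torch.randn(8, 2, vocab_size)
# Each target is a label token followed by a second token
# (id 1, assumed here to be T5's end-of-sequence token).
labels = torch.tensor([[12153, 1], [2024, 1]] * 4)

loss_fct = CrossEntropyLoss(ignore_index=-100, weight=weights)
loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```

Passing default=1.0 instead would keep the remaining tokens in the loss at unit weight; whether that is preferable depends on the task.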

BramVanroy · February 10, 2021, 4:26pm

Yes, that is the option I mentioned, but it does not work well for longer labels. If one token corresponds to one label, then you can do this - but if you want your code to work dynamically (also with longer labels that are tokenised into multiple tokens), then this won’t work.

By the way, using torch.zeros(logits.shape[-1], device="cuda") will probably be faster.

I am curious to see whether @valhalla knows how to deal with this issue in a better, general way.

spartan97 · February 10, 2021, 4:29pm

Yes, you are right. In this case, I was lucky since each label is a single word.
Thanks for the cuda trick.
Yeah, a more general solution would be fantastic. Anyway, thanks for your kind attention.

songs1 · June 30, 2023, 4:16am

I’ve implemented a custom loss function for the general case that weights each individual loss value per token based on the class of the sample which generated the specific individual loss value.

The example below is for a simple Yes/No QA task. The model is a GPT2LMHeadModel but I don’t see why it wouldn’t work for any text output. I’m curious to see if others have thoughts on this. Hopefully it can help someone else!

    ignore_index = -100  # assumed: matches the default ignore_index of torch.nn.CrossEntropyLoss
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to("cuda") for k, v in batch.items()}
        # Per-sample mask of shape (batch_size,), 1 where the answer is "No";
        # cast to float because `1 - mask` is not supported for bool tensors
        no_mask = batch.pop("No").float()
        with torch.cuda.amp.autocast():
            outputs = model(**batch)

            # Weighted loss
            labels = batch["labels"]
            logits = outputs["logits"]
            # Shift so that the logits at position t are scored against the token at t + 1
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = torch.nn.CrossEntropyLoss(reduction='none')  # get individual loss values
            raw_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            raw_loss = raw_loss.view(shift_labels.shape)  # reshape to (batch_size, seq_length - 1)
            # yes_weight and no_weight are scalars defined elsewhere; no_weight is scaled by relative tokenized length
            weights = (1 - no_mask) * yes_weight + no_mask * no_weight
            weights = weights.unsqueeze(1).expand_as(raw_loss)  # each sample's weight is repeated across its tokens
            weighted_loss = raw_loss * weights
            loss_mask = shift_labels != ignore_index
            loss_masked = torch.masked_select(weighted_loss, loss_mask)
            weights_masked = torch.masked_select(weights, loss_mask)
            loss = loss_masked.sum() / weights_masked.sum()  # mean reduce loss with weights as in CELoss

        loss.backward()

(credit for the neat loss reduction lines goes to this post on the PyTorch forum)
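
A self-contained sketch of the same per-sample weighting idea, wrapped in a function and run on small random tensors so it can be checked without a model or a GPU. The function name sample_weighted_lm_loss and the toy shapes are assumptions for illustration and are not part of the post above. With all sample weights equal to 1, the result should match the ordinary mean cross entropy over non-ignored tokens, which is a quick sanity check on the reduction.

```python
import torch
import torch.nn.functional as F


def sample_weighted_lm_loss(logits, labels, sample_weights, ignore_index=-100):
    """Causal LM cross entropy where every token of a sample shares that sample's weight."""
    # Shift so that position t predicts token t + 1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = torch.nn.CrossEntropyLoss(reduction="none", ignore_index=ignore_index)
    raw = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    raw = raw.view(shift_labels.shape)                    # (batch, seq_len - 1)
    weights = sample_weights.unsqueeze(1).expand_as(raw)  # broadcast per-sample weight
    mask = (shift_labels != ignore_index).to(raw.dtype)   # drop ignored positions
    # Weighted mean over the tokens that count.
    return (raw * weights * mask).sum() / (weights * mask).sum()


torch.manual_seed(0)
logits = torch.randn(2, 5, 11)          # (batch, seq_len, vocab), toy sizes
labels = torch.randint(0, 11, (2, 5))
labels[0, :2] = -100                    # e.g. prompt tokens excluded from the loss

uniform = sample_weighted_lm_loss(logits, labels, torch.ones(2))
reference = F.cross_entropy(
    logits[:, :-1].reshape(-1, 11), labels[:, 1:].reshape(-1), ignore_index=-100
)
weighted = sample_weighted_lm_loss(logits, labels, torch.tensor([1.0, 3.0]))
print(uniform, reference, weighted)
```

Because the weight is attached to the sample rather than to a vocabulary entry, this form does not depend on how many tokens a label is split into.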
