Multi-label token classification

Hi!

I am trying to solve a token classification problem in a multi-label setup, and so far I haven’t found a good way to do it.
Unlike AutoModelForSequenceClassification, AutoModelForTokenClassification lacks the problem_type parameter, where you can specify that you’re working with a multi-label problem.
Should I dig into the logits and write a custom loss, or is there a more straightforward solution?
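
For context, the sequence-classification side looks like this (a minimal sketch; the checkpoint name and num_labels are placeholders):

from transformers import AutoModelForSequenceClassification

# Available for sequence classification, but with no token-classification equivalent:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",                        # placeholder checkpoint
    num_labels=5,                               # placeholder label count
    problem_type="multi_label_classification",  # switches the loss to BCEWithLogitsLoss
)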

Thanks!

2 Likes

I would also love to be able to do this. I’m trying to do some NER that has multiple layers of annotation such that a given token could have more than one label…

Actually, I already started implementing a custom loss within the compute_loss function of the Trainer…

…but the thing surely deserves a feature request on GitHub! Will try to do that

1 Like

Oh nice. Do you by any chance have code you could share? :wink:

I don’t think it should be much harder than swapping out CrossEntropyLoss for BCEWithLogitsLoss, one-hot encoding your labels, and making sure they are floats.
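
As a minimal sketch of that swap (shapes and values here are made up):

import torch
from torch.nn import BCEWithLogitsLoss

num_labels = 3

# Two tokens, each with a multi-hot float label vector (a token can carry several labels).
logits = torch.randn(2, num_labels)           # raw model outputs
labels = torch.tensor([[1.0, 0.0, 1.0],       # token 1: labels 0 and 2
                       [0.0, 1.0, 0.0]])      # token 2: label 1 only

loss = BCEWithLogitsLoss()(logits, labels)    # targets must be floats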

2 Likes

Yeah, that’s roughly what I am doing. Will share the code once it’s done!

1 Like

@BunnyNoBugs Any update on this? :pleading_face: I have been working on this myself but am running into some challenges…

This is an untested attempt, but I think it should work. Read more about BCEWithLogitsLoss here.

from typing import Optional

import logging

from torch import FloatTensor
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelTrainer(Trainer):
    def __init__(self, *args, class_weights: Optional[FloatTensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        if class_weights is not None:
            class_weights = class_weights.to(self.args.device)
            logging.info("Using multi-label classification with class weights %s", class_weights)
        self.loss_fct = BCEWithLogitsLoss(weight=class_weights)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.
        Subclass and override for custom behavior.
        """
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        try:
            loss = self.loss_fct(outputs.logits.view(-1, model.num_labels), labels.view(-1))
        except AttributeError:  # DataParallel
            loss = self.loss_fct(outputs.logits.view(-1, model.module.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

Make sure your data is correctly formatted, e.g., 0 or 1 encoded for each label for each sample.
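
For example, the per-token labels for one sample might look like this (a made-up illustration with three label types):

# One sample of four tokens, three possible labels per token;
# each row is a token, each column is 0/1 for one label.
labels = [
    [1, 0, 0],  # token 0: label A only
    [1, 1, 0],  # token 1: labels A and B
    [0, 0, 0],  # token 2: no label
    [0, 0, 1],  # token 3: label C only
]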

1 Like

Yep, I did the same but without view and got an error with the dims. I will get back to this task on Thursday.

Thanks so much @BunnyNoBugs and @BramVanroy! I’m trying this now and getting some other issues but I think they are specific to my situation. I may report back later with an update if I think it will be useful to others. Thanks again!

Here’s a question: doesn’t the custom loss function need to ignore the predictions for special tokens like CLS, PAD, and SEP, and (if one is only applying labels to the first subword in each word) the non-first subwords?
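
For reference, those positions are usually marked with -100 when aligning labels at tokenization time. A sketch, assuming a fast tokenizer and one multi-hot float vector per word:

from transformers import AutoTokenizer

# Mark special tokens and 2nd+ subwords with -100 so a custom loss
# can filter them out later.
def align_labels(texts, word_labels, tokenizer, num_labels):
    enc = tokenizer(texts, truncation=True, is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(word_labels):
        word_ids = enc.word_ids(batch_index=i)
        aligned, prev = [], None
        for wid in word_ids:
            if wid is None or wid == prev:   # special token or 2nd+ subword
                aligned.append([-100.0] * num_labels)
            else:
                aligned.append(labels[wid])
            prev = wid
        all_labels.append(aligned)
    enc["labels"] = all_labels
    return enc

# Hypothetical usage: one sample of two words, three label types, pre-split input.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = align_labels([["New", "York"]], [[[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]], tokenizer, 3)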

Thanks, this was super helpful! I followed your tips and got dimension errors, but fixed them by modifying the view parameters; I also had to cast the labels to floats:

loss = self.loss_fct(outputs.logits.view(-1, model.num_labels), labels.view(-1, model.num_labels).float())

I need to look at my results more closely, but I think I’ve got this working. My model only gets to about 60% F1 on the validation set, but I have some ideas about how to improve it. :imp:

from typing import Optional

import logging

from torch import FloatTensor
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelNERTrainer(Trainer):
    def __init__(self, *args, class_weights: Optional[FloatTensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        if class_weights is not None:
            class_weights = class_weights.to(self.args.device)
            logging.info("Using multi-label classification with class weights %s", class_weights)
        self.loss_fct = BCEWithLogitsLoss(weight=class_weights)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.
        Subclass and override for custom behavior.
        """
        labels = inputs.pop("labels")
        outputs = model(**inputs)

        # Select predictions for tokens that aren't CLS, PAD, or the 2nd+ subword
        # in a word, flattening the logits and labels at the same time.
        flat_outputs = outputs.logits[labels != -100]
        flat_labels = labels[labels != -100].float()  # BCEWithLogitsLoss needs float targets

        loss = self.loss_fct(flat_outputs, flat_labels)

        return (loss, outputs) if return_outputs else loss
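
To make the flattening explicit, here is a tiny standalone sketch of what the labels != -100 indexing does (shapes are made up):

import torch

# Boolean indexing flattens: logits and labels both go from
# (batch, seq_len, num_labels) to a 1-D tensor of the kept entries.
logits = torch.randn(2, 4, 3)
labels = torch.full((2, 4, 3), -100.0)
labels[0, 1] = torch.tensor([1.0, 0.0, 1.0])  # one real token
labels[1, 2] = torch.tensor([0.0, 1.0, 0.0])  # another real token

flat_logits = logits[labels != -100]  # shape: (6,)
flat_labels = labels[labels != -100]  # shape: (6,)
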
2 Likes

It is said, for example, here, that the -100 value is automatically ignored by PyTorch loss functions.

That’s only true for CrossEntropyLoss AFAIK. You can compare the signatures: CrossEntropyLoss has an “ignore_index” option, but BCEWithLogitsLoss does not. So I think that @drussellmrichie’s adaptation is indeed needed.
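
A quick way to see the difference (a minimal sketch):

import torch
from torch.nn import CrossEntropyLoss, BCEWithLogitsLoss

CrossEntropyLoss(ignore_index=-100)      # supported: -100 targets are skipped
# BCEWithLogitsLoss(ignore_index=-100)   # TypeError: no such argument

# With BCEWithLogitsLoss the masking has to be done by hand:
logits = torch.randn(5)
targets = torch.tensor([1.0, -100.0, 0.0, -100.0, 1.0])
mask = targets != -100
loss = BCEWithLogitsLoss()(logits[mask], targets[mask])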

2 Likes

I haven’t finished my loss function yet, but here’s what I found out:

  1. inputs.pop('labels') in @BramVanroy’s example is very important: otherwise the labels are passed to the model and the standard CrossEntropyLoss inside it is computed instead of the custom one (see the sketch after this list).

  2. We should make use of the attention_mask in the custom loss. However, it may be the same as @drussellmrichie’s suggestion about the special tokens, maybe someone will correct me.

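As a quick way to see point 1, a sketch assuming a small placeholder checkpoint (prajjwal1/bert-tiny here is just an example):

import torch
from transformers import AutoModelForTokenClassification

# If labels are passed in, the model computes its own CrossEntropyLoss;
# popping them keeps the loss entirely in the custom compute_loss.
model = AutoModelForTokenClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=3   # placeholder checkpoint
)
input_ids = torch.tensor([[101, 7592, 102]])  # [CLS] hello [SEP]

with_labels = model(input_ids=input_ids, labels=torch.tensor([[0, 1, 2]]))
print(with_labels.loss)     # a tensor: the internal CrossEntropyLoss

without_labels = model(input_ids=input_ids)
print(without_labels.loss)  # None: loss is left to the custom Trainer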

I’m fairly confident that it’s not necessary given the chunk I added with labels!=-100.

EDIT: Just to elaborate: the proof is in the pudding, and indeed I was able to train some decent multi-label NER models with this approach. :slight_smile:
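
At inference time, extracting multi-label predictions might look like this (a sketch; the 0.5 threshold is an assumption to tune):

import torch

# Multi-label decoding: sigmoid each logit independently, then threshold;
# a token can end up with zero, one, or several labels.
logits = torch.randn(1, 6, 4)      # (batch, seq_len, num_labels)
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()       # 0/1 per label per token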

1 Like

What do you mean with padding?

Are you referring to my phrase “the proof is in the pudding”? If so, I apologize; I am using an old proverb. :wink: I am just saying that the proof that labels!=-100 is doing something similar to the attention mask lies in the fact that the model performs decently and generates reasonable predictions for the tokens we care about (not CLS, PAD, etc.). Also, I noticed that the loss went way up after I introduced labels!=-100, I think because the easy predictions for the CLS, PAD, and 2nd+ subword tokens can no longer drive the loss down. That is consistent with @BramVanroy’s comment that BCEWithLogitsLoss does not ignore predictions for special tokens and needs something like my labels!=-100 trick to ignore them.

Hope that clarifies.

2 Likes

Hi @drussellmrichie (and @BunnyNoBugs and @BramVanroy !), thanks for the sample code above. A couple questions if you don’t mind:

  • Is the custom trainer all that’s needed to adapt the model to multi-label classification?
  • Also, is this being implemented on top of AutoModelForTokenClassification?
  • Would you be willing to share an example notebook? I’d love to see the full implementation, including the “right” way to extract the predictions at inference time.

If anyone would be willing to share an example notebook, I’m currently trying to get this working and would very much appreciate a sample implementation!

Thanks!