Multi-label token classification

Hi!

I am trying to solve a token classification problem in a multi-label setup, and so far I haven’t found a good way to do it.
Unlike AutoModelForSequenceClassification, AutoModelForTokenClassification lacks the problem_type parameter, where you can specify that you’re working with a multi-label problem.
Should I dig into the logits and write a custom loss, or is there a more straightforward solution?
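
For context, the sequence-classification side looks like this (a minimal sketch; the checkpoint name and num_labels are placeholders):

from transformers import AutoModelForSequenceClassification

# Available for sequence classification, but with no token-classification equivalent:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",                        # placeholder checkpoint
    num_labels=5,                               # placeholder label count
    problem_type="multi_label_classification",  # switches the loss to BCEWithLogitsLoss
)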

Thanks!

2 Likes

I would also love to be able to do this. I’m trying to do some NER that has multiple layers of annotation such that a given token could have more than one label…

Actually, I already started implementing a custom loss within the compute_loss function of the Trainer…

…but the thing surely deserves a feature request on GitHub! Will try to do that

1 Like

Oh nice. Do you by any chance have code you could share? :wink:

I don’t think it should be much harder than swapping out CrossEntropyLoss for BCEWithLogitsLoss, one-hot encoding your labels, and making sure they are floats.
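
As a minimal sketch of that swap (shapes and values here are made up):

import torch
from torch.nn import BCEWithLogitsLoss

num_labels = 3

# Two tokens, each with a multi-hot float label vector (a token can carry several labels).
logits = torch.randn(2, num_labels)           # raw model outputs
labels = torch.tensor([[1.0, 0.0, 1.0],       # token 1: labels 0 and 2
                       [0.0, 1.0, 0.0]])      # token 2: label 1 only

loss = BCEWithLogitsLoss()(logits, labels)    # targets must be floats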

2 Likes

Yeah, that’s roughly what I am doing. Will share the code once it’s done!

1 Like

@BunnyNoBugs Any update on this? :pleading_face: I have been working on this myself but am running into some challenges…

This is an untested attempt, but I think it should work. Read more about BCEWithLogitsLoss here.

from typing import Optional

import logging

from torch import FloatTensor
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelTrainer(Trainer):
    def __init__(self, *args, class_weights: Optional[FloatTensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        if class_weights is not None:
            class_weights = class_weights.to(self.args.device)
            logging.info("Using multi-label classification with class weights %s", class_weights)
        self.loss_fct = BCEWithLogitsLoss(weight=class_weights)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.
        Subclass and override for custom behavior.
        """
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        try:
            loss = self.loss_fct(outputs.logits.view(-1, model.num_labels), labels.view(-1))
        except AttributeError:  # DataParallel
            loss = self.loss_fct(outputs.logits.view(-1, model.module.num_labels), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

Make sure your data is correctly formatted, e.g., 0 or 1 encoded for each label for each sample.
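
For example, the per-token labels for one sample might look like this (a made-up illustration with three label types):

# One sample of four tokens, three possible labels per token;
# each row is a token, each column is 0/1 for one label.
labels = [
    [1, 0, 0],  # token 0: label A only
    [1, 1, 0],  # token 1: labels A and B
    [0, 0, 0],  # token 2: no label
    [0, 0, 1],  # token 3: label C only
]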

1 Like

Yep, I did the same but without view and got an error with the dims. I will get back to this task on Thursday.

Thanks so much @BunnyNoBugs and @BramVanroy! I’m trying this now and getting some other issues but I think they are specific to my situation. I may report back later with an update if I think it will be useful to others. Thanks again!

Here’s a question: doesn’t the custom loss function need to ignore the predictions for special tokens like CLS, PAD, and SEP, and (if one is only applying labels to the first subword in each word) the non-first subwords?
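
For reference, those positions are usually marked with -100 when aligning labels at tokenization time. A sketch, assuming a fast tokenizer and one multi-hot float vector per word:

from transformers import AutoTokenizer

# Mark special tokens and 2nd+ subwords with -100 so a custom loss
# can filter them out later.
def align_labels(texts, word_labels, tokenizer, num_labels):
    enc = tokenizer(texts, truncation=True, is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(word_labels):
        word_ids = enc.word_ids(batch_index=i)
        aligned, prev = [], None
        for wid in word_ids:
            if wid is None or wid == prev:   # special token or 2nd+ subword
                aligned.append([-100.0] * num_labels)
            else:
                aligned.append(labels[wid])
            prev = wid
        all_labels.append(aligned)
    enc["labels"] = all_labels
    return enc

# Hypothetical usage: one sample of two words, three label types, pre-split input.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = align_labels([["New", "York"]], [[[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]], tokenizer, 3)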

Thanks, this was super helpful! I followed your tips and got dimension errors, but fixed them by modifying the view parameters; I also had to cast the labels to floats:

loss = self.loss_fct(outputs.logits.view(-1, model.num_labels), labels.view(-1, model.num_labels).float())

I need to look at my results more closely, but I think I’ve got this working. My model only gets to about 60% F1 on the validation set, but I have some ideas about how to improve it. :imp:

from typing import Optional

import logging

from torch import FloatTensor
from torch.nn import BCEWithLogitsLoss
from transformers import Trainer


class MultiLabelNERTrainer(Trainer):
    def __init__(self, *args, class_weights: Optional[FloatTensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        if class_weights is not None:
            class_weights = class_weights.to(self.args.device)
            logging.info("Using multi-label classification with class weights %s", class_weights)
        self.loss_fct = BCEWithLogitsLoss(weight=class_weights)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.
        Subclass and override for custom behavior.
        """
        labels = inputs.pop("labels")
        outputs = model(**inputs)

        # Select predictions for tokens that aren't CLS, PAD, or the 2nd+ subword
        # in a word, flattening the logits and labels at the same time.
        flat_outputs = outputs.logits[labels != -100]
        flat_labels = labels[labels != -100].float()  # BCEWithLogitsLoss needs float targets

        loss = self.loss_fct(flat_outputs, flat_labels)

        return (loss, outputs) if return_outputs else loss
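
To make the flattening explicit, here is a tiny standalone sketch of what the labels != -100 indexing does (shapes are made up):

import torch

# Boolean indexing flattens: logits and labels both go from
# (batch, seq_len, num_labels) to a 1-D tensor of the kept entries.
logits = torch.randn(2, 4, 3)
labels = torch.full((2, 4, 3), -100.0)
labels[0, 1] = torch.tensor([1.0, 0.0, 1.0])  # one real token
labels[1, 2] = torch.tensor([0.0, 1.0, 0.0])  # another real token

flat_logits = logits[labels != -100]  # shape: (6,)
flat_labels = labels[labels != -100]  # shape: (6,)
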
2 Likes

It is said, for example, here, that the -100 value is automatically ignored by PyTorch loss functions.

That’s only true for CrossEntropyLoss AFAIK. You can compare the signatures: CrossEntropyLoss has an “ignore_index” option, but BCEWithLogitsLoss does not. So I think that @drussellmrichie’s adaptation is indeed needed.
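
A quick way to see the difference (a minimal sketch):

import torch
from torch.nn import CrossEntropyLoss, BCEWithLogitsLoss

CrossEntropyLoss(ignore_index=-100)      # supported: -100 targets are skipped
# BCEWithLogitsLoss(ignore_index=-100)   # TypeError: no such argument

# With BCEWithLogitsLoss the masking has to be done by hand:
logits = torch.randn(5)
targets = torch.tensor([1.0, -100.0, 0.0, -100.0, 1.0])
mask = targets != -100
loss = BCEWithLogitsLoss()(logits[mask], targets[mask])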

2 Likes

I haven’t finished my loss function yet, but here’s what I found out:

  1. inputs.pop('labels') in @BramVanroy’s example is very important: otherwise the labels are passed to the model and the standard CrossEntropyLoss inside it is computed instead of the custom one (see the sketch after this list).

  2. We should make use of the attention_mask in the custom loss. However, it may be the same as @drussellmrichie’s suggestion about the special tokens, maybe someone will correct me.

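As a quick way to see point 1, a sketch assuming a small placeholder checkpoint (prajjwal1/bert-tiny here is just an example):

import torch
from transformers import AutoModelForTokenClassification

# If labels are passed in, the model computes its own CrossEntropyLoss;
# popping them keeps the loss entirely in the custom compute_loss.
model = AutoModelForTokenClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=3   # placeholder checkpoint
)
input_ids = torch.tensor([[101, 7592, 102]])  # [CLS] hello [SEP]

with_labels = model(input_ids=input_ids, labels=torch.tensor([[0, 1, 2]]))
print(with_labels.loss)     # a tensor: the internal CrossEntropyLoss

without_labels = model(input_ids=input_ids)
print(without_labels.loss)  # None: loss is left to the custom Trainer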

I’m fairly confident that it’s not necessary given the chunk I added with labels!=-100.

EDIT: Just to elaborate: the proof is in the pudding, and indeed I was able to train some decent multi-label NER models with this approach. :slight_smile:
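
At inference time, extracting multi-label predictions might look like this (a sketch; the 0.5 threshold is an assumption to tune):

import torch

# Multi-label decoding: sigmoid each logit independently, then threshold;
# a token can end up with zero, one, or several labels.
logits = torch.randn(1, 6, 4)      # (batch, seq_len, num_labels)
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()       # 0/1 per label per token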

1 Like

What do you mean with padding?

Are you referring to my phrase “the proof is in the pudding”? If so, I apologize; I am using an old proverb. :wink: I am just saying that the proof that labels!=-100 is doing something similar to the attention mask lies in the fact that the model performs decently and generates reasonable predictions for the tokens we care about (not CLS, PAD, etc.). Also, I noticed that the loss went way up after I introduced labels!=-100, I think because the easy predictions for the CLS, PAD, and 2nd+ subword tokens can no longer drive the loss down. That is consistent with @BramVanroy’s comment that BCEWithLogitsLoss does not ignore predictions for special tokens and needs something like my labels!=-100 trick to ignore them.

Hope that clarifies.

2 Likes

Hi @drussellmrichie (and @BunnyNoBugs and @BramVanroy !), thanks for the sample code above. A couple questions if you don’t mind:

  • Is the custom trainer all that’s needed to adapt the model to multi-label classification?
  • Also, is this being implemented on top of AutoModelForTokenClassification?
  • Would you be willing to share an example notebook? I’d love to see the full implementation, including the “right” way to extract the predictions at inference time.

If anyone would be willing to share an example notebook, I’m currently trying to get this working and would very much appreciate a sample implementation!

Thanks!