The pipeline can use any model trained on an NLI task, by default bart-large-mnli. It works by posing each candidate label as a "hypothesis" and the sequence we want to classify as the "premise". In the first example in the gif above, the model would be fed,
<cls> Who are you voting for in 2020 ? <sep> This example is politics. <sep>
and likewise for each candidate label. It's therefore important to keep in mind that each candidate label requires its own forward pass. In the single-label case, we take the entailment scores as logits and put them through a softmax so that the candidate label scores sum to 1. When multi_class=True is passed, we instead softmax the entailment vs. contradiction scores for each candidate label independently.
You can also change the hypothesis template. As shown in the formatted example above, the default template is This example is {}. This seems to work well in general, but you may be able to improve results by tailoring it to your specific setting (discussed in the example notebook).
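For reference, here is a minimal sketch of calling the pipeline and overriding the template with the hypothesis_template argument. The sequence, candidate labels, and custom template string are just illustrative:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# default behaviour: uses the "This example is {}." template described above
classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "economy", "public health"],
)

# override the template to better match your setting
classifier(
    "Who are you voting for in 2020?",
    candidate_labels=["politics", "economy", "public health"],
    hypothesis_template="This text is about {}.",
)
```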
Also feel free to check out the blog post I wrote on zero-shot classification a few months back, and our live demo which uses this method for zero-shot topic classification. I hope you find it useful!
Edit: FYI, you will get a big speedup by using this on GPU. You can do this by passing device=0, where 0 is the device number, to the pipeline factory:
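```python
from transformers import pipeline

# a sketch of the call described above: device=0 puts the pipeline on the first GPU
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      device=0)
```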
Thank you for providing this pipeline, blog, and Colab notebook. I read your blog twice (once during ACL :)) and the relevant part of the associated paper to refresh. I have a few questions:
What happens before and after multi_class=True, exactly?
So let's say I trained bert-base-uncased on MNLI. The last linear layer outputs a tensor with three values (one each for entailment, contradiction, and neutral).
How is the last linear layer handled, since the number of candidate_labels is variable and it's not being trained?
In the paper it's mentioned that once entailment is predicted, we can take it as a prediction and (you specify in the Colab notebook) that the label applies to the text. I didn't catch this part properly. Could you please explain? I'm not sure exactly what happens when entailment is predicted. Do you take a standard softmax over all the labels?
When we pass multi_class=True, do you just output the confidence scores (from the linear layer directly)?
What does hypothesis_template actually do? Does it help in computing a better hidden representation?
When we use this pipeline, we are using a model trained on MNLI, including the last layer which predicts one of three labels: contradiction, neutral, and entailment. Since we have a list of candidate labels, each sequence/label pair is fed through the model as a premise/hypothesis pair, and we get out the logits for these three categories for each label. So for a single sequence we end up with a matrix of logits of shape (num_candidate_labels, 3).
When multi_class=False, we do a softmax of the entailment logits over all the candidate labels, i.e. logits[:,-1].softmax(dim=0). This gives a probability for each label such that they sum to one.
When multi_class=True, we do a softmax over entailment vs contradiction for each candidate label independently, i.e. logits[:,[0,-1]].softmax(dim=1)[:,-1]. This gives a probability for each candidate label between 0 and 1, but they are independent and do not sum to 1.
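To make the two cases concrete, here is a rough sketch of that computation done by hand outside the pipeline. The sequence and candidate labels are illustrative, and the label order is assumed to be [contradiction, neutral, entailment], as with bart-large-mnli:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

sequence = "Who are you voting for in 2020?"
candidate_labels = ["politics", "economy", "sports"]
hypotheses = [f"This example is {label}." for label in candidate_labels]

# one premise/hypothesis pair per candidate label, i.e. one forward pass each
inputs = tokenizer([sequence] * len(hypotheses), hypotheses,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape (num_candidate_labels, 3)

# multi_class=False: softmax of the entailment logits over all candidate labels
single_label_probs = logits[:, -1].softmax(dim=0)

# multi_class=True: entailment vs. contradiction for each label independently
multi_label_probs = logits[:, [0, -1]].softmax(dim=1)[:, -1]
```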
As for the hypothesis template, it is a template that formats a candidate label as a sequence. So if you're about to pass a candidate label of politics through the model and you have the default hypothesis template of This example is {}., the model would be fed This example is politics. as the hypothesis.
Thanks for this helpful reply. I wanted to know one more thing: what if I were to perform ZSL on an NLI task? For instance, let's say I use bert-base trained on MNLI and I change the labels, for instance to ['True', 'False', 'neither']. The model was explicitly trained on an NLI task, and I want to evaluate it on some other NLI task (so a different distribution, but similar structure); will it still constitute ZSL? The reason I'm asking is that the classification labels are changed now, but the model is still being asked the same thing. Do you have any ideas on how it can be used for testing generalization in a similar way, or is the method I mentioned fine?
I don't think that would be considered ZSL, just generalization. There's room for debate here, but in general it's not zero-shot unless you're evaluating it on a task that is meaningfully different in some way from the one it was trained on. NLI -> topic classification is a pretty big shift; MNLI -> SNLI is much less so.
I was perhaps being too simple when I said "it predicts one of three labels: contradiction, neutral, and entailment". The NLI classifier outputs a distribution over three values. Those values are logits corresponding to the labels contradiction, neutral, and entailment, but the model doesn't see those words, we just know them. So there's not really a way to "change" the labels. The output of the NLI model is just a vector.
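One way to see this for yourself: the label names live only in the model config, while the model's forward pass just produces a 3-dimensional vector of logits. A quick check, assuming the facebook/bart-large-mnli checkpoint:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/bart-large-mnli")
# maps the three output positions to contradiction/neutral/entailment;
# the model itself never sees these words, it only outputs three logits
print(config.id2label)
```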
Hey @joeddav, not sure if this is the right forum to raise this. But how can I replace the default RoBERTa/BART model with a custom model in the classifier pipeline?
@rsk97 In addition, just make sure the model used is trained on an NLI task and that the last output label corresponds to entailment while the first output label corresponds to contradiction.
Yes, you can. The idea is to train the model on an NLI task, then pass this custom model that you have trained into the pipeline.
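Putting that together, something like this should work. The model path here is just a placeholder for wherever your fine-tuned NLI checkpoint lives:

```python
from transformers import pipeline

# "path/to/your-nli-model" is a placeholder for your own fine-tuned NLI checkpoint
# (saved with save_pretrained() or pushed to the hub), with the label order noted above
classifier = pipeline("zero-shot-classification",
                      model="path/to/your-nli-model",
                      tokenizer="path/to/your-nli-model")

classifier("Who are you voting for in 2020?",
           candidate_labels=["politics", "sports"])
```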
You can also do sentiment analysis using the zero-shot text classification pipeline. But if you have sufficient data and the domain you're targeting for sentiment analysis is pretty niche, you could train a transformer (or any other model for that matter) on the data you have.
Thanks for the helpful answer, @rsk97. Let me just add a bit:
I discuss this briefly in my blog post under Classification as Natural Language Inference -> When Some Annotated Data is Available. In short, if you have a limited amount of labeled data, you can further fine-tune the pre-trained NLI model. Pass the true label for a given sequence in the same way as you would during inference, e.g. <cls> Who are you voting for in 2020 ? <sep> This text is about politics . <sep>, and calculate the loss as if you were doing NLI with the true label set to entailment. You should also pass an equal number of sequences with a randomly selected false label, such as <cls> Who are you voting for in 2020 ? <sep> This text is about sports . <sep>. For these fictitious examples, the target label should be set to contradiction. This method will work a little bit if you have a small amount of labeled data, but it will really excel if you have a large amount of data for some of your labels and only a small amount of data (or no data) for other labels.
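As a rough sketch of how those training pairs could be constructed: the template, the label names, and the 0/2 label ids below are assumptions based on the contradiction/entailment ordering mentioned above, and build_examples is just a hypothetical helper:

```python
import random

# Build NLI-style fine-tuning examples as described above.
# Label ids assume the MNLI ordering used by bart-large-mnli:
# 0 = contradiction, 2 = entailment (check this for your own model).
def build_examples(sequence, true_label, all_labels, template="This text is about {}."):
    examples = []
    # the true label, framed as an entailed hypothesis
    examples.append((sequence, template.format(true_label), 2))
    # a randomly selected false label, framed as a contradicted hypothesis
    false_label = random.choice([l for l in all_labels if l != true_label])
    examples.append((sequence, template.format(false_label), 0))
    return examples

build_examples("Who are you voting for in 2020?", "politics",
               ["politics", "sports", "business"])
```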
We also have a bunch of ready-trained sentiment classifiers in the model hub. Use one of those out of the box, or fine-tune it further on your particular dataset.
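For example, the default sentiment-analysis pipeline pulls one of those ready-trained classifiers from the hub:

```python
from transformers import pipeline

# downloads a default sentiment checkpoint; pass model="..." to pick a specific one
# or your own fine-tuned model instead
sentiment = pipeline("sentiment-analysis")
sentiment("I love this movie!")
# returns something like [{"label": "POSITIVE", "score": 0.99...}]
```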
This brings up a good point: the zero-shot classification pipeline should only be used in the absence of labeled data or when fine-tuning a model is not feasible. If you have a good set of labeled data and you are able to fine-tune a model, you should fine-tune a model. You will get better performance at a lower computational cost.
Hey @joeddav
I am using the model to classify a bunch of tweets. Before reading about the pipeline, I was using the process shared in your blog post. For some reason, when I try the same method through the pipeline with the same model and tokenizer (facebook/bart-large-mnli), I notice that the results come back faster, which helps a lot since I have to classify more than 100K tweets, but at the same time the resulting confidence scores are quite different. Can you help me understand what difference between the two approaches might be causing this? Thanks!
The probability distribution is different because of the way it is calculated. As mentioned a few posts above, as long as you keep multi_class=False, the softmax is calculated across the entailment scores for all the classes, whereas when you're not using the pipeline, the softmax may be calculated differently.
The speed variations seem interesting. Could you please share the associated numbers and the specifications of the machine you're using?
Also, while running some experiments, I could see that the model was sensitive to case. I got two different classes as outputs when I used "Hey" vs "hey". Is there something I'm missing or is this expected behaviour?
This makes sense. However, I am referring to the case where I am running the pipeline with multi_class=True over the candidate labels. In this case, I notice the probability distribution is different compared to individually running the model over the candidate label and tweet (sequence) pairs. Is there something I'm missing or is this expected behavior?
Another thing I might need help with is addressing the following questions to get better results:
To make the process faster, is it advisable to move to a smaller, lighter model than bart-large-mnli, or is there another way to process 100K tweets faster while keeping accuracy high with bart-large-mnli?
It is interesting to note that the model is sensitive to case. Accordingly, for my case, would it be better to run the model with preprocessed tweets as input (stopword removal, case folding, etc.), or would the raw tweets work better?
Yes, that makes sense. Even when multi_class=True, the probability of each class is calculated independently, using just entailment vs. contradiction (neutral is ignored, if I'm right).
For the questions:
Well, one definite way to make it faster is to move to a smaller model. Maintaining the exact accuracy while still making it faster is tricky. A couple of options that could be explored are parallelism, quantization of the layers, etc.
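As a rough sketch of the quantization option: dynamic quantization is just one possibility, it mainly targets CPU inference, and you should benchmark accuracy on your own data before committing to it:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

# quantize the linear layers to int8 for faster CPU inference;
# expect some accuracy trade-off, so compare against the full-precision model
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```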
I'm assuming you'll need pre-processing, as otherwise certain symbols in conjunction with words might get tokenised differently. But you'll have to ensure you use pre-processing steps that take into account the nature of the tweet, because the hashtags, tagging, etc. are core parts of the tweet. You wouldn't want to miss out on that.
But on the performance side, I guess you could benchmark by experimenting on a smaller dataset (pre-processed vs. raw) and check it out.
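A minimal sketch of the kind of light cleanup described above, keeping hashtags and mentions intact (light_clean is just a hypothetical helper; benchmark pre-processed vs. raw as suggested):

```python
import re

def light_clean(tweet: str) -> str:
    tweet = re.sub(r"http\S+", "", tweet)       # drop URLs
    tweet = re.sub(r"\s+", " ", tweet).strip()  # collapse extra whitespace
    return tweet                                # hashtags and @mentions are left as-is

light_clean("hey @joeddav check this out https://example.com #zeroshot")
```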