New pipeline for zero-shot text classification

So, if every seq/label pair has to be fed into the model separately, does that mean the prediction scores are independent for each class when passing multiple labels? With K = 100 candidate labels, would the results be significantly different if one loops through them one at a time, providing only a single candidate_label per prediction run and then aggregating the results of all 100 runs, versus passing them all together in a single list with multi_class=True?

My documents are usually large (> 20 sentences) and I have around K=100 classes, so passing everything together results in a CUDA OOM error. I have been trying out a couple of options.

Option 1:
I break the document up into sentences and then pass all K=100 classes together with multi_class=True (this works).
Option 2:
I loop over the K classes, and in each iteration I pass in the whole document and make a prediction for a single class. At the end of the loop I have predictions for all 100 classes, which I can aggregate and compare.

Based on what you mentioned earlier, I am thinking Option 1 and Option 2 should give similar results. Would that be a correct assumption? I tested this on some sample docs and the results are often pretty close; I just don't know whether that is always the case or just chance.

@joeddav Any comments on zero-shot classification performance for long text (for example, news articles / financial reports / transcripts)?

I do see a lot of high scores (> 0.9) when multi_class=True for a list of custom tags…

Yes – if you're doing multi_class=True, then passing your K labels separately as smaller subsets of candidate_labels (or one by one), as in Option 2, should yield the same result. I think Option 1 is different: it should still work, but it won't give identical results.

By the way, it’s not very hard to implement zero-shot classification without relying on the pipeline if you want more control. I did this in a recent mini-project of mine so that I could do multi-gpu and more efficient batching than the pipeline currently supports. You can reference how I did that here. You basically just need to format each label with the hypothesis template, feed each seq/label pair through the NLI model, and then normalize the NLI model logits.
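Roughly, those three steps look like the following minimal sketch. It assumes facebook/bart-large-mnli and its usual "This example is about {}." template rather than whatever the linked project actually uses, and the label-index comments match that checkpoint's config:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # any MNLI-trained cross-encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

sequence = "Who are you voting for in 2020?"
candidate_labels = ["politics", "economics", "public health"]
hypothesis_template = "This example is about {}."

# One premise/hypothesis pair per candidate label.
premises = [sequence] * len(candidate_labels)
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
inputs = tokenizer(premises, hypotheses, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits  # for this checkpoint: [contradiction, neutral, entailment]

# multi_class=True style: per label, softmax over entailment vs. contradiction only.
multi_class_scores = logits[:, [0, 2]].softmax(dim=1)[:, 1]

# multi_class=False style: softmax of the entailment logits across all candidate labels.
single_label_scores = logits[:, 2].softmax(dim=0)

for label, score in zip(candidate_labels, multi_class_scores.tolist()):
    print(f"{label}: {score:.3f}")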

For long documents, I don’t think there’s an ideal solution right now. If truncation isn’t satisfactory, then the best thing you can do is probably split the document into smaller segments and ensemble the scores somehow.
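As a rough illustration of that split-and-ensemble idea (not an official recipe; the naive sentence split and plain averaging below are just one possible choice of segmentation and aggregation):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

document = (
    "The company reported record revenue this quarter. "
    "Its new smartphone line drove most of the growth. "
    "Analysts expect margins to improve next year."
)
candidate_labels = ["finance", "technology", "sports"]

# Naive segmentation; any splitter (nltk, spacy, fixed-size windows) would do.
segments = [s.strip() + "." for s in document.split(".") if s.strip()]

# Ensemble by averaging per-label scores across segments.
scores = {label: 0.0 for label in candidate_labels}
for segment in segments:
    result = classifier(segment, candidate_labels, multi_label=True)  # multi_class in older versions
    for label, score in zip(result["labels"], result["scores"]):
        scores[label] += score / len(segments)

print(sorted(scores.items(), key=lambda item: -item[1]))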

I do see a lot of high scores (> 0.9) when multi_class=True for a list of custom tags…

Yeah, unfortunately this will just happen sometimes :man_shrugging: It's the reason multi_class=False is recommended when possible. It's a lot easier to tell which one of K labels is the correct label than to independently predict each label from the class name alone, as you do when multi_class=True. You might just have to try out a bunch of examples and see which threshold works best. It's a really hard problem to tell whether the class name y applies to the sentence x without any training data or additional context. So far this method is the best I've encountered, but hopefully we can improve with time.
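To make the threshold idea concrete, a tiny sketch (the 0.8 cutoff below is made up and should be tuned on your own examples):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The central bank raised interest rates again this quarter.",
    ["finance", "sports", "weather"],
    multi_label=True,  # multi_class in older pipeline versions
)
# Keep only labels whose independent score clears the (hand-tuned) threshold.
predicted = [label for label, score in zip(result["labels"], result["scores"]) if score > 0.8]
print(predicted)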

Sure, Joe.
May I ask whether any interpretation tooling can be linked to the predictions? Is that something supported in the pipeline?
Any resources you would suggest for this?

Thanks for all the info and your amazing post, @joeddav. I have a question on fine-tuning the model using a limited amount of training data for a small number of labels (let's say 10).

When fine-tuning and optimising for cross-entropy, how do we make sure the model isn't overfitting on this limited number of examples and forgetting what it has already learned during pre-training?

In @joeddav's blog post, it is said that "One problem that arises after fine-tuning is that the model predicts a much higher probability for labels it has seen than for those it has not. To mitigate this issue, the authors introduce a procedure that penalizes labels at test time which were seen at training time. See the paper for full details."

I don’t think you need to worry too much about catastrophic forgetting, though as @nielsr pointed out you will end up with higher predicted probabilities for those 10 labels than for any novel labels seen after that. If you’re worried about it, you can always continue training the model on MNLI while you fine-tune it for your classification task!

I have around 25,000 surveys, and the task I want to do is classify each survey into categories. Each survey has around 50-100 words, no more, and the dataset has at most 10-12 categories. Is there sample code I can refer to so that I can use this zero-shot classifier to tag this unlabelled input data with one of the categories? This is similar to topic modelling, but I do not get good results with LDA, so I want to try this method out and see if it works (like a charm!). @joeddav @valhalla your help is much appreciated here.
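A minimal sketch of what that single-label tagging could look like with the existing pipeline (the survey texts, category names, and model choice below are made up for illustration):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

surveys = [
    "The checkout process was slow and the app crashed twice.",
    "Support resolved my billing issue within a day.",
]
categories = ["app performance", "customer support", "billing", "pricing", "delivery"]

# multi_label=False (the default) makes the categories compete, so the top
# label is the single best category for each survey.
for text in surveys:
    result = classifier(text, categories)
    print(result["labels"][0], "<-", text)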

Hi, is there any new way to do multi-GPU inference? I am working in SageMaker. I have a lot of data and want to use multiple GPUs for prediction.


Hello, is there a way to increase the maximum number of characters that can be input? Right now we are seeing that the maximum is 1000 characters.

Thank you,

HK

Hope you don't mind me posting my blog post here, but I did zero-shot classification with sentence transformers. These are the results.

While I probably haven't done the extensive tests that HF has done, this approach has a significant advantage in speed. If you want me to do a PR, I'm happy to do so, but I would need some guidance.
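For context, a rough sketch of the embedding-similarity approach the blog post describes (not the post's exact code; the checkpoint name is just a common sentence-transformers model): embed the text and the label names, then rank the labels by cosine similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

text = "The quarterly report shows record revenue and strong profit margins."
labels = ["finance", "sports", "technology"]

# Encode the document and the label names into the same embedding space.
text_emb = model.encode(text, convert_to_tensor=True)
label_embs = model.encode(labels, convert_to_tensor=True)

# Rank labels by cosine similarity to the document embedding.
scores = util.cos_sim(text_emb, label_embs)[0]
for label, score in sorted(zip(labels, scores.tolist()), key=lambda x: -x[1]):
    print(f"{label}: {score:.3f}")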

Hi @valhalla.

Do you plan to update your library onnx_transformers?

If not, how can we use transformers.onnx with HF models other than language models without a head (see "Exporting transformers models"), for example a QA model, with pipelines (or without)?

Thanks.

I think this is really cool for generating synthetic data. We've actually used GPT-3 for generating summaries, and I plan to use those summaries as a basis for training a new summarization model on the Longformer base model. I could easily see someone speedrunning their classification models by first going through a zero-shot classifier. It's like a new semi-supervised technique!

Hi @joeddav ,

I have a dataset with one example (text) for each class, and I wanted to know how I can tune the ZSL model to learn from one example per class. I am trying to figure out how to turn this ZSL setup into one-shot learning.

Thanks & Regards,
Rajat

It looks like there is no development on it anymore?

Thanks for sharing this!

Do you have recommendations for what form of label makes the most sense in this case?

While running through toy examples, I noticed that different "forms" of a label could result in different classification (entailment) probabilities. For example:

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")
labels = ['Food and Drink', 'food and drink', 'food & drink', 'Food & Drink', 'Foods & Drinks', 'technology']
classifier('this about bagel and cream cheese. what is your favorite flavor cream cheese?', labels, multi_label=True)

{'sequence': 'this about bagel and cream cheese. what is your favorite flavor cream cheese?',
 'labels': ['food & drink', 'food and drink', 'Food & Drink', 'Food and Drink', 'Foods & Drinks', 'technology'],
 'scores': [0.6025, 0.4726, 0.1834, 0.1197, 0.01321, 0.000253]}

– this is something I can implement myself, but it would be nice to have built in (a sketch of doing it in user code follows the list below):

  • an option to disable returning the text sequence in the classifier output
  • an option to enforce a consistent label order in the output (e.g. the order provided in the input)
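A small sketch of handling both points in user code today, reusing the classifier and labels from the example above:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-1")
labels = ['Food and Drink', 'food and drink', 'food & drink', 'Food & Drink', 'Foods & Drinks', 'technology']
text = 'this about bagel and cream cheese. what is your favorite flavor cream cheese?'

result = classifier(text, labels, multi_label=True)

# 1. Drop the echoed sequence by simply not keeping result["sequence"].
# 2. Re-order the scores to match the input label order.
score_by_label = dict(zip(result["labels"], result["scores"]))
ordered_scores = [score_by_label[label] for label in labels]
print(ordered_scores)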

Hi, thanks for sharing the demo. I was wondering whether automatic sentence tokenization happens in the pipeline, or whether you recommend giving single sentences as input. I have some paragraphs of a few sentences each and am not sure whether I have to split them into sentences before passing them to the classifier or whether the pipeline takes care of that.
Thank you in advance.

@joeddav maybe I missed this within this long thread of posts, but is there an example of how to further fine-tune the models (as you suggest above in option 1)? I'm using the model joeddav/xlm-roberta-large-xnli and get about 50% accuracy as it stands, but I'd like to use the few labeled texts I have to try to improve it. I'm somewhat new to Hugging Face / torch, so a code snippet to get me started would go a long way :slight_smile:

Check this out for doing zero-shot distillation.

The whole point of zero-shot is that you don’t do fine-tuning. If you do fine-tuning, then it isn’t zero-shot anymore :wink:
