New pipeline for zero-shot text classification

Thanks a lot @joeddav for the extra clarification!

1 Like

@joeddav - I am getting the error below after installing transformers:

from transformers import pipeline
classifier = pipeline("zero-shot-classification")
KeyError: "Unknown task zero-shot-classification, available tasks are ['feature-extraction', 'sentiment-analysis', 'ner', 'question-answering', 'fill-mask', 'summarization', 'translation_en_to_fr', 'translation_en_to_de', 'translation_en_to_ro', 'text-generation']"

2 Likes

@joeddav
I have two queries related to zero-shot text classification:

  1. Can we train zero-shot text classification on our own data? If yes, please share an example.
  2. What is the difference between sentiment analysis and zero-shot text classification, and which one is better to use?

Hey, can you check which version of transformers you're using? If you update using the code provided in the Colab notebook, it should work.

  1. Yes, you can. The idea is to fine-tune the model on an NLI task; you can then pass this custom model into the pipeline (see the sketch after this reply).

  2. You can also do sentiment analysis using the zero-shot text classification pipeline. But if you have sufficient data and the domain you're targeting for sentiment analysis is fairly niche, you could train a transformer (or any other model, for that matter) on the data you have.

Hope this makes sense and is helpful.
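
To illustrate point 1, here is a minimal sketch (the checkpoint path is a placeholder, not a real model) of passing a custom NLI-fine-tuned model into the zero-shot pipeline:

from transformers import pipeline

# assumption: 'path/to/your-nli-finetuned-model' is a local directory or hub id
# of a model that has been fine-tuned on an NLI task
classifier = pipeline('zero-shot-classification', model='path/to/your-nli-finetuned-model')
print(classifier('Who are you voting for in 2020?', candidate_labels=['politics', 'sports']))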

2 Likes

Thanks for the helpful answer, @rsk97. Let me just add a bit:

  1. I discuss this briefly in my blog post under Classification as Natural Language Inference -> When Some Annotated Data is Available. In short, if you have a limited amount of labeled data, you can further fine-tune the pre-trained NLI model. Pass the true label for a given sequence in the same way as you would during inference, e.g. <cls> Who are you voting for in 2020 ? <sep> This text is about politics . <sep>, and calculate the loss as if you were doing NLI with the true label set to entailment. You should also pass an equal number of sequences with a randomly selected false label, such as <cls> Who are you voting for in 2020 ? <sep> This text is about sports . <sep>. For these fictitious examples, the target label should be set to contradiction. This method will help a little if you only have a small amount of labeled data, but it really excels when you have a large amount of data for some of your labels and only a small amount (or no data) for others. (A rough code sketch is included at the end of this reply.)

  2. We also have a bunch of ready-trained sentiment classifiers in the model hub. Use one of those out of the box, or fine-tune it further on your particular dataset.

This brings up a good point: the zero-shot classification pipeline should only be used in the absence of labeled data or when fine-tuning a model is not feasible. If you have a good set of labeled data and you are able to fine-tune a model, you should fine-tune a model; you will get better performance at a lower computational cost.
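
As a rough sketch of the fine-tuning setup described in point 1 (assuming the facebook/bart-large-mnli label mapping, which should be checked against the model config rather than taken for granted):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# look up the NLI class ids from the config rather than hard-coding them
entailment_id = model.config.label2id['entailment']
contradiction_id = model.config.label2id['contradiction']

premise = 'Who are you voting for in 2020?'
true_hypothesis = 'This text is about politics.'   # the true label -> target is entailment
false_hypothesis = 'This text is about sports.'    # a randomly selected false label -> target is contradiction

inputs = tokenizer([premise, premise], [true_hypothesis, false_hypothesis], return_tensors='pt', padding=True)
labels = torch.tensor([entailment_id, contradiction_id])

outputs = model(**inputs, labels=labels)  # standard classification loss over the NLI classes
outputs.loss.backward()                   # then step an optimizer as in any ordinary fine-tuning loop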

5 Likes

Hey @joeddav
I am using the model to classify a bunch of tweets. Before reading about the pipeline, I was using the approach shared in your blog post. For some reason, when I try the same method through the pipeline with the same model and tokenizer (facebook/bart-large-mnli), the results come back faster, which helps a lot since I have to classify more than 100K tweets, but the resulting confidence scores are quite different. Can you help me understand what difference between the two approaches might be causing this? Thanks!

The probability distribution is different owing to the way it is calculated. As mentioned in a few posts above, as long as you keep multi_class=False, the softmax is calculated across the entailment scores for all the classes, whereas when you are not using the pipeline the softmax may be calculated differently.

The speed variations seem interesting. Could you please share some of the associated numbers and the specifications of the machine you're using?

Hope this answer made sense. @kmehra

Also, while running some experiments, I could see that the model was sensitive to case. I got two different classes as outputs when I used “Hey” vs “hey”. Is there something I'm missing, or is this expected behaviour?

This makes sense. However, I am referring to the case where I am running the pipeline with multi_class=True over the candidate labels. In this case, I notice the probability distribution is different compared to individually running the model over candidate label and tweet (sequence) pairs. Is there something I'm missing, or is this expected behavior?

Another thing I might need help with is addressing the following questions to get better results:

  1. To make the process faster, is it advisable to move to a smaller, lighter model than bart-large-mnli, or is there another way to speed up processing 100K tweets while keeping accuracy high with bart-large-mnli?

  2. It is interesting to note that the model is sensitive to case. Accordingly, for my use case, would it be better to run the model on preprocessed tweets (stopword removal, case folding, etc.) or should the raw tweets work better?

Thanks @rsk97

Yes, that makes sense. Even when multi_class=True, the probability of each class is calculated independently using just entailment vs contradiction (neutral is ignored, if I'm right).
For the questions:

  1. Well, one definite way to make it faster is to move to a smaller model. Maintaining the exact accuracy while still making it faster is tricky. A couple of options that could be explored are parallelism and quantization of the layers (a rough quantization sketch follows below).
  2. I'm assuming you'll need pre-processing, as otherwise certain symbols in conjunction with words might get tokenized differently. But you'll have to make sure the pre-processing steps take into account the nature of the tweet, because hashtags, tagging, etc. are core parts of a tweet; you wouldn't want to miss out on those.
    On the performance side, I guess you could benchmark by experimenting on a smaller dataset (pre-processed vs raw) and check it out.

Let me know if this looks correct.
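
As a rough sketch of the quantization option mentioned in point 1 (dynamic quantization of the linear layers; this is a generic PyTorch technique rather than anything specific to the zero-shot pipeline, and some accuracy loss is possible):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# swap the nn.Linear weights for int8 versions to speed up CPU inference
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

classifier = pipeline('zero-shot-classification', model=quantized_model, tokenizer=tokenizer)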

So does that mean the probabilities from running the model individually (w/o the pipeline) do not ignore neutral? I'm trying to understand what might be better for my use case and what creates the difference between the two cases so I can account for it.
As for the questions:

  1. Will try and look into parallelism. Is there any particular approach (or resource) you suggest? Is there a parameter I could use, like n_jobs or something?

  2. Basic pre-processing, such as removing certain symbols (# and @), is being done regardless. The question is whether to run the tweets through other steps such as stopword removal, case folding, lemmatization, etc. However, the benchmarking approach looks interesting and doable. Thanks @rsk97!

@kmehra can you post code snippets for both the pipeline and your own code following the blog post? If you're using the same model and have multi_class=True, the results should be the same.

Also:

  • Case sensitivity is normal. This is just a tokenization choice on the part of the model creators. I don't think there's any reason to worry about it, but if you want the results to be the same you can just .lower() everything yourself before you send it to the model.
  • Assuming you’re using the same model, the pipeline is likely faster because it batches the inputs. If you pass a single sequence with 4 labels, you have an effective batch size of 4, and the pipeline will pass these through the model in a single pass.
  • The pipeline ignores neutral either way, and it additionally ignores contradiction when multi_class=False.
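
A rough sketch of that scoring logic (assuming the logit order contradiction, neutral, entailment used by facebook/bart-large-mnli):

import torch

def scores_from_nli_logits(logits, multi_class):
    # logits: tensor of shape (num_candidate_labels, 3), one NLI forward pass per candidate label
    if multi_class:
        # per label: softmax over entailment vs contradiction only, neutral is dropped
        entail_contr = logits[:, [0, 2]]
        return entail_contr.softmax(dim=1)[:, 1]
    # single-label case: softmax of the entailment logit across all candidate labels
    return logits[:, 2].softmax(dim=0)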
1 Like

Sure, thanks for the help @joeddav
Sharing the code snippet below running on an example tweet.

import numpy as np
from transformers import BartTokenizer, BartForSequenceClassification, pipeline

TERMS = [...]  # list of candidate labels
HYPOTHESES = ['This text is about ' + x for x in TERMS]  # the labels written out in the template form

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer, framework='pt')
  1. Using the model w/o the pipeline:

  2. Using the pipeline:

     def get_labels_pipeline(tweet, threshold=THRESHOLD):
         '''Method to get the labels for a tweet based on the threshold specified'''
         topics = []
         results = classifier(tweet, TERMS, multi_class=True)
         for idx, score in enumerate(results['scores']):
             score = score * 100
             if score >= threshold:
                 topics.append((results['labels'][idx], np.round(score, 2)))
         return topics

Example:

text = "West Bengal calls for Indian Army's support to restore essential infrastructure, services after Cyclone Amphan havoc CycloneAmphan Amphan AmphanUpdates"

W/o the pipeline - get_labels(text, threshold=50)
[('resource availability', 50.59), ('relief measures', 85.47), ('infrastructure', 80.32), ('rescue', 66.81), ('news updates', 93.95), ('grievance', 79.94)]

With the pipeline - get_labels_pipeline(text, threshold=50)
[('infrastructure', 98.93), ('relief measures', 95.18), ('grievance', 92.81), ('news updates', 83.83), ('power supply', 80.1), ('utilities', 76.64), ('sympathy', 75.98), ('water supply', 73.14), ('rescue', 70.47)]

Thanks for the help again!

1 Like

The only difference I see at a glance is that the hypotheses in your manual example don't have a period at the end, while the pipeline's default template does. Frustrating, but I've found that omitting that period does have an impact. Let me know if that solves it; if not, I'll take a closer look.
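
For reference, the template can also be passed explicitly so that both setups use exactly the same hypothesis string (a small sketch; classifier, tweet, and TERMS are the objects from the snippet above):

results = classifier(
    tweet,
    TERMS,
    hypothesis_template='This text is about {}.',  # trailing period included, matching the manual template
    multi_class=True,
)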

2 Likes

Ahh, that fixed it! Thanks a lot @joeddav

1 Like

@joeddav I am having the same issue that @hanman is having: no zero-shot task available. I am using transformers 3.0.2.

Version: 3.0.2
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Author: Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors
Author-email: thomas@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.7/site-packages
Requires: sentencepiece, regex, filelock, numpy, packaging, tokenizers, sacremoses, tqdm, requests
Required-by: 

Thanks

2 Likes

Zero-shot hasn't made it into a release yet. You can find it on master: https://github.com/huggingface/transformers/blob/5ab21b072fa2a122da930386381d23f95de06e28/src/transformers/pipelines.py#L982

2 Likes

@joeddav I’m running into the same issue as hanman and @colinferguson. I am also using transformers 3.0.2.

Would appreciate any advice.

@nedai 3.1.0 was officially released today, so just upgrade transformers and you should be good.
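
If it helps, a quick check after upgrading (a minimal sketch; the zero-shot task is only registered in 3.1.0 and later, so older releases raise the KeyError shown earlier in the thread):

import transformers
print(transformers.__version__)  # should be 3.1.0 or later

from transformers import pipeline
classifier = pipeline('zero-shot-classification')  # no KeyError once the upgrade is in place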