Thanks a lot @joeddav for the extra clarification!
@joeddav - I am getting the error below after installing transformers:
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
KeyError: "Unknown task zero-shot-classification, available tasks are ['feature-extraction', 'sentiment-analysis', 'ner', 'question-answering', 'fill-mask', 'summarization', 'translation_en_to_fr', 'translation_en_to_de', 'translation_en_to_ro', 'text-generation']"
@joeddav
I have the following two questions about zero-shot text classification.
- Can we train on our own data for zero-shot text classification? If yes, please share an example.
- What is the difference between sentiment-analysis and zero-shot text classification, and which one is better to use?
Hey, can you check which version of transformers you're using? If you update using the code provided in the Colab notebook, it should work.
- Yes, you can. The idea is to train the model on an NLI task. Then you can pass the custom model you have trained into the pipeline.
- You can also do sentiment analysis using the zero-shot text classification pipeline. But if you have sufficient data and the domain you're targeting for sentiment analysis is fairly niche, you could train a transformer (or any other model, for that matter) on the data you have.
Hope this makes sense and is helpful.
Thanks for the helpful answer, @rsk97. Let me just add a bit:
- I discuss this briefly in my blog post under "Classification as Natural Language Inference -> When Some Annotated Data is Available". In short, if you have a limited amount of labeled data, you can further fine-tune the pre-trained NLI model. Pass the true label for a given sequence in the same way as you would during inference, e.g. <cls> Who are you voting for in 2020 ? <sep> This text is about politics . <sep>, and calculate the loss as if you were doing NLI with the true label set to entailment. You should also pass an equal number of sequences with a randomly selected false label, such as <cls> Who are you voting for in 2020 ? <sep> This text is about sports . <sep>. For these fictitious examples, the target label should be set to contradiction. This method will work a little bit if you have a small amount of labeled data, but it will really excel if you have a large amount of data for some of your labels and only a small amount of data (or no data) for other labels.
- We also have a bunch of ready-trained sentiment classifiers in the model hub. Use one of those out of the box, or fine-tune it further on your particular dataset.
This brings up a good point that the zero shot classification pipeline should only be used in the absence of labeled data or when fine-tuning a model is not feasible. If you have a good set of labeled data and you are able to fine-tune a model, you should fine-tune a model. You will get better performance and at a lower computational cost.
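Here is a minimal sketch of how the training pairs described above could be assembled, in plain Python. The function name build_nli_pairs, the hypothesis template, and the example data are illustrative assumptions, not code from the blog post; tokenization and the fine-tuning loop itself are left out.

```python
import random

# Hypothetical template; use the same wording you plan to use at inference time.
TEMPLATE = "This text is about {}."


def build_nli_pairs(examples, all_labels, seed=0):
    """Turn (text, true_label) examples into NLI-style training pairs.

    Each example yields one pair with its true label (target: entailment)
    and one pair with a randomly chosen false label (target: contradiction).
    """
    rng = random.Random(seed)
    pairs = []
    for text, true_label in examples:
        pairs.append((text, TEMPLATE.format(true_label), "entailment"))
        false_choices = [label for label in all_labels if label != true_label]
        false_label = rng.choice(false_choices)
        pairs.append((text, TEMPLATE.format(false_label), "contradiction"))
    return pairs


labels = ["politics", "sports", "science"]
examples = [("Who are you voting for in 2020?", "politics")]
for premise, hypothesis, target in build_nli_pairs(examples, labels):
    print(premise, "|", hypothesis, "|", target)
```

Each premise/hypothesis pair would then be tokenized together and trained against whichever class index the NLI head uses for that target.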
Hey @joeddav
I am using the model to classify a bunch of tweets. Before reading about the pipeline, I was following the process shared in your blog post. When I run the same method through the pipeline with the same model and tokenizer (facebook/bart-large-mnli), the results come back faster, which helps a lot since I have to classify more than 100K tweets, but the resulting confidence scores are quite different. Can you help me understand what difference between the two approaches might be causing this? Thanks!
The probability distribution differs because of the way it is calculated. As mentioned in a few posts above, as long as you keep multi_class=False, the softmax is calculated over the entailment scores across all the classes. When you are not using the pipeline, the softmax would be calculated differently.
The speed variations seem interesting. Could you please share some numbers and the specifications of the machine you're using?
Hope this answer made sense, @kmehra.
Also, while running some experiments, I could see that the model was sensitive to case. I got two different classes as outputs when I used "Hey" vs "hey". Is there something that I'm missing or is this expected behaviour?
This makes sense. However, I am referring to the case where I am running the pipeline with multi_class=True over the candidate labels. In this case, I notice the probability distribution is different compared to running the model individually over each candidate label and tweet (sequence) pair. Is there something I'm missing or is this expected behavior?
Another thing I might need help with is the following questions, to get better results:
- To make the process faster, is it advisable to move to a smaller, lighter model than bart-large-mnli, or is there another way to process 100K tweets faster while keeping the accuracy high with bart-large-mnli?
- It is interesting to note that the model is sensitive to case. Accordingly, for my case, would it be better to run the model with preprocessed tweets as input (stopword removal, case folding, etc.), or should the raw tweet work better?
Thanks @rsk97
Yes, makes sense. Even when multi_class=True, the probability of each class is calculated independently on just entailment vs. contradiction (neutral is ignored, if I'm right).
For the questions:
- Well, one definite way to make it faster is to move to a smaller model. Keeping the exact accuracy while making it faster is tricky. A couple of options that could be explored are parallelism, quantization of the layers, etc.
- I'm assuming you'll need pre-processing, as otherwise certain symbols in conjunction with words might end up as a different token. But you'll have to make sure you use pre-processing steps that take the nature of the tweet into account, because the hashtags, mentions, etc. are core parts of the tweet. You wouldn't want to miss out on those.
On the performance side, I guess you could benchmark by experimenting on a smaller dataset (pre-processed vs. raw) and see how it goes.
Let me know if this looks correct.
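For the pre-processed vs. raw comparison, a light clean-up step might look like the sketch below. The function clean_tweet, its regular expressions, and the sample tweet are all made up for illustration; which steps actually help is exactly what the benchmark would need to show.

```python
import re

URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text, lowercase=False):
    """Light tweet clean-up that keeps hashtag and mention words.

    Removes URLs, strips the '#' and '@' symbols but keeps the word after
    them, and collapses repeated whitespace. Case folding is optional so
    that raw-case and lowercased variants can be benchmarked separately.
    """
    text = URL_RE.sub(" ", text)
    text = text.replace("#", "").replace("@", "")
    text = SPACE_RE.sub(" ", text).strip()
    return text.lower() if lowercase else text


raw = "Power is out in #Kolkata  @relief_desk https://example.com/status"
print(clean_tweet(raw))
print(clean_tweet(raw, lowercase=True))
```

Running the classifier on a few hundred tweets in both the raw and cleaned form, and comparing the predicted labels, would give a rough idea of how much the clean-up matters.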
So does that mean the probabilities from running the model individually (w/o the pipeline) do not ignore neutral? I'm trying to understand what might be better for my use case and what creates the difference between the two cases so I can account for it.
As for the questions:
- Will try and look into parallelism. Any particular way (or resources) you suggest? Is there a parameter that I could use, like n_jobs or something?
- The basic pre-processing, such as removing certain symbols (# and @), is being done regardless. The question is whether to run it through other steps such as stopword removal, case folding, lemmatization, etc. However, the benchmark approach looks interesting and doable. Thanks @rsk97!
@kmehra can you post the code snippet both using the pipeline and your own code following the blog post? If you're using the same model and have multi_class=True, the result should be the same.
Also:
- Case sensitivity is normal. This is just a tokenization choice on the part of the model creators. I don't think there's any reason to worry about it, but if you want the results to be the same you can just .lower() everything yourself before you send it to the model.
- Assuming you're using the same model, the pipeline is likely faster because it batches the inputs. If you pass a single sequence with 4 labels, you have an effective batch size of 4, and the pipeline will pass these through the model in a single pass.
- The pipeline does ignore neutral and also ignores contradiction when multi_class=False.
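To make the difference between the scoring modes concrete, here is a small sketch in plain Python that works on made-up logits rather than real model output. It assumes the NLI head orders its classes as contradiction, neutral, entailment (worth verifying against model.config.label2id for the checkpoint in use); the function names are illustrative and this is not the pipeline's actual source.

```python
import math

# Assumed class order of the NLI head; verify against the model's config.
CONTRADICTION, NEUTRAL, ENTAILMENT = 0, 1, 2


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(logits):
    """multi_class=False style: softmax over entailment logits across labels."""
    return softmax([row[ENTAILMENT] for row in logits])


def multi_label_scores(logits):
    """multi_class=True style: per label, softmax over contradiction vs. entailment."""
    return [softmax([row[CONTRADICTION], row[ENTAILMENT]])[1] for row in logits]


def three_way_scores(logits):
    """Per label, softmax over all three classes (neutral is NOT dropped)."""
    return [softmax(row)[ENTAILMENT] for row in logits]


# One row of made-up logits per candidate label.
logits = [[0.0, 5.0, 0.0], [1.0, 0.0, 3.0]]
print(single_label_scores(logits))
print(multi_label_scores(logits))
print(three_way_scores(logits))
```

If a manual implementation keeps the neutral logit in the softmax, its per-label scores will generally come out lower than the two-way version, which is one possible source of a mismatch with the pipeline.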
Sure, thanks for the help @joeddav
Sharing the code snippet below running on an example tweet.
from transformers import BartForSequenceClassification, BartTokenizer, pipeline
# TERMS: list of candidate labels
HYPOTHESES = ['This text is about ' + x for x in TERMS]  # candidate labels wrapped in the hypothesis template
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer, framework='pt')
- Using the model w/o the pipeline:
- Using the pipeline:
import numpy as np

def get_labels_pipeline(tweet, threshold=THRESHOLD):
    """Method to get the labels for a tweet based on threshold specified"""
    topics = []
    results = classifier(tweet, TERMS, multi_class=True)
    for idx, score in enumerate(results['scores']):
        score = score * 100  # convert to a percentage
        if score >= threshold:
            topics.append((results['labels'][idx], np.round(score, 2)))
    return topics
Example:
Text = "West Bengal calls for Indian Army's support to restore essential infrastructure, services after Cyclone Amphan havoc CycloneAmphan Amphan AmphanUpdates"
W/o the pipeline - get_labels(text, threshold=50)
[('resource availability', 50.59), ('relief measures', 85.47), ('infrastructure', 80.32), ('rescue', 66.81), ('news updates', 93.95), ('grievance', 79.94)]
With the pipeline - get_labels_pipeline(text, threshold=50)
[('infrastructure', 98.93), ('relief measures', 95.18), ('grievance', 92.81), ('news updates', 83.83), ('power supply', 80.1), ('utilities', 76.64), ('sympathy', 75.98), ('water supply', 73.14), ('rescue', 70.47)]
Thanks for the help again!
The only difference I see at a glance is that the hypotheses in your manual example don't have a period at the end, while the pipeline's default template does. Frustrating, but I've found that omitting that period does have an impact. Lmk if that solves it; if not, I'll take a closer look.
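One way to rule this out is to build the hypotheses for both code paths from a single template string. The helper below is a hypothetical sketch; build_hypotheses and the label list are illustrative, and the template wording is taken from the snippet above with a trailing period added.

```python
def build_hypotheses(labels, template="This text is about {}."):
    """Fill each candidate label into a hypothesis template."""
    return [template.format(label) for label in labels]


labels = ["infrastructure", "relief measures"]
with_period = build_hypotheses(labels)
without_period = build_hypotheses(labels, template="This text is about {}")
for a, b in zip(with_period, without_period):
    print(repr(a), "vs", repr(b))
```

As far as I know, the pipeline call also accepts a hypothesis_template argument, so passing the same string there and to the manual code should keep the two inputs identical.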
@joeddav I am having the same issue that @hanman is having, no zero-shot available. Using transformers 3.0.2
Version: 3.0.2
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Author: Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors
Author-email: thomas@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.7/site-packages
Requires: sentencepiece, regex, filelock, numpy, packaging, tokenizers, sacremoses, tqdm, requests
Required-by:
Thanks
Zero shot hasn't made it to release yet. You can find it on master. https://github.com/huggingface/transformers/blob/5ab21b072fa2a122da930386381d23f95de06e28/src/transformers/pipelines.py#L982
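A quick way to see whether an environment is still on 3.0.2 or earlier is sketched below. It relies on importlib.metadata from the standard library (Python 3.8 and later); the helper names are made up for illustration, and the only thing the comparison assumes is that any build with the pipeline is newer than 3.0.2.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse the leading numeric parts, e.g. '3.0.2' -> (3, 0, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def newer_than(version, baseline="3.0.2"):
    """True if `version` is strictly newer than `baseline`."""
    return version_tuple(version) > version_tuple(baseline)


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif newer_than(found):
    print(f"transformers {found} is newer than 3.0.2")
else:
    print(f"transformers {found} does not include the zero-shot pipeline yet")
```

Until a release includes it, installing straight from the repository, e.g. pip install git+https://github.com/huggingface/transformers.git, should pick up the version on master.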
@joeddav I'm running into the same issue as hanman and @colinferguson. I am also using transformers 3.0.2.
Would appreciate any advice.