New pipeline for zero-shot text classification

So does that mean the probabilities from running the model individually (w/o the pipeline) do not ignore neutral? I’m trying to understand which might be better for my use case and what creates the difference between the two cases so I can account for it.
As for the questions:

  1. Will try to look into parallelism. Any particular approach (or resources) you’d suggest? Is there a parameter I could use, like n_jobs or something?

  2. The basic pre-processing is being done regardless, such as removing certain symbols (# and @). The question is whether to run it through other steps such as stopword removal, case folding, lemmatization, etc. However, the benchmark approach looks interesting and doable. Thanks @rsk97!

@kmehra can you post the code snippet both using the pipeline and your own code following the blog post? If you’re using the same model and have multi_class=True, the result should be the same.

Also:

  • Case sensitivity is normal. This is just a tokenization choice on the part of the model creators. I don’t think there’s any reason to worry about it, but if you want the results to be the same you can just .lower() everything yourself before you send it to the model.
  • Assuming you’re using the same model, the pipeline is likely faster because it batches the inputs. If you pass a single sequence with 4 labels, you have an effective batch size of 4, and the pipeline will pass these through the model in a single pass.
  • The pipeline does ignore neutral, and it also ignores contradiction when multi_class=False (sketched below).
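For reference, here is a minimal sketch of that scoring logic done by hand. It assumes facebook/bart-large-mnli (whose output logits are ordered [contradiction, neutral, entailment]) and an illustrative hypothesis template; it is a sketch of the idea, not the pipeline’s source.

import torch
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

sequence = 'One day I will see the world'
labels = ['travel', 'cooking', 'dancing']
hypotheses = ['This text is about {}.'.format(label) for label in labels]

# One (premise, hypothesis) pair per candidate label, run as a single batch,
# which is why the pipeline is faster than scoring the labels one at a time.
inputs = tokenizer([sequence] * len(labels), hypotheses, return_tensors='pt', padding=True)
logits = model(**inputs)[0]  # shape: (num_labels, 3)

# multi_class=True: drop the neutral logit, softmax over [contradiction, entailment]
# per label, and keep the entailment probability as that label's independent score.
multi_scores = logits[:, [0, 2]].softmax(dim=1)[:, 1]

# multi_class=False: contradiction and neutral are both ignored and the entailment
# logits compete against each other in a single softmax across labels.
single_scores = logits[:, 2].softmax(dim=0)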

Sure, thanks for the help @joeddav
Sharing the code snippet below running on an example tweet.

from transformers import BartForSequenceClassification, BartTokenizer, pipeline
import numpy as np

# TERMS is the list of candidate labels (contents omitted here)
HYPOTHESES = ['This text is about ' + x for x in TERMS]  # labels put into the template form

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer, framework='pt')
  1. Using the model w/o the pipeline (get_labels; the snippet isn’t shown here, see the rough sketch after item 2):

  2. Using the pipeline:

    def get_labels_pipeline(tweet, threshold=THRESHOLD):
        '''Return the labels for a tweet whose scores meet the specified threshold.'''
        topics = []
        results = classifier(tweet, TERMS, multi_class=True)
        for idx, score in enumerate(results['scores']):
            score = score * 100
            if score >= threshold:
                topics.append((results['labels'][idx], np.round(score, 2)))
        return topics
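Since the snippet for item 1 wasn’t reproduced above, here is a rough reconstruction of what a manual get_labels along the lines of the blog post might look like. It reuses TERMS, HYPOTHESES, THRESHOLD, tokenizer, and model from the setup above and assumes the [contradiction, neutral, entailment] logit ordering of facebook/bart-large-mnli; treat it as an illustration, not the original code.

import numpy as np
import torch

def get_labels(tweet, threshold=THRESHOLD):
    '''Score each hypothesis by entailment vs. contradiction and keep labels above the threshold.'''
    topics = []
    with torch.no_grad():
        for term, hypothesis in zip(TERMS, HYPOTHESES):
            inputs = tokenizer(tweet, hypothesis, return_tensors='pt', truncation=True)
            logits = model(**inputs)[0]
            # ignore the neutral logit and softmax over [contradiction, entailment]
            score = float(logits[:, [0, 2]].softmax(dim=1)[0, 1]) * 100
            if score >= threshold:
                topics.append((term, np.round(score, 2)))
    return topics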

Example:

text = "West Bengal calls for Indian Army's support to restore essential infrastructure, services after Cyclone Amphan havoc CycloneAmphan Amphan AmphanUpdates"

W/o the pipeline - get_labels(text, threshold=50)
[('resource availability', 50.59), ('relief measures', 85.47), ('infrastructure', 80.32), ('rescue', 66.81), ('news updates', 93.95), ('grievance', 79.94)]

With the pipeline - get_labels_pipeline(text, threshold=50)
[('infrastructure', 98.93), ('relief measures', 95.18), ('grievance', 92.81), ('news updates', 83.83), ('power supply', 80.1), ('utilities', 76.64), ('sympathy', 75.98), ('water supply', 73.14), ('rescue', 70.47)]

Thanks for the help again!


The only difference I see at a glance is that the hypotheses in your manual example don’t have a period at the end while the pipeline’s default template does. Frustrating, but I’ve found omitting that period does have an impact. Let me know if that solves it; if not, I’ll take a closer look.
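For anyone else hitting this: you can also pass the template (trailing period included) to the pipeline explicitly via its hypothesis_template argument, so both code paths use exactly the same hypotheses, e.g.:

results = classifier(tweet, TERMS, hypothesis_template='This text is about {}.', multi_class=True)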


Ahh, that fixed it! Thanks a lot @joeddav


@joeddav I am having the same issue that @hanman is having: no zero-shot pipeline available. Using transformers 3.0.2.

Version: 3.0.2
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Author: Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors
Author-email: thomas@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.7/site-packages
Requires: sentencepiece, regex, filelock, numpy, packaging, tokenizers, sacremoses, tqdm, requests
Required-by: 

Thanks


Zero-shot hasn’t made it into a release yet. You can find it on master: https://github.com/huggingface/transformers/blob/5ab21b072fa2a122da930386381d23f95de06e28/src/transformers/pipelines.py#L982
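Until it lands in a release, you can install from source to get it:

pip install git+https://github.com/huggingface/transformers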


@joeddav I’m running into the same issue as hanman and @colinferguson. I am also using transformers 3.0.2.

Would appreciate any advice.

@nedai 3.1.0 was officially released today, so just upgrade transformers and you should be good.
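That is:

pip install --upgrade transformers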

Hi, thanks for this wonderful demo! I have been running the zero-shot pipeline for my use case by passing each text and its corresponding list of hypothesis labels; however, it takes around 3 hours on a GPU to classify ~22,000 sentences (each sentence can have anywhere from 70 to 140 candidate labels).
Is there a way to reduce the computation time by passing embeddings to the sequence classification model directly rather than the raw list of labels?


@akshatap19 yeah, that number of labels is tough. Assuming you just mean the sentence encodings rather than the actual word embeddings, yes, that might give you a small boost. You’d have to work with the model manually rather than with pipelines, though (example here). Also, try just using BartTokenizerFast rather than BartTokenizer and passing that to the pipeline factory. Additionally:

  • Use mixed precision. This is pretty easy if using PyTorch 1.6.
  • Use ONNX. Tutorial notebook here. One of our contributors, @valhalla, also has a project that wraps our pipelines with ONNX Runtime acceleration built in. See if that gives you a boost.
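A rough sketch combining the fast-tokenizer and mixed-precision suggestions above (assumes PyTorch >= 1.6 and a CUDA GPU; the sentence and labels are placeholders for your own data):

import torch
from transformers import BartForSequenceClassification, BartTokenizerFast, pipeline

tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer, device=0)

sentence = 'West Bengal calls for support to restore essential services.'  # placeholder
candidate_labels = ['infrastructure', 'relief measures', 'news updates']    # placeholder

# autocast runs the underlying forward passes in mixed precision on the GPU
with torch.cuda.amp.autocast():
    results = classifier(sentence, candidate_labels, multi_class=True)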

Thanks Joe for sharing the project!

@akshatap19
Let me know if you try onnx_transformers. Would love to hear your feedback. 🙂 Currently the zero-shot-classification pipeline supports roberta-large-mnli instead of BART, as BART isn’t yet tested in ONNX. If people find the project useful, I’ll start adding more models and the remaining pipelines.


@akshatap19 Me too. If you manage to run the model over a large number of sentences more quickly, please ping or comment back. Currently, processing a single sentence with about 10-12 labels takes me almost 10 seconds. Would love a faster approach. Thank you.

Could you share one example here with 10-12 labels? I’ll try to benchmark it with ONNX and see if it’s faster than plain PyTorch.


It would be great to use an approach similar to sentence-transformers, where embedding the corpus is separated from embedding a query.

In this case, we could first embed the labels, then separately embed a single sequence, and then compare those embeddings as a final step. That would probably speed things up.
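A rough sketch of that bi-encoder idea with sentence-transformers (note this is an embedding-similarity approach, not the NLI-based pipeline, so scores and quality will differ; the model name and labels here are just examples):

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')  # example model

labels = ['infrastructure', 'relief measures', 'news updates']  # placeholder labels
label_embeddings = encoder.encode(labels)  # computed once and reused for every sequence

def rank_labels(sequence):
    seq_embedding = encoder.encode([sequence])[0]
    # cosine similarity between the sequence and each precomputed label embedding
    sims = label_embeddings @ seq_embedding / (
        np.linalg.norm(label_embeddings, axis=1) * np.linalg.norm(seq_embedding))
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order]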


I take it that’s on CPU?

Yes, that is what I am planning to do after researching and trying out the ONNX pipelines. Let me know if you have tried it or have any resources that would make for a simpler implementation.

I tried it on Colab using a GPU.

Here’s one example:
text = 'When it comes to product strategy–we are stuck in implementation, not strategy.'
labels = ['specific', 'area', 'segment', 'survey', 'commercial', 'company', 'effort', 'business', 'function', 'solution', 'development', 'product', 'approach', 'leader', 'team', 'deliver', 'go_to_market', 'accelerate', 'develop', 'platform', 'understanding', 'system', 'tech', 'marketing', 'new_product', 'technology', 'enterprise', 'digital', 'organization', 'segmentation', 'leadership', 'enterprise_segmentation', 'implementation', 'marketing_organization', 'customer_experience', 'partner_management', 'health_services', 'strategy', 'strategic']

Make sure you’re constructing it with pipeline('zero-shot-classification', device=0). Even 100 labels shouldn’t take more than a few seconds on GPU.
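Putting that together with the example above (a sketch; the label list is truncated here for brevity):

from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli', device=0)

text = 'When it comes to product strategy–we are stuck in implementation, not strategy.'
labels = ['specific', 'area', 'segment', 'survey', 'commercial', 'company', 'effort', 'business']  # truncated; use the full list above

results = classifier(text, labels, multi_class=True)
print(list(zip(results['labels'], results['scores'])))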
