New pipeline for zero-shot text classification

So does that mean that when I run the model individually (w/o the pipeline), the probabilities do not ignore neutral? I’m trying to understand which is better for my use case and what creates the difference between the two cases so I can account for it.
As for the questions:

  1. Will try and look into parallelism. Any particular approach (or resources) you'd suggest? Is there a parameter I could use, like n_jobs or something?

  2. The basic pre-processing, such as removing certain symbols (# and @), is being done regardless. The question is whether to run the text through other steps such as stopword removal, case folding, lemmatization, etc. However, the benchmark approach looks interesting and doable. Thanks @rsk97!

@kmehra can you post the code snippets, both using the pipeline and using your own code following the blog post? If you’re using the same model and have multi_class=True, the results should be the same.

Also:

  • Case sensitivity is normal. This is just a tokenization choice on the part of the model creators. I don’t think there’s any reason to worry about it, but if you want the results to be the same you can just .lower() everything yourself before you send it to the model.
  • Assuming you’re using the same model, the pipeline is likely faster because it batches the inputs. If you pass a single sequence with 4 labels, you have an effective batch size of 4, and the pipeline will pass these through the model in a single pass.
  • The pipeline does ignore neutral, and it also ignores contradiction when multi_class=False (see the sketch below).
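
For reference, here is a rough sketch of how the pipeline turns the NLI model's predictions into label scores. This is a simplified illustration rather than the exact pipeline source; it assumes bart-large-mnli's label order (contradiction, neutral, entailment) and the pipeline's default "This example is {}." template:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

def score_labels(sequence, labels, template='This example is {}.', multi_class=True):
    # One premise/hypothesis pair per candidate label, run as a single batch.
    premises = [sequence] * len(labels)
    hypotheses = [template.format(label) for label in labels]
    inputs = tokenizer(premises, hypotheses, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs)[0]  # shape: (num_labels, 3)
    if multi_class:
        # Each label is scored independently: softmax over [contradiction, entailment],
        # so the neutral logit is ignored.
        scores = logits[:, [0, 2]].softmax(dim=1)[:, 1]
    else:
        # Labels compete with each other: softmax of the entailment logit across all
        # candidate labels, so neutral and contradiction are both ignored.
        scores = logits[:, 2].softmax(dim=0)
    return dict(zip(labels, scores.tolist()))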

Sure, thanks for the help @joeddav
Sharing the code snippet below running on an example tweet.

import numpy as np
from transformers import BartForSequenceClassification, BartTokenizer, pipeline

# TERMS - list of candidate labels
HYPOTHESES = ['This text is about ' + x for x in TERMS]  # labels in the proper template form

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer, framework='pt')
  1. Using the model w/o the pipeline:

  2. Using the pipeline:
    def get_labels_pipeline(tweet, threshold=THRESHOLD):
        '''Get the labels for a tweet based on the threshold specified.'''
        topics = []
        results = classifier(tweet, TERMS, multi_class=True)
        for idx, score in enumerate(results['scores']):
            score = score * 100
            if score >= threshold:
                topics.append((results['labels'][idx], np.round(score, 2)))
        return topics

Example:

text = "West Bengal calls for Indian Army's support to restore essential infrastructure, services after Cyclone Amphan havoc CycloneAmphan Amphan AmphanUpdates"

W/o the pipeline - get_labels(text, threshold=50)
[('resource availability', 50.59), ('relief measures', 85.47), ('infrastructure', 80.32), ('rescue', 66.81), ('news updates', 93.95), ('grievance', 79.94)]

With the pipeline - get_labels_pipeline(text, threshold=50)
[('infrastructure', 98.93), ('relief measures', 95.18), ('grievance', 92.81), ('news updates', 83.83), ('power supply', 80.1), ('utilities', 76.64), ('sympathy', 75.98), ('water supply', 73.14), ('rescue', 70.47)]

Thanks for the help again!

The only difference I see at a glance is that the hypotheses in your manual example don’t have a period at the end, while the pipeline’s default template does. Frustrating, but I’ve found that omitting that period does have an impact. Let me know if that solves it; if not, I’ll take a closer look.
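
If it helps, one way to keep the two code paths aligned is to make the template explicit in both places. hypothesis_template is a real argument of the zero-shot pipeline; the lines below just reuse the classifier and TERMS from the snippet above:

HYPOTHESES = ['This text is about ' + x + '.' for x in TERMS]  # manual path: note the trailing period
results = classifier(text, TERMS, hypothesis_template='This text is about {}.', multi_class=True)  # pipeline path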

Ahh, that fixed it! Thanks a lot @joeddav

@joeddav I am having the same issue that @hanman is having: no zero-shot pipeline available. I’m using transformers 3.0.2.

Version: 3.0.2
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers
Author: Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors
Author-email: thomas@huggingface.co
License: Apache
Location: /opt/conda/lib/python3.7/site-packages
Requires: sentencepiece, regex, filelock, numpy, packaging, tokenizers, sacremoses, tqdm, requests
Required-by: 

Thanks

Zero-shot hasn’t made it into a release yet. You can find it on master. https://github.com/huggingface/transformers/blob/5ab21b072fa2a122da930386381d23f95de06e28/src/transformers/pipelines.py#L982

@joeddav I’m running into the same issue as hanman and @colinferguson. I am also using transformers 3.0.2.

Would appreciate any advice.

@nedai 3.1.0 was officially released today, so just upgrade transformers and you should be good.

Hi, thanks for this wonderful demo! I have been running the zero-shot pipeline for my use case by passing each text and its corresponding list of candidate labels. However, it takes around 3 hours on a GPU to classify the ~22000 sentences (each sentence can have a varying number of labels, ranging from 70 to 140).
Is there a way to reduce the computation time by passing the embeddings to the sequence classification model directly, rather than the raw list of labels?

@akshatap19 yeah, that number of labels is tough. Assuming you just mean the sentence encodings rather than the actual word embeddings, yes, that might give you a small boost. You’d have to work with the model manually rather than with pipelines though (example here). Also, try using BartTokenizerFast rather than BartTokenizer and passing that to the pipeline factory. Additionally (see the sketch after this list):

  • Use mixed precision. This is pretty easy if you’re using PyTorch 1.6.
  • Use ONNX. Tutorial notebook here. One of our contributors, @valhalla, also has a project that wraps our pipelines with ONNX Runtime acceleration built in. See if that gives you a boost.
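
A rough sketch of the fast-tokenizer, GPU, and mixed-precision suggestions combined (an illustration only, assuming a CUDA device is available; the example text and labels are made up):

import torch
from transformers import BartForSequenceClassification, BartTokenizerFast, pipeline

tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
classifier = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer, device=0)  # device=0 -> first GPU

# Mixed precision with the PyTorch 1.6+ autocast context; the pipeline call itself is
# unchanged, only the forward passes run in half precision where it is safe to do so.
with torch.cuda.amp.autocast():
    result = classifier('West Bengal calls for Indian Army support after Cyclone Amphan',
                        ['infrastructure', 'relief measures', 'news updates'],
                        multi_class=True)
print(result['labels'], result['scores'])
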

Thanks Joe for sharing the project!

@akshatap19
Let me know if you try onnx_transformers. Would love to hear your feedback. 🙂 Currently the zero-shot-classification pipeline supports roberta-large-mnli instead of BART, since BART isn’t yet tested with ONNX. If people find the project useful I’ll start adding more models and the remaining pipelines.
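
For anyone who wants to try it, a minimal sketch based on my reading of the onnx_transformers README (I’m assuming it mirrors the transformers pipeline factory with an extra onnx flag; check the project for the exact, current API):

from onnx_transformers import pipeline

# Zero-shot here runs roberta-large-mnli exported to ONNX with onnxruntime acceleration.
classifier = pipeline('zero-shot-classification', onnx=True)
print(classifier('Who are you voting for in 2020?', ['politics', 'economics', 'public health']))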

@akshatap19 Me too. If you are able to run the model on a large number of sentences more quickly, please ping or comment back. Currently, processing a single sentence with about 10-12 labels takes me almost 10 seconds. Would love to have a faster approach. Thank you.

Could you guys share one example here with 10-12 labels? I’ll try to benchmark it with ONNX and see if it’s faster than plain PyTorch.

Would be great to use an approach similar to sentence-transformers, where embedding the corpus is separated from embedding a query.

In this case, if we could first embed the labels, then separately embed a single sequence, and then as the next step compare those embeddings, that would probably speed things up (see the sketch below).
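
A rough sketch of that idea using the sentence-transformers library. Note this is an embedding-similarity approximation rather than the NLI-based pipeline, and the model name is only an example:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')  # any sentence encoder works

labels = ['infrastructure', 'relief measures', 'news updates']
label_embeddings = encoder.encode(labels, convert_to_tensor=True)  # computed once, reused for every sequence

def score(sequence):
    # Only the new sequence needs to be encoded at query time.
    seq_embedding = encoder.encode(sequence, convert_to_tensor=True)
    return dict(zip(labels, util.pytorch_cos_sim(seq_embedding, label_embeddings)[0].tolist()))

print(score('West Bengal calls for Indian Army support after Cyclone Amphan'))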

I take it that’s on CPU?

Yes, that is what I am planning to do after researching and trying out the ONNX pipelines. Let me know if you have tried it or have any resources that would make for a simpler implementation.

I tried it on colab using a GPU

Here’s one example. Text: 'When it comes to product strategy–we are stuck in implementation, not strategy.'
List of labels: ['specific', 'area', 'segment', 'survey', 'commercial', 'company', 'effort', 'business', 'function', 'solution', 'development', 'product', 'approach', 'leader', 'team', 'deliver', 'go_to_market', 'accelerate', 'develop', 'platform', 'understanding', 'system', 'tech', 'marketing', 'new_product', 'technology', 'enterprise', 'digital', 'organization', 'segmentation', 'leadership', 'enterprise_segmentation', 'implementation', 'marketing_organization', 'customer_experience', 'partner_management', 'health_services', 'strategy', 'strategic']

Make sure you’re constructing it with pipeline('zero-shot-classification', device=0). Even 100 labels shouldn’t take more than a few seconds on GPU.