New pipeline for zero-shot text classification

How’s it going?

I’m getting different entailment probabilities when I use the pipeline versus when I don’t. The code below does not use the pipeline:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

premise = 'claude giroux played for the flyers'
hypothesis = 'hockey'

# run through model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
logits = nli_model(x)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)


print('probs not using pipeline:', probs)

and my probabilities are
[[0.2292, 0.7708]]
where 0.7708 corresponds to entailment

The code below uses the pipeline:

from transformers import pipeline

premise = 'claude giroux played for the flyers'
hypothesis = 'hockey'

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

preds = classifier(premise, hypothesis, multi_label=False)

print('probs using pipeline:', preds)

and my entailment probability is [0.10335938632488251]

I also get the same probability as above when multi_label is set to True.

Is there a reason for the differing probabilities?
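
For reference, here is a sketch of a comparison that might isolate the difference, assuming the pipeline's default hypothesis template is "This example is {}." (so the model would see "This example is hockey." rather than the raw string "hockey"); passing hypothesis_template="{}" below simply turns that wrapping off:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# score the raw label text as the hypothesis, mirroring the manual encode() call above;
# multi_label=True should give a per-label entailment-vs-contradiction score, like the
# manual softmax over logits[:, [0, 2]]
preds = classifier('claude giroux played for the flyers', ['hockey'],
                   hypothesis_template="{}", multi_label=True)
print(preds)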

This is an interesting idea, but I find it too slow. Imagine I have a very long text and 100 different labels: the text will be encoded 100 times, once per label.

And if I have 10k long sentences, it will be very, very slow.

I like the first approach in your blog, where we encode the 10k sentences and the labels only once and then assign labels using cosine similarity (something like the sketch below). That would be much faster.
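
A minimal sketch of what I mean, assuming sentence-transformers (the checkpoint and example texts are just illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ['claude giroux played for the flyers',
             'the senate passed the bill yesterday']
labels = ['hockey', 'politics', 'science']

# every sentence and every label is encoded exactly once
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
label_embeddings = model.encode(labels, convert_to_tensor=True)

# cosine-similarity matrix of shape (num_sentences, num_labels)
similarities = util.cos_sim(sentence_embeddings, label_embeddings)
best = similarities.argmax(dim=1)
for sentence, idx in zip(sentences, best):
    print(sentence, '->', labels[int(idx)])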

What do you think?

Hi all,

I want to do zero-shot text classification for automotive parts, with around 3,200 candidate labels. The classification needs to be done on the basis of the part description.
Pretrained zero-shot models like BART-MNLI are not giving me good results, as they don't have much domain knowledge. How can I fine-tune a zero-shot model on the full corpus of descriptions? I think this would improve results a lot. I saw a similar approach in ULMFiT by fastai: they first train the language-model encoder to predict the next word over the whole corpus, then use that encoder as the backbone for a text classifier, and once that is fine-tuned, the results are better.

Thanks for any help. Please share any notebook or blog that can help me implement this…
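
Something like the following is what I have in mind for the first ULMFiT-style step (continuing language-model pretraining on the description corpus). This is only a sketch: the backbone, the example descriptions, and the hyperparameters are placeholders, not a tested recipe.

from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = 'roberta-base'  # placeholder encoder; swap in whatever backbone fits
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# in practice this would be the full corpus of part descriptions
descriptions = ['front brake pad set for sedan models', '12V alternator, 90A output']
dataset = Dataset.from_dict({'text': descriptions})
dataset = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True, max_length=128),
                      batched=True, remove_columns=['text'])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='domain-adapted-lm', num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()

# the adapted checkpoint could then be fine-tuned on an NLI dataset (e.g. MNLI)
# and used with the zero-shot pipeline in place of a generic MNLI model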

I am currently doing inference with valhalla/distilbart-mnli-12-1 and 30 possible candidate labels on about 70k data points. To get the labels, I am using batching and running this:
for out in tqdm.tqdm(classifier(KeyDataset(train_dataset, "input"), candidate_labels=list_of_topics, batch_size=256)):

Running on a Colab GPU, it looks like it will take about 7 hours. Is this expected, and is there anything I can do to speed it up?
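
A sketch of two settings that sometimes help, assuming a CUDA GPU and a reasonably recent transformers version (device and torch_dtype are standard pipeline() arguments; everything else stays as above):

import torch
from transformers import pipeline

classifier = pipeline('zero-shot-classification',
                      model='valhalla/distilbart-mnli-12-1',
                      device=0,                   # keep the model on the GPU
                      torch_dtype=torch.float16)  # half precision cuts memory and compute

# note: each text is still scored against every candidate label, so total work grows
# with len(candidate_labels); trimming the label list helps more than anything else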

Hi, I saw this statement and am curious whether it is still true. In our testing, the latency of the pipeline seems to scale linearly with the number of labels, but if I recreate a simplified version where I batch each sequence pair into a single forward pass, it is substantially faster. Has this automatic label-wise batching been removed from the pipeline since you originally posted this answer?
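
A minimal sketch of what I mean by the simplified version: every (premise, hypothesis) pair for one text goes through a single forward pass. The text and labels are illustrative, and the entailment index of 2 assumes the facebook/bart-large-mnli label order.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'facebook/bart-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

text = 'claude giroux played for the flyers'
labels = ['hockey', 'politics', 'science']
hypotheses = [f'This example is {label}.' for label in labels]

# one batch containing every (premise, hypothesis) pair for this text
inputs = tokenizer([text] * len(labels), hypotheses,
                   return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (num_labels, 3)

# softmax of the entailment logits across labels, mirroring multi_label=False
scores = logits[:, 2].softmax(dim=0)
print(dict(zip(labels, scores.tolist())))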

I am wondering how best to model a scenario where I want a binary classifier whose positive class has multiple labels (e.g., this article is about sports OR politics OR science). Should I use one meta-label (“sports or politics or science”), or should I use three separate labels and sum up the probabilities?
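
To make the two options concrete (the text and labels are illustrative, and this is only meant to show the calls, not to argue for either choice):

from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
text = 'the team clinched a playoff spot last night'

# option 1: a single meta-label scored on its own
meta = classifier(text, ['sports or politics or science'], multi_label=True)

# option 2: three separate labels; with multi_label=True each label gets an independent
# entailment-vs-contradiction score, which can then be combined (e.g. max or sum)
separate = classifier(text, ['sports', 'politics', 'science'], multi_label=True)
print(meta, separate)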