I have a pretty large dataset (2 million records), which consists of 2 columns:
- Text (up to 3-4 words, usually short)
- Labels for prediction (up to 3-4 words as well)
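
For concreteness, `data` below is an invented two-row stand-in for the real frame (my actual texts and labels differ, but the shape is the same: one short text plus a small set of candidate labels per record):

```python
import pandas as pd

# invented stand-in for the real 2M-row DataFrame: short German texts
# and a per-record list of candidate labels (each label 3-4 words at most)
data = pd.DataFrame({
    'text': ['Rechnung für März', 'Termin beim Zahnarzt'],
    'label': [['Finanzen', 'Privates'], ['Gesundheit', 'Privates']],
})
```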
What I want to do is apply a pretrained RoBERTa model for zero-shot classification. Here is how I did it:
```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# convert the pandas DataFrame to a datasets.Dataset
dataset = Dataset.from_pandas(data)

# load the model and tokenizer, then build the zero-shot pipeline
model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer, framework='pt')

# define the prediction function and apply it to the dataset
def prediction(record, classifier):
    # German for "The text is about {}"
    hypothesis_template = "Im Text geht es um {}"
    output = classifier(record['text'], record['label'], hypothesis_template=hypothesis_template)
    record['prediction'] = output['labels'][0]
    record['scores'] = output['scores'][0]
    return record

dataset = dataset.map(lambda x: prediction(x, classifier=classifier))
```
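
For reference, here is my rough understanding of what the pipeline does under the hood, written as plain PyTorch. This is only a sketch, not something I have validated; in particular, I am assuming the entailment logit sits at index 2, which should be checked against `model.config.label2id`:

```python
import torch

# score one record 'by hand': each (text, label) pair becomes an NLI
# problem, and the entailment logits are softmaxed across the candidates
# (this mirrors, as I understand it, the pipeline's default scoring)
def predict_native(text, candidate_labels, hypothesis_template="Im Text geht es um {}"):
    premises = [text] * len(candidate_labels)
    hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
    inputs = tokenizer(premises, hypotheses, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits   # shape: (num_candidates, 3)
    entailment = logits[:, 2]             # assumption: index 2 == 'entailment'
    probs = entailment.softmax(dim=0)
    best = int(probs.argmax())
    return candidate_labels[best], float(probs[best])
```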
But I am not sure this is the most efficient way to run inference: unfortunately it processes approximately 2-3 records per second, which is too slow. The official page (Pipelines) says that I should avoid batching if I am using a CPU. Still, my questions are:
- Is the `pipeline` wrapper fast enough, or should I stick to a 'lower level' approach (for example, native PyTorch, along the lines of the sketch above)?
- Is inference through `.map` considered good practice? If not, what should be used instead?
- Given the relatively short texts (maximum 5-6 words), should batching be used instead of processing one record at a time? (A sketch of what I mean follows this list.)
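
To make the batching question concrete: since every record carries its own candidate labels, I cannot hand the pipeline one shared `candidate_labels` list for the whole dataset, so the variant I imagine batches the underlying (text, hypothesis) NLI pairs myself. A rough sketch under the same assumptions as above (entailment at index 2, `label` being a list per record; `batch_size=32` is an arbitrary starting point, not a tuned value):

```python
import torch

def predict_batched(batch):
    template = "Im Text geht es um {}"
    # flatten the batch into one (premise, hypothesis) NLI pair per candidate label
    premises, hypotheses, offsets = [], [], [0]
    for text, labels in zip(batch['text'], batch['label']):
        for label in labels:
            premises.append(text)
            hypotheses.append(template.format(label))
        offsets.append(len(premises))
    inputs = tokenizer(premises, hypotheses, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    entailment = logits[:, 2]  # assumption: index 2 == 'entailment'
    predictions, scores = [], []
    # regroup per record and softmax the entailment logits across
    # that record's candidates, then keep the top label and its score
    for i, labels in enumerate(batch['label']):
        probs = entailment[offsets[i]:offsets[i + 1]].softmax(dim=0)
        best = int(probs.argmax())
        predictions.append(labels[best])
        scores.append(float(probs[best]))
    batch['prediction'] = predictions
    batch['scores'] = scores
    return batch

dataset = dataset.map(predict_batched, batched=True, batch_size=32)
```

With 32 records per batch this runs one forward pass over roughly 32 × num_labels short pairs, instead of one forward pass per record, which is where I would hope to gain speed even on CPU.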