I have a pretty large dataset (2 million records), which consists of 2 columns:
- Text (up to 3-4 words, usually short)
- Labels for prediction (up to 3-4 words as well)
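
For concreteness, `data` below is an invented two-row stand-in for the real frame (my actual texts and labels differ, but the shape is the same: one short text plus a small set of candidate labels per record):

```python
import pandas as pd

# invented stand-in for the real 2M-row DataFrame: short German texts
# and a per-record list of candidate labels (each label 3-4 words at most)
data = pd.DataFrame({
    'text': ['Rechnung für März', 'Termin beim Zahnarzt'],
    'label': [['Finanzen', 'Privates'], ['Gesundheit', 'Privates']],
})
```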
What I want to do is apply a pretrained RoBERTa model for zero-shot classification. Here is how I did it:
```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# convert the pandas DataFrame to a datasets.Dataset
dataset = Dataset.from_pandas(data)

# load the model and tokenizer, then build the zero-shot pipeline
model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer, framework='pt')

# define the prediction function and apply it to the dataset
def prediction(record, classifier):
    # German for "The text is about {}"
    hypothesis_template = "Im Text geht es um {}"
    output = classifier(record['text'], record['label'], hypothesis_template=hypothesis_template)
    record['prediction'] = output['labels'][0]
    record['scores'] = output['scores'][0]
    return record

dataset = dataset.map(lambda x: prediction(x, classifier=classifier))
```
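
For reference, here is my rough understanding of what the pipeline does under the hood, written as plain PyTorch. This is only a sketch, not something I have validated; in particular, I am assuming the entailment logit sits at index 2, which should be checked against `model.config.label2id`:

```python
import torch

# score one record 'by hand': each (text, label) pair becomes an NLI
# problem, and the entailment logits are softmaxed across the candidates
# (this mirrors, as I understand it, the pipeline's default scoring)
def predict_native(text, candidate_labels, hypothesis_template="Im Text geht es um {}"):
    premises = [text] * len(candidate_labels)
    hypotheses = [hypothesis_template.format(label) for label in candidate_labels]
    inputs = tokenizer(premises, hypotheses, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits   # shape: (num_candidates, 3)
    entailment = logits[:, 2]             # assumption: index 2 == 'entailment'
    probs = entailment.softmax(dim=0)
    best = int(probs.argmax())
    return candidate_labels[best], float(probs[best])
```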
But I am not sure this is the most efficient way to run inference: unfortunately it processes approximately 2-3 records per second, which is too slow. The official page (Pipelines) says that I should avoid batching if I am using a CPU. Still, my questions are:
- Is the `pipeline` wrapper fast enough, or should I stick to a 'lower level' approach (for example, native PyTorch, along the lines of the sketch above)?
- Is inference through `.map` considered good practice? If not, what should be used instead?
- Given the relatively short texts (maximum 5-6 words), should batching be used instead of processing one record at a time? (A sketch of what I mean follows this list.)
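
To make the batching question concrete: since every record carries its own candidate labels, I cannot hand the pipeline one shared `candidate_labels` list for the whole dataset, so the variant I imagine batches the underlying (text, hypothesis) NLI pairs myself. A rough sketch under the same assumptions as above (entailment at index 2, `label` being a list per record; `batch_size=32` is an arbitrary starting point, not a tuned value):

```python
import torch

def predict_batched(batch):
    template = "Im Text geht es um {}"
    # flatten the batch into one (premise, hypothesis) NLI pair per candidate label
    premises, hypotheses, offsets = [], [], [0]
    for text, labels in zip(batch['text'], batch['label']):
        for label in labels:
            premises.append(text)
            hypotheses.append(template.format(label))
        offsets.append(len(premises))
    inputs = tokenizer(premises, hypotheses, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    entailment = logits[:, 2]  # assumption: index 2 == 'entailment'
    predictions, scores = [], []
    # regroup per record and softmax the entailment logits across
    # that record's candidates, then keep the top label and its score
    for i, labels in enumerate(batch['label']):
        probs = entailment[offsets[i]:offsets[i + 1]].softmax(dim=0)
        best = int(probs.argmax())
        predictions.append(labels[best])
        scores.append(float(probs[best]))
    batch['prediction'] = predictions
    batch['scores'] = scores
    return batch

dataset = dataset.map(predict_batched, batched=True, batch_size=32)
```

With 32 records per batch this runs one forward pass over roughly 32 × num_labels short pairs, instead of one forward pass per record, which is where I would hope to gain speed even on CPU.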