Hi,
UPDATE: notebook to reproduce: https://colab.research.google.com/drive/1t-ApjHqdSo90NoXSJ5baeh7h-gJx8bLt?usp=sharing
I have a large number of unlabeled texts stored in a Pandas dataframe with a single column called “text”.
I’d like to apply zero-shot classification to all these texts in a batched way using HuggingFace Datasets’ .map(function, batched=True) functionality. I defined the function that I want to apply to batches as follows:
def zero_shot_classify_sequences(examples, threshold=0.5):
    # first, send batch of texts through pipeline
    texts = examples['text']
    outputs = classifier(texts, candidate_labels, multi_label=True)
    # next, for each output:
    final_outputs = []
    for output in outputs:
        # create dictionary (predicted_labels, confidence)
        final_output = {}
        for label, score in zip(output['labels'], output['scores']):
            if score > threshold:
                final_output[label] = score
        final_outputs.append(final_output)
    assert len(final_outputs) == len(texts)
    # set final outputs
    examples['predicted_labels'] = final_outputs
    return examples
The candidate labels are defined outside of this function.
In other words, I’d like to add a new column “predicted_labels”, which, for a batch of texts, should be a list of dictionaries (each dictionary maps labels to confidence values for a given text, keeping only those with a confidence value > 0.5). However, when I run updated_dataset = dataset.map(zero_shot_classify_sequences, batched=True, batch_size=10), the output does not look like I’d expect. For a given text, I get the following:
'predicted_labels': {'Delivery & fulfilment technology': None,
                     'Novel processing techniques & Equipments': None,
                     'Plant-based': None,
                     'Retail tech': None}
This should not be the case. If none of the confidence values is higher than the threshold of 0.5, the dictionary of “predicted labels” should be empty for that example.
Does this have to do with a list of dictionaries not being supported by Apache Arrow? Or is it supported?
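One workaround I could try, assuming the cause is that Arrow stores a column of dictionaries as a struct with one fixed set of fields (so labels missing from a given row come back as None): keep the result in two parallel list columns instead of one dictionary. The sketch below is plain Python and only does the filtering step. The helper name filter_batch and the column name "predicted_scores" are my own placeholders, not part of any library.

```python
def filter_batch(outputs, threshold=0.5):
    """Convert pipeline-style outputs into two parallel list columns.

    `outputs` is assumed to be a list of dicts, each with a 'labels' list
    and a 'scores' list of equal length (the shape used in the function above).
    """
    predicted_labels = []
    predicted_scores = []
    for output in outputs:
        # keep only the (label, score) pairs above the threshold
        kept = [
            (label, score)
            for label, score in zip(output["labels"], output["scores"])
            if score > threshold
        ]
        # an example with no label above the threshold gets two empty lists
        predicted_labels.append([label for label, _ in kept])
        predicted_scores.append([score for _, score in kept])
    return {
        "predicted_labels": predicted_labels,
        "predicted_scores": predicted_scores,
    }
```

Inside zero_shot_classify_sequences, the idea would be to call filter_batch(outputs, threshold) and copy both lists into examples before returning. Variable-length lists of strings and floats do not need a shared set of keys across rows, so an example with no label above the threshold should simply end up with empty lists. A {label: score} dictionary can be rebuilt later with dict(zip(labels, scores)).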