Batch processing for a streaming dataset


I am working with the OSCAR dataset and trying to filter some entries based on the results of a zero-shot classification model.

Since the OSCAR dataset is huge and I can't load it all at once, I'm using the streaming mode of Datasets.

To filter the dataset, I use this function:

def inference_batch(examples):

    outputs, scores, texts = [], [], []

    for example in examples["text"]:
        res = classifier_ort(example, classes)
        res_mean = np.mean(np.array(res["scores"]))

        texts.append(example)
        scores.append(res_mean)
        outputs.append(res_mean > 0.5)

    return {"text": texts, "is_class": outputs, "score": scores}

After that, I call the map function to iterate through the dataset:

updated_dataset = dataset.map(inference_batch, batched=True)
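For context on what batched=True changes: map then hands the function a dict of lists (one slice of batch_size rows at a time) instead of a single row, and expects a dict of equal-length lists back. A stdlib-only mimic of that batching behaviour (my own sketch, not the actual datasets implementation):

```python
def batched_map(rows, fn, batch_size=4):
    """Mimic of map(batched=True) over an iterable of row dicts."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _apply(fn, batch)
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield from _apply(fn, batch)

def _apply(fn, batch):
    # Collate rows into a dict of lists, like datasets does:
    # fn sees {"text": [t1, t2, ...]}, not a single example.
    columns = {k: [r[k] for r in batch] for k in batch[0]}
    out = fn(columns)
    # Re-split the returned dict of lists back into rows.
    n = len(next(iter(out.values())))
    for i in range(n):
        yield {k: v[i] for k, v in out.items()}

rows = [{"text": f"doc {i}"} for i in range(6)]
upper = lambda cols: {"text": [t.upper() for t in cols["text"]]}
result = list(batched_map(rows, upper, batch_size=4))
# result is 6 rows with upper-cased text
```

The point of the mimic: if the function inside still processes texts one by one, batching only changes how rows are grouped, not how much work is done per text.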

And to get back the results

for example in updated_dataset.take(100):
    print(example)


To me, this does not look like the most efficient way to process the dataset: it streams the data row by row rather than in batches, and I saw no difference between calling map with batched=True and calling it with batched=False.
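One likely reason batched=True made no difference is that the loop inside the function still invokes the classifier once per text, so the batch is processed sequentially anyway. Transformers pipelines also accept a list of inputs, so the whole batch can be sent in a single call. A minimal sketch of that idea, with a dummy stand-in for classifier_ort and made-up class labels (both are assumptions, since the real ones aren't shown above):

```python
import numpy as np

# Dummy stand-in for the ONNX zero-shot classifier from the post; like a
# transformers pipeline given a list of texts, it returns one
# {"scores": [...]} dict per input.
def classifier_ort(texts, classes):
    return [{"scores": [0.9 for _ in classes]} for _ in texts]

classes = ["news", "sports"]  # made-up labels for the sketch

def inference_batch(examples):
    # One pipeline call for the whole batch instead of one call per text;
    # a real pipeline can then batch the forward passes internally.
    results = classifier_ort(examples["text"], classes)
    scores = [float(np.mean(r["scores"])) for r in results]
    return {
        "text": examples["text"],
        "score": scores,
        "is_class": [s > 0.5 for s in scores],
    }

out = inference_batch({"text": ["first document", "second document"]})
```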

So my question is: is there a way to process the OSCAR dataset faster using streaming mode and batch processing?

Thank you