Batch processing for stream dataset

Hello,

I am working on the OSCAR dataset and trying to filter some entries based on the results of a zero-shot classification model.

Since the OSCAR dataset is huge and I can’t load it all at once, I’m using the streaming mode of Datasets.

To filter the dataset, I use this function:

import numpy as np

def inference_batch(examples):
    outputs, scores, texts = [], [], []

    for example in examples["text"]:
        res = classifier_ort(example, classes)
        res_mean = np.mean(np.array(res["scores"]))

        outputs.append(res_mean > 0.5)
        scores.append(res_mean)
        texts.append(example)

    return {"text": texts, "is_class": outputs, "score": scores}

After that I call map() to iterate over the dataset:

updated_dataset = oscar_dataset_streamed.map(inference_batch,
                                             batched=True,
                                             batch_size=1000)
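For reference, if classifier_ort is a Transformers-style zero-shot pipeline, it can usually accept a list of texts in a single call, which lets the model batch internally instead of being invoked once per row. Here is a minimal sketch of that batched variant of the mapped function; fake_classifier, the classes list, and the sample texts are hypothetical stand-ins, not the actual model:

```python
import numpy as np

# Hypothetical stand-in for classifier_ort: a zero-shot-pipeline-like
# callable that takes a LIST of texts and returns one result dict per
# text, each with a "scores" list (one entry per candidate class).
def fake_classifier(texts, classes):
    return [{"labels": list(classes),
             "scores": [1.0 / len(classes)] * len(classes)}
            for _ in texts]

classes = ["news", "fiction", "spam"]  # hypothetical candidate labels

def inference_batch(examples):
    # One classifier call for the whole batch instead of a per-row loop:
    results = fake_classifier(examples["text"], classes)
    means = [float(np.mean(r["scores"])) for r in results]
    return {
        "text": examples["text"],
        "is_class": [m > 0.5 for m in means],
        "score": means,
    }

out = inference_batch({"text": ["a first document", "a second document"]})
```

With a real pipeline you would also pass its own batch_size argument so the model runs the batch on device in one go.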

And to read back the results:

for example in updated_dataset.take(100):
    print(example['id'])
    print(example['text'])

To me, this is not the most efficient way to process the dataset: the data still seems to be processed row by row rather than in batches, and I didn’t see any difference in speed between calling map with batched=True and with batched=False.
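That symptom would be consistent with the loop inside the mapped function: batched=True only changes how map hands data to the function (a dict of lists instead of single rows), but the classifier is still called once per example, so the model never sees an actual batch. A toy sketch with a hypothetical stub classifier that counts its calls makes this visible:

```python
calls = {"n": 0}

# Hypothetical stub standing in for classifier_ort; it just counts calls.
def stub_classifier(text, classes):
    calls["n"] += 1
    return {"scores": [0.5 for _ in classes]}

classes = ["a", "b"]  # hypothetical candidate labels
texts = [f"doc {i}" for i in range(10)]

# Same shape as the mapped function above: a per-example loop inside a
# "batched" function still makes one classifier call per example.
def per_row(batch):
    return {"score": [sum(stub_classifier(t, classes)["scores"])
                      for t in batch["text"]]}

result = per_row({"text": texts})
# calls["n"] is now equal to len(texts), regardless of the map batch size.
```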

So my question is: is there a more efficient way to process the OSCAR dataset using streaming mode together with batch processing?

Thank you