Batch processing for a streaming dataset


I am working with the OSCAR dataset and trying to filter some entries based on the results of a zero-shot classification model.

Since the OSCAR dataset is huge and I can't load it all at once, I'm using the streaming mode of Datasets.

To filter the dataset, I use this function:

def inference_batch(examples):

    outputs, scores, texts = [], [], []

    for example in examples["text"]:
        res = classifier_ort(example, classes)
        res_mean = np.mean(np.array(res["scores"]))

        texts.append(example)
        scores.append(res_mean)
        outputs.append(res_mean > 0.5)

    return {"text": texts, "is_class": outputs, "score": scores}

After that, I call the map function to iterate through the dataset:

updated_dataset = dataset.map(inference_batch, batched=True)
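For context on what batched=True changes: map then hands the function a dict of lists (one slice of batch_size rows at a time) instead of a single row, and expects a dict of equal-length lists back. A stdlib-only mimic of that batching behaviour (my own sketch, not the actual datasets implementation):

```python
def batched_map(rows, fn, batch_size=4):
    """Mimic of map(batched=True) over an iterable of row dicts."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from _apply(fn, batch)
            batch = []
    if batch:  # flush the last, possibly smaller, batch
        yield from _apply(fn, batch)

def _apply(fn, batch):
    # Collate rows into a dict of lists, like datasets does:
    # fn sees {"text": [t1, t2, ...]}, not a single example.
    columns = {k: [r[k] for r in batch] for k in batch[0]}
    out = fn(columns)
    # Re-split the returned dict of lists back into rows.
    n = len(next(iter(out.values())))
    for i in range(n):
        yield {k: v[i] for k, v in out.items()}

rows = [{"text": f"doc {i}"} for i in range(6)]
upper = lambda cols: {"text": [t.upper() for t in cols["text"]]}
result = list(batched_map(rows, upper, batch_size=4))
# result is 6 rows with upper-cased text
```

The point of the mimic: if the function inside still processes texts one by one, batching only changes how rows are grouped, not how much work is done per text.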

And to get back the results

for example in updated_dataset.take(100):
    print(example)


To me, this does not look like the most efficient way to process the dataset: it streams the data row by row rather than in batches, and I saw no difference between calling map with batched=True and calling it with batched=False.
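One likely reason batched=True made no difference is that the loop inside the function still invokes the classifier once per text, so the batch is processed sequentially anyway. Transformers pipelines also accept a list of inputs, so the whole batch can be sent in a single call. A minimal sketch of that idea, with a dummy stand-in for classifier_ort and made-up class labels (both are assumptions, since the real ones aren't shown above):

```python
import numpy as np

# Dummy stand-in for the ONNX zero-shot classifier from the post; like a
# transformers pipeline given a list of texts, it returns one
# {"scores": [...]} dict per input.
def classifier_ort(texts, classes):
    return [{"scores": [0.9 for _ in classes]} for _ in texts]

classes = ["news", "sports"]  # made-up labels for the sketch

def inference_batch(examples):
    # One pipeline call for the whole batch instead of one call per text;
    # a real pipeline can then batch the forward passes internally.
    results = classifier_ort(examples["text"], classes)
    scores = [float(np.mean(r["scores"])) for r in results]
    return {
        "text": examples["text"],
        "score": scores,
        "is_class": [s > 0.5 for s in scores],
    }

out = inference_batch({"text": ["first document", "second document"]})
```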

So my question is: is there a way to process the OSCAR dataset faster using streaming mode and batch processing?

Thank you