How to pass a pipeline over a dataset with multiple columns

surya-narayanan · September 2, 2023, 9:48pm

Hi,

I want to run a pipeline over a dataset efficiently. I can’t use KeyDataset or KeyPairDataset, since I need three columns of this dataset. Specifically, the dataset is ‘cais/mmlu’.

Does anyone have any suggestions?

mariosasko · September 6, 2023, 1:09pm

Hi! This should work:

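# Keep only the columns the pipeline needs
dataset = dataset.select_columns(list_of_columns_to_pass_to_pipeline)
# Hand the pipeline an iterator so that it drives the iteration itself
for out in pipeline(iter(dataset)):
    ...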

surya-narayanan · September 6, 2023, 6:35pm

But is this batched?

surya-narayanan · September 6, 2023, 6:39pm

Also, while iterating over the dataset I get a warning that I ‘initialize the pipeline too many times’. I doubt this will solve that problem, and that is ultimately the problem I’m trying to solve.

mariosasko · September 6, 2023, 10:46pm

You can pass batch_size=<batch_size> to the pipeline to make it batched (as explained here).

Can you share the error message? You should pass a generator (e.g., iter(dataset)) to the pipeline so that it handles the iteration, instead of looping over the dataset and calling the pipeline yourself.
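
To make the suggestion concrete, here is a minimal sketch of one way to combine several columns into a single input and let the pipeline iterate and batch by itself. It assumes the rows have question, subject, and choices columns (which three columns the original question needs is not stated), and the prompt layout is just an illustration. The helper names mmlu_prompts and run_batched are hypothetical, and the sample row is a stand-in so the snippet runs without downloading anything.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

LETTERS = "ABCD"


def mmlu_prompts(rows: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield one prompt string per row, built from several columns.

    Assumes each row has "question", "subject" and "choices" keys.
    """
    for row in rows:
        options = "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip(LETTERS, row["choices"])
        )
        yield f"Subject: {row['subject']}\n{row['question']}\n{options}\nAnswer:"


def run_batched(
    pipe: Callable[..., Iterable[Any]],
    rows: Iterable[Dict[str, Any]],
    batch_size: int = 8,
) -> Iterator[Any]:
    """Call the pipeline once on the generator and stream its outputs.

    `pipe` is expected to be a transformers pipeline object; it consumes
    the generator itself and groups the items into batches of `batch_size`.
    """
    yield from pipe(mmlu_prompts(rows), batch_size=batch_size)


# Stand-in for a loaded dataset split: any iterable of dict rows works.
sample_rows = [
    {
        "question": "What is 2 + 2?",
        "subject": "elementary_mathematics",
        "choices": ["3", "4", "5", "6"],
        "answer": 1,
    },
]

prompts = list(mmlu_prompts(sample_rows))
print(prompts[0])
```

With a real pipeline object and a loaded split in place of sample_rows, the loop would be `for out in run_batched(pipe, dataset, batch_size=8): ...`. The pipeline is called a single time on the whole generator rather than once per row, which is the usage pattern recommended above.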
