How to run a pipeline over a dataset with multiple columns

Hi,

I want to run a pipeline over a dataset efficiently. I can’t use a KeyDataset or a KeyPairDataset, since I need three columns of the dataset. Specifically, the dataset is ‘cais/mmlu’.

Does anyone have any suggestions?

Hi! This should work:

dataset = dataset.select_columns(list_of_columns_to_pass_to_pipeline)
for out in pipeline(iter(dataset)):
    ...

But is this batched?

Also, while iterating over the dataset I get a warning that I ‘initialize the pipeline too many times’. I doubt this will make it go away, and that warning is ultimately the problem I’m trying to solve.

You can pass batch_size=<batch_size> to the pipeline call to make it batched (as explained in the pipeline batching section of the Transformers docs).

Can you share the exact warning message? You should pass a generator (e.g., iter(dataset)) to the pipeline so that it handles the iteration, instead of looping over the dataset and calling the pipeline on each example yourself.
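
For reference, here is a rough sketch of how the two suggestions fit together when several columns are needed. It assumes the three columns are ‘subject’, ‘question’ and ‘choices’, and that the pipeline takes a single string per example. The prompt layout and the helper names format_examples and run_batched are made up for illustration, so adapt them to your task.

```python
def format_examples(rows):
    """Yield one prompt string per row, built from three columns.

    `rows` can be any iterable of dicts, e.g. a datasets.Dataset.
    The column names and prompt layout are assumptions for this sketch.
    """
    letters = "ABCD"
    for row in rows:
        options = "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip(letters, row["choices"])
        )
        yield f"Subject: {row['subject']}\n{row['question']}\n{options}\nAnswer:"


def run_batched(pipe, rows, batch_size=8):
    """Hand the generator to the pipeline so it drives iteration and batching.

    `pipe` is expected to be a transformers pipeline object; calling it
    once on a generator avoids calling it once per example.
    """
    for out in pipe(format_examples(rows), batch_size=batch_size):
        yield out


# Hypothetical usage (the "all" config and the "test" split are assumptions):
# from datasets import load_dataset
# from transformers import pipeline
# dataset = load_dataset("cais/mmlu", "all", split="test")
# pipe = pipeline("text-generation", device=0)
# for out in run_batched(pipe, dataset, batch_size=8):
#     ...
```

The point is that the pipeline is called once, on a generator, rather than once per row, so it can group examples into batches of batch_size itself.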