Hi,
I want to run a pipeline over a dataset, and I want to do so efficiently. I can’t use a KeyDataset or a KeyPairDataset, since I need three columns of the dataset. Specifically, the dataset is ‘cais/mmlu’.
Does anyone have any suggestions?
Hi! This should work:
dataset = dataset.select_columns(list_of_columns_to_pass_to_pipeline)
for out in pipeline(iter(dataset)):
    ...
But is this batched?
Also, while iterating over the dataset I get a warning that I ‘initialize the pipeline too many times’. I doubt this will solve that problem, and ultimately that’s the problem I’m trying to solve.
You can pass batch_size=<batch_size> to the pipeline to make it batched (as explained here).
Can you share the error message? You should pass a generator (e.g., iter(dataset)) to the pipeline so that it handles the iteration, instead of looping over the dataset yourself.
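For the three-column case, here is a minimal sketch of how the two suggestions above (a generator plus batch_size) might fit together. It is not a confirmed recipe. It assumes the "all" config and "test" split of ‘cais/mmlu’, the columns question, subject and choices, and a text-generation pipeline; the prompt layout and the helper names format_mmlu_prompt, prompt_stream and run are made up for illustration. Some tokenizers also need a pad token set before batching works.

```python
from string import ascii_uppercase
from typing import Dict, Iterable, Iterator


def format_mmlu_prompt(row: Dict) -> str:
    """Merge three columns of one row into a single prompt string.

    The prompt layout is an illustrative assumption, not an official format.
    """
    options = "\n".join(
        f"{ascii_uppercase[i]}. {choice}" for i, choice in enumerate(row["choices"])
    )
    return (
        f"Subject: {row['subject']}\n"
        f"Question: {row['question']}\n"
        f"{options}\n"
        "Answer:"
    )


def prompt_stream(rows: Iterable[Dict]) -> Iterator[str]:
    """Lazily yield one prompt per row so the pipeline drives the iteration."""
    for row in rows:
        yield format_mmlu_prompt(row)


def run(model_name: str, batch_size: int = 8):
    """Build the pipeline once and feed it the generator.

    model_name is a placeholder; the config, split and column names are assumptions.
    """
    # Imported here so the helpers above work without these libraries installed.
    from datasets import load_dataset
    from transformers import pipeline

    dataset = load_dataset("cais/mmlu", "all", split="test")
    dataset = dataset.select_columns(["question", "subject", "choices"])

    pipe = pipeline("text-generation", model=model_name)
    # Passing a generator (not single items in a Python loop) lets the
    # pipeline batch the inputs itself.
    for out in pipe(prompt_stream(dataset), batch_size=batch_size, max_new_tokens=1):
        yield out
```

The key point is that the pipeline is created once and receives the whole generator, rather than being called once per row inside your own loop.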