Datasets - Streaming Output to Arrow?

Is there any easy way to stream output to make a data pipeline with Datasets?

The use case: I'd like to read an HF dataset, run every row through an embedding model, and save the embeddings to disk as an HF dataset.

I wrap the HF Dataset in a Torch DataLoader so I can buffer it with multiple workers and keep the GPU from being starved of data. Then there's a Torch inference loop over minibatches that are run through the Sentence Transformers model on the GPU.

The best solution I can see at the moment is to rewrite ds.save_to_disk so it runs the Torch inference loop before saving to disk. I can easily create a pyarrow.parquet.ParquetWriter to save batches from inference to disk each iteration, but it won’t have the metadata files / convenient sharding for HF. Another method is to store all the Torch tensors in memory and join them with the HF dataset afterwards, which doesn’t work when it doesn’t fit in memory.

Is there a built-in feature or an easier method that I’m missing to accomplish this?


Hi! You can use `Dataset.from_generator()` 🙂

from datasets import Dataset

def pipeline():
    # f(batch) is your inference step; it returns output rows for a batch
    for batch in dataloader:
        for output_example in f(batch):
            yield output_example

ds = Dataset.from_generator(pipeline)
# ds.save_to_disk(...)
# or
# ds.push_to_hub(...)

Awesome, thanks @lhoestq! That did turn out to be simple…

