Cannot stream custom dataset

Hey all,

I’m stuck writing a custom dataset for loading .safetensors files (I need to store a lot of vectors in fp16, and Parquet does not support fp16). Here is what I would like to do:

I would like to download my entire custom dataset from Hugging Face and then use it as an IterableDataset to avoid preprocessing millions of rows up front. I can get this to work with local files, and it’s super fast:

from datasets import IterableDataset
from safetensors.torch import load_file
from torch.utils.data import DataLoader


def generate_sample(files):
    for file in files:
        dataset = load_file(file)
        n_rows = len(dataset["query_ids"])

        for i in range(n_rows):
            yield {
                "query_id": dataset["query_ids"][i],
                "features": dataset["features"][i],
                "label": dataset["labels"][i],
            }


if __name__ == "__main__":
    files = [f"part-0_split-{i}.safetensors" for i in range(10)]
    dataset = IterableDataset.from_generator(generate_sample, gen_kwargs={"files": files})
    loader = DataLoader(dataset, batch_size=16, num_workers=4)

    for batch in loader:
        pass

However, I would like to load the dataset automatically from my repository. The only way I have found to make this work is to stream the dataset, but for some reason that is super slow. And if I call load_dataset and then immediately call .to_iterable_dataset(), it preprocesses/generates my entire training data first, which is exactly what I want to avoid…

# This is very slow
dataset = load_dataset("philipphager/baidu-ultr-590k", split="train").to_iterable_dataset()

Can somebody help me out? My custom dataset builder is here: https://huggingface.co/datasets/philipphager/baidu-ultr-590k/blob/main/baidu-ultr-590k.py

Also, when streaming, my _generate_examples no longer receives local file paths, but URLs of the files that the DownloadManager would otherwise have downloaded. This does not happen when not streaming…

The script yields 1 GB vectors, so streaming is expected to be slow. Yielding smaller vector slices (if possible) will improve performance.
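A rough illustration of that suggestion, with NumPy arrays standing in for the loaded tensors (shapes are made up): instead of yielding a whole matrix per example, yield one row at a time so each record stays small:

```python
import numpy as np


def generate_rows(matrices):
    """Yield one row per example instead of whole matrices.

    Each yielded sample stays small, so serialization never has to
    handle a full multi-GB tensor as a single value.
    """
    for features in matrices:
        for i in range(features.shape[0]):
            yield {"features": features[i]}


# Toy stand-in for a loaded safetensors shard: 4 rows of 8 fp16 features.
shard = np.zeros((4, 8), dtype=np.float16)
samples = list(generate_rows([shard]))
print(len(samples))                  # 4 samples, one per row
print(samples[0]["features"].shape)  # (8,)
```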

> Also, when streaming, my _generate_examples no longer receives local file paths, but URLs of the files that the DownloadManager would otherwise have downloaded. This does not happen when not streaming…

dl_manager.download_and_extract returns a (fsspec) URL to avoid downloading the data locally in the streaming mode.
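A minimal sketch of what that means for the builder script: open whatever the download manager returns with fsspec, which handles both local paths (non-streaming) and remote URLs (streaming). The subsequent safetensors deserialization shown in the comment is an assumption about how the loading step could be adapted:

```python
import fsspec


def read_shard_bytes(path_or_url):
    # fsspec.open works for local paths and for remote URLs alike,
    # so the same code runs in streaming and non-streaming mode.
    with fsspec.open(path_or_url, "rb") as f:
        return f.read()


# Inside _generate_examples one could then deserialize in memory, e.g.:
#   from safetensors.torch import load  # load() takes raw bytes
#   tensors = load(read_shard_bytes(file))
```

Note that this reads the whole shard into memory before deserializing, which is another reason to keep individual shard files small when streaming.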