If I have multiple serialized datasets, is it good practice to use multiprocessing along with datasets.load_from_disk()?
Something like this:
import multiprocessing

import datasets
import gcsfs

BUCKET_NAME = "my-bucket"
GCS_FS = gcsfs.GCSFileSystem()


def load_ds(ds_path):
    # Each worker should load one serialized dataset directly from GCS.
    return datasets.load_from_disk(ds_path, fs=GCS_FS)


# Collect the paths of the tokenized datasets saved in the bucket.
ds_dirs = GCS_FS.listdir(f"{BUCKET_NAME}/saved_datasets")
ds_dirs = list(
    {
        f"{dd['name']}"
        for dd in ds_dirs
        if "tokenized" in dd["name"]
    }
)

# Load the datasets in parallel, then merge them into a single training set.
with multiprocessing.Pool() as pool:
    ds_list = pool.starmap_async(load_ds, ds_dirs).get()
    ds_list = [ds for ds in ds_list]

train_ds = datasets.concatenate_datasets(ds_list)
print(train_ds)
This snippet does not run as expected, though.
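For reference, the result I am after is what a plain sequential version would produce, roughly along these lines (same bucket layout as above; the bucket name is just a placeholder):

import datasets
import gcsfs

BUCKET_NAME = "my-bucket"
GCS_FS = gcsfs.GCSFileSystem()

# Same selection of tokenized dataset directories as above.
ds_dirs = [
    dd["name"]
    for dd in GCS_FS.listdir(f"{BUCKET_NAME}/saved_datasets")
    if "tokenized" in dd["name"]
]

# Load each dataset one after another, then concatenate into one training set.
ds_list = [datasets.load_from_disk(ds_dir, fs=GCS_FS) for ds_dir in ds_dirs]
train_ds = datasets.concatenate_datasets(ds_list)
print(train_ds)

The question is essentially whether replacing this sequential loop with a multiprocessing pool is a sensible way to speed things up, or whether there is a recommended approach.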