How to save datasets as distributed with save_to_disk?

fujindemi · November 7, 2022, 10:46am

I have a lot of data to save. Is it possible to save data into multiple files and load multiple files together?

lhoestq · November 15, 2022, 5:54pm

Hi! Right now you have to shard the dataset yourself to save it as multiple files. I’m working on built-in support for saving into multiple files, and it will be available soon.

In the meantime you can do:

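from datasets import load_dataset

ds = load_dataset(...)  # pass split=... so this is a Dataset, not a DatasetDict
num_shards = 32
# save each contiguous shard to its own directory
for shard_idx in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=shard_idx, contiguous=True)
    shard.save_to_disk(f"path/to/shard_{shard_idx}")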

# reload later
from datasets import load_from_disk, concatenate_datasets
ds = concatenate_datasets([
    load_from_disk(f"path/to/shard_{shard_idx}")
    for shard_idx in range(num_shards)
])
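
To see why contiguous=True matters for the reload step, here is a rough pure-Python sketch of how the rows get split. It does not call the datasets library; the function names are made up for illustration, and the index arithmetic is my understanding of how sharding behaves, not a copy of the library code. With contiguous blocks, concatenating the shards in order gives back the original row order; with a strided split (every num_shards-th row), it would not.

```python
def contiguous_shard_rows(num_rows: int, num_shards: int, index: int) -> range:
    """Row indices of shard `index` when rows are split into contiguous blocks.

    Hypothetical helper: the first (num_rows % num_shards) shards get one extra row.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    div, mod = divmod(num_rows, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return range(start, end)


def strided_shard_rows(num_rows: int, num_shards: int, index: int) -> range:
    """Row indices of shard `index` when every num_shards-th row is taken."""
    return range(index, num_rows, num_shards)


def concatenated_order(shard_fn, num_rows: int, num_shards: int) -> list:
    """Row order obtained by concatenating shards 0..num_shards-1."""
    rows = []
    for index in range(num_shards):
        rows.extend(shard_fn(num_rows, num_shards, index))
    return rows


if __name__ == "__main__":
    # Contiguous shards concatenate back to the original order.
    print(concatenated_order(contiguous_shard_rows, 10, 3))
    # Strided shards contain the same rows, but in a different order.
    print(concatenated_order(strided_shard_rows, 10, 3))
```

So if row order matters, keep contiguous=True when saving and concatenate the shards in index order when reloading.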
