How to save datasets as distributed with save_to_disk?

lhoestq · November 15, 2022, 5:54pm

Hi! Right now you have to shard the dataset yourself to save it as multiple files, but I’m working on support for saving into multiple files directly. It will be available soon.

In the meantime you can do:

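from datasets import load_dataset

# ds should be a single Dataset (e.g. pass split=...), since .shard() is a Dataset method
ds = load_dataset(...)
num_shards = 32
for shard_idx in range(num_shards):
    # contiguous=True gives each shard a consecutive block of rows
    shard = ds.shard(num_shards=num_shards, index=shard_idx, contiguous=True)
    shard.save_to_disk(f"path/to/shard_{shard_idx}")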

# reload later
from datasets import load_from_disk, concatenate_datasets
ds = concatenate_datasets([
    load_from_disk(f"path/to/shard_{shard_idx}")
    for shard_idx in range(num_shards)
])
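
To see why contiguous=True matters for the reload step, here is a small pure-Python sketch (no datasets import needed). The helper names contiguous_bounds and strided_indices are made up for illustration, and the index arithmetic is only one plausible way to split rows, not necessarily what the library does internally. It shows that concatenating contiguous shards in index order gives back the original row order, while strided shards would come back shuffled.

```python
def contiguous_bounds(n_rows: int, num_shards: int, index: int) -> range:
    """Row indices of shard `index` when rows are split into consecutive blocks.

    Hypothetical helper: the first (n_rows % num_shards) shards get one extra row.
    """
    div, mod = divmod(n_rows, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return range(start, end)


def strided_indices(n_rows: int, num_shards: int, index: int) -> range:
    """Row indices of shard `index` when rows are dealt out round-robin."""
    return range(index, n_rows, num_shards)


rows = list(range(10))  # stand-in for dataset rows
num_shards = 4

contiguous = [
    [rows[i] for i in contiguous_bounds(len(rows), num_shards, k)]
    for k in range(num_shards)
]
strided = [
    [rows[i] for i in strided_indices(len(rows), num_shards, k)]
    for k in range(num_shards)
]

# Concatenate shards in index order, like the reload step does
rebuilt_contiguous = [r for shard in contiguous for r in shard]
rebuilt_strided = [r for shard in strided for r in shard]

print(rebuilt_contiguous == rows)  # True: original order restored
print(rebuilt_strided == rows)     # False: same rows, different order
```

So if row order matters after reloading, keep contiguous=True when sharding and concatenate the shards in the same index order.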
