IndexError using save_to_disk

In my notebook, I generate a Python dictionary with data, which I convert to a dataset using Dataset.from_dict(). When I call save_to_disk on the dataset, I receive “IndexError: Index 1 out of range for dataset of size 1.” (see below for an image of the full error)
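For context, here is a minimal sketch of the workflow (the data below is a tiny hypothetical placeholder for the real columns, and "my_dataset" is a placeholder path):

```python
from datasets import Dataset

# Hypothetical placeholder for the real data: a single row whose columns
# hold very large lists (the real row was roughly 750 MB).
data = {
    "window_incidices": [[i for i in range(1_000)]],  # one row
    "embeddings": [[0.0] * 1_000],                    # one row
}

ds = Dataset.from_dict(data)
print(ds.num_rows)  # -> 1

# This tiny placeholder saves fine; with the real ~750 MB row it raised:
# IndexError: Index 1 out of range for dataset of size 1.
ds.save_to_disk("my_dataset")
```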

It is as though one part of the datasets code thinks there is only 1 shard while another part thinks there should be 2 shards.

I found that if I specify num_shards=1, save_to_disk SOMETIMES works. If I specify max_shard_size instead, the IndexError is still raised.
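Roughly, the calls I tried look like this (the path and shard size are placeholders):

```python
# Sometimes works, sometimes silently writes nothing (see below):
ds.save_to_disk("my_dataset", num_shards=1)

# Still raises the IndexError:
ds.save_to_disk("my_dataset", max_shard_size="500MB")
```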

When num_shards=1 is set and the save does NOT work, there is no error message; the cell finishes running, but the directory that should contain the saved files is empty.

As seen in the screenshot, the dataset contains only a single row (which is correct), but note that the data in the ‘window_incidices’ and ‘embeddings’ columns is very large. When num_shards=1 is set and the dataset does save to disk, the folder size is around 750 MB.

Python 3.8.10
Datasets library 2.9.0

After restarting the kernel and running it again, I was able to save with num_shards=1, but without num_shards it still fails with the IndexError.

Has this error been resolved?
I ran into this error too.

At least in my case, the index errors when saving a dataset with only one row were caused by the size of the row exceeding max_shard_size (the default is 500 MB).

When the size of the dataset exceeds max_shard_size, the dataset is split into multiple shards. However, sharding seems to be done row-wise: the writer expects at least as many rows as there are shards. For a dataset with a single row whose size is larger than max_shard_size, this causes the indexing error. I suspect the silent failure is caused by the conflict between num_shards=1 and size(row) > max_shard_size.
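To illustrate the arithmetic with the numbers from the original post (a simplified sketch, not the actual datasets source code):

```python
# Simplified sketch of why a single oversized row breaks row-wise sharding
# (illustrative only, not the real datasets implementation).
dataset_nbytes = 750 * 1024**2   # ~750 MB, as reported above
max_shard_size = 500 * 1024**2   # 500 MB default
num_rows = 1

# The shard count is derived from the total byte size...
num_shards = int(dataset_nbytes / max_shard_size) + 1   # -> 2

# ...but shards are cut along row boundaries, so a 1-row dataset has
# nothing to put into shard index 1.
for shard_index in range(num_shards):
    if shard_index >= num_rows:
        raise IndexError(f"Index {shard_index} out of range for dataset of size {num_rows}.")
```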

TL;DR: either increase max_shard_size so that it is larger than any individual row, or split your data into multiple rows.
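In code, with ds being the dataset from the sketch above (path and size are placeholders):

```python
# Option 1: raise max_shard_size above the size of the largest row.
ds.save_to_disk("my_dataset", max_shard_size="2GB")

# Option 2: restructure the data so that no single row exceeds max_shard_size,
# e.g. store one embedding per row instead of packing everything into one row,
# then save with the default settings.
```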