IndexError using save_to_disk

In my notebook, I generate a Python dictionary with data, which I convert to a dataset using Dataset.from_dict(). When I call save_to_disk on the dataset, I receive “IndexError: Index 1 out of range for dataset of size 1.” (see below for an image of the full error)
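For context, here is a minimal sketch of the workflow (the data below is a tiny hypothetical placeholder for the real columns, and "my_dataset" is a placeholder path):

```python
from datasets import Dataset

# Hypothetical placeholder for the real data: a single row whose columns
# hold very large lists (the real row was roughly 750 MB).
data = {
    "window_incidices": [[i for i in range(1_000)]],  # one row
    "embeddings": [[0.0] * 1_000],                    # one row
}

ds = Dataset.from_dict(data)
print(ds.num_rows)  # -> 1

# This tiny placeholder saves fine; with the real ~750 MB row it raised:
# IndexError: Index 1 out of range for dataset of size 1.
ds.save_to_disk("my_dataset")
```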

It is as though one part of the datasets code thinks there is only 1 shard while another part thinks there should be 2 shards.

I found that if I specify num_shards=1, save_to_disk SOMETIMES works. If I specify max_shard_size instead, the IndexError is still raised.
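Roughly, the calls I tried look like this (the path and shard size are placeholders):

```python
# Sometimes works, sometimes silently writes nothing (see below):
ds.save_to_disk("my_dataset", num_shards=1)

# Still raises the IndexError:
ds.save_to_disk("my_dataset", max_shard_size="500MB")
```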

When num_shards=1 is set and the save does NOT work, there is no error message; the cell finishes running, but the directory that should contain the saved files is empty.

As seen in the screenshot, the dataset contains only a single row (which is correct), but note that the data in the ‘window_incidices’ and ‘embeddings’ columns is very large. When num_shards=1 is set and the dataset does save to disk, the folder size is around 750 MB.

Python 3.8.10
Datasets library 2.9.0

After restarting the kernel and running it again, I was able to save with num_shards=1, but without num_shards it still fails with the IndexError.

Has this error been resolved?
I ran into this error too.

At least in my case, the index errors when saving a dataset with only one row were caused by the size of the row exceeding max_shard_size (the default is 500 MB).

When the size of the dataset exceeds max_shard_size, the dataset is split into multiple shards. However, sharding seems to be done row-wise: the writer expects at least as many rows as there are shards. For a dataset with a single row whose size is larger than max_shard_size, this causes the indexing error. I suspect the silent failure is caused by the conflict between num_shards=1 and size(row) > max_shard_size.
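To illustrate the arithmetic with the numbers from the original post (a simplified sketch, not the actual datasets source code):

```python
# Simplified sketch of why a single oversized row breaks row-wise sharding
# (illustrative only, not the real datasets implementation).
dataset_nbytes = 750 * 1024**2   # ~750 MB, as reported above
max_shard_size = 500 * 1024**2   # 500 MB default
num_rows = 1

# The shard count is derived from the total byte size...
num_shards = int(dataset_nbytes / max_shard_size) + 1   # -> 2

# ...but shards are cut along row boundaries, so a 1-row dataset has
# nothing to put into shard index 1.
for shard_index in range(num_shards):
    if shard_index >= num_rows:
        raise IndexError(f"Index {shard_index} out of range for dataset of size {num_rows}.")
```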

TL;DR: either increase max_shard_size so that it is larger than any individual row, or split your data into multiple rows.
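In code, with ds being the dataset from the sketch above (path and size are placeholders):

```python
# Option 1: raise max_shard_size above the size of the largest row.
ds.save_to_disk("my_dataset", max_shard_size="2GB")

# Option 2: restructure the data so that no single row exceeds max_shard_size,
# e.g. store one embedding per row instead of packing everything into one row,
# then save with the default settings.
```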