In my notebook, I build a Python dictionary of data and convert it to a dataset with Dataset.from_dict(). When I call save_to_disk on the dataset, I get “IndexError: Index 1 out of range for dataset of size 1.” (see the image below for the full error)
It is as though part of the datasets code thinks there is only 1 shard and another part of the datasets code thinks there should be 2 shards.
I found that if I specify num_shards=1, save_to_disk SOMETIMES works. If I specify max_shard_size, I still get the IndexError.
When num_shards=1 is set and the save does NOT work, there is no error message: the notebook cell finishes running, but the directory that should contain the saved files is empty.
As the screenshot shows, the dataset contains only a single row (which is correct), but the data in the ‘window_incidices’ and ‘embeddings’ columns is very large. When num_shards=1 is set and the dataset does get saved to disk, the folder is around 750 MB.
Datasets library version: 2.9.0
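To illustrate the "1 shard vs. 2 shards" suspicion above, here is a small pure-Python sketch of the arithmetic that could produce it. It is only a guess at the mechanism, not the datasets library's actual code: the function names are hypothetical, and both the 500 MB default shard size and the "bytes divided by shard size, plus one" rule are assumptions. The 750 MB figure and the single row come from the description above.

```python
# Hypothetical model, NOT the datasets library's implementation.
# Assumption: the shard count is derived from the estimated byte size,
# without checking how many rows are available to fill those shards.

MB = 1000 ** 2


def estimate_num_shards(dataset_nbytes, max_shard_size=500 * MB):
    """Assumed rule: one shard per max_shard_size bytes, plus one."""
    return dataset_nbytes // max_shard_size + 1


def shard_row_counts(num_rows, num_shards):
    """Split num_rows into num_shards contiguous pieces, as evenly as possible."""
    base, extra = divmod(num_rows, num_shards)
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


num_rows = 1                # the dataset has a single row
dataset_nbytes = 750 * MB   # approximate size on disk reported above

num_shards = estimate_num_shards(dataset_nbytes)
row_counts = shard_row_counts(num_rows, num_shards)

print(num_shards)   # 2
print(row_counts)   # [1, 0]
```

Under these assumptions the size-based estimate asks for 2 shards, but a 1-row dataset can only fill one of them, so the second shard has no rows to take. That would be consistent with an "Index 1 out of range for dataset of size 1" error and with num_shards=1 avoiding it, but it does not explain the cases where the save silently produces an empty directory.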