Cnn_dailymail dataset loading problem with Colab

Over the past few days, the cnn_dailymail dataset has rarely downloaded successfully.

import datasets
test_dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test")

Most of the time when I try to load this dataset using Colab, it throws a “Not a directory” error:

NotADirectoryError: [Errno 20] Not a directory: '/root/.cache/huggingface/datasets/downloads/1bc05d24fa6dda2468e83a73cf6dc207226e01e3c48a507ea716dc0421da583b/cnn/stories'

I don’t know why this happens or what the exact problem is.

I end up waiting hours or days until I can load the dataset again.

Could you help me solve this problem, or show me how to save the dataset locally once it becomes available, so that next time I can load it from my Drive instead?

Thank you in advance

I would try either streaming the dataset, or clearing the cache, mounting Drive, and letting it save under ‘/content’.
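A minimal sketch of the two options above, assuming the `datasets` library is installed and, for the second option, that Google Drive is already mounted in Colab. The path `DRIVE_COPY` and both function names are hypothetical, chosen only for illustration. Note that streaming still reads from the original host, so it may run into the same quota limit.

```python
import os

# Hypothetical location on a mounted Google Drive; adjust to your own layout.
DRIVE_COPY = "/content/drive/MyDrive/cnn_dailymail_3.0.0_test"


def stream_test_split():
    """Option 1: iterate lazily instead of downloading and preparing everything first."""
    import datasets  # imported lazily so this file loads even without the library

    return datasets.load_dataset(
        "cnn_dailymail", "3.0.0", split="test", streaming=True
    )


def load_test_split(local_copy=DRIVE_COPY):
    """Option 2: reuse a saved copy if present, else download once and save it."""
    import datasets

    if os.path.isdir(local_copy):
        # A previous successful download was saved here; no download is needed.
        return datasets.load_from_disk(local_copy)

    dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test")
    dataset.save_to_disk(local_copy)  # persists the split for later sessions
    return dataset
```

In Colab, Drive can be mounted beforehand with `from google.colab import drive` followed by `drive.mount("/content/drive")`. After one successful call to `load_test_split()`, later sessions read the saved copy from Drive rather than downloading again.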

This problem has been reported before; see “Unable to load 'cnn_dailymail' dataset” (Issue #3465 in huggingface/datasets on GitHub).

This is caused by too many downloads from Google Drive, where the data is hosted; if you try again during a less busy period, it should work. The Datasets team is looking into this and will provide a fix, probably by hosting the data somewhere else.
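Until a fix lands, "try again later" can be automated. The following is a rough sketch only: `load_with_retry` is a hypothetical helper, not part of `datasets`, and the wait times are arbitrary assumptions rather than values tied to any documented quota window.

```python
import time


def load_with_retry(load_fn, attempts=5, first_wait=60.0, sleep=time.sleep):
    """Call load_fn until it succeeds, doubling the wait after each failure."""
    wait = first_wait
    for attempt in range(1, attempts + 1):
        try:
            return load_fn()
        except (NotADirectoryError, ConnectionError) as err:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed ({err!r}); retrying in {wait:.0f}s")
            sleep(wait)
            wait *= 2  # back off so we do not add to the load on the host
```

It could be used as `load_with_retry(lambda: datasets.load_dataset("cnn_dailymail", "3.0.0", split="test", download_mode="force_redownload"))`. The `download_mode="force_redownload"` argument is included on the assumption that a failed download may remain in the cache and be reused on the next attempt.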

Thank you @merve and @nielsr. Unfortunately, @merve, that approach didn’t solve the problem.
The issue seems hard to work around because the dataset is not hosted by Hugging Face, so we are bound by the Google Drive download quota, as @nielsr mentioned.

I hope Hugging Face gets permission to host this dataset or finds some other solution, because this is really wasting our time.

Best Regards.

It seems that this copy of the dataset has fixed the problem.
@merve @nielsr
