Error loading dataset

NonMatchingChecksumError: Checksums didn't match for dataset source files:

Hi there,
Any idea why I am getting this error when trying to download the gigaword dataset using load_dataset?

Here is my code:
dataset = load_dataset('gigaword')

Hi! I've opened a PR with the fix: Fix gigaword download url by mariosasko · Pull Request #3775 · huggingface/datasets · GitHub. After it is merged, you can download the updated script as follows:

from datasets import load_dataset
dataset = load_dataset("gigaword", revision="master")

Thank you :slight_smile:

Hi Mario,
Do you happen to know when it will likely be merged? Someone on the GitHub thread said he thinks it's already been fixed in another issue, but I'm still getting the same error.

Hi! Follow this comment of mine to fix the issue: Checksums didn't match for dataset source · Issue #3792 · huggingface/datasets · GitHub


Thank you so much. It works now.

The root cause of this issue is indeed a change in the Google Drive service. Since that change, our Datasets library downloads the Google Drive virus scan warning page instead of the data file, hence the checksum error.

We have already fixed the root cause: Fix Google Drive URL to avoid Virus scan warning by albertvillanova · Pull Request #3787 · huggingface/datasets · GitHub
This fix will be available through PyPI after our next library release (in the coming days).

In the meantime, you can incorporate this “fix” by installing our library from the GitHub master branch:

pip install git+https://github.com/huggingface/datasets#egg=datasets
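
If you want to double-check that the install from master is the one actually in use (just a quick sanity check, not part of the fix itself), you can print the installed version; builds installed from the master branch usually carry a dev suffix:

import datasets

# Installs from the master branch usually report a version like "x.y.z.dev0"
print(datasets.__version__)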

Then, if you had previously tried to load the data and got the checksum error, you should force the redownload of the data (before the fix, you just downloaded and cached the virus scan warning page, instead of the data file):

load_dataset("...", download_mode="force_redownload")
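
For the gigaword case discussed above, for example, the full call would look roughly like this (assuming the library version with the fix is installed):

from datasets import load_dataset

# Re-download so the cached virus scan warning page is replaced by the real source files
dataset = load_dataset("gigaword", download_mode="force_redownload")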


Hi, I am currently having the same issue with the yelp_review_full dataset. Is the fix already available? I have tried updating the transformers and datasets libraries, but that did not fix it.