"NonMatchingChecksumError: Checksums didn't match for dataset source files:"
Hi There,
Any idea why I am getting this error when trying to download the gigaword dataset using load_dataset?
Here is my code:
dataset = load_dataset("gigaword")
Hi! I've opened a PR with the fix: Fix gigaword download url by mariosasko · Pull Request #3775 · huggingface/datasets · GitHub. After it is merged, you can download the updated script as follows:
from datasets import load_dataset
dataset = load_dataset("gigaword", revision="master")
Hi Mario,
Do you happen to know when it is likely to be merged? Someone on the GitHub issue said he thinks it's already been fixed in another issue, but I'm still getting the same error.
Thank you so much. It works now
The root cause of this issue is indeed a change in Google Drive's service. Since that change, our Datasets library downloads the Google Drive virus warning page instead of the data file, hence the checksum error.
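To illustrate why this produces a checksum error: the library records a checksum for the real data file, so when Google Drive serves its virus-scan warning page instead, the hash of the downloaded bytes no longer matches. The sketch below is a simplified illustration using only the standard library; `verify_checksum`, the example byte strings, and raising `ValueError` are assumptions for this sketch, not the actual internals of the datasets library.

```python
import hashlib

def verify_checksum(downloaded_bytes: bytes, expected_sha256: str) -> None:
    """Raise if the downloaded content does not match the recorded checksum."""
    actual = hashlib.sha256(downloaded_bytes).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"Checksums didn't match: expected {expected_sha256}, got {actual}"
        )

# The recorded checksum corresponds to the real data file...
data = b"the real gigaword archive bytes"
expected = hashlib.sha256(data).hexdigest()
verify_checksum(data, expected)  # matches, no error

# ...but Google Drive served its virus-scan warning page instead,
# so the downloaded bytes hash to something else entirely:
warning_page = b"<html>Google Drive can't scan this file for viruses...</html>"
try:
    verify_checksum(warning_page, expected)
except ValueError as e:
    print(e)
```

This is why updating the download URL (so the real file is fetched again) resolves the error.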
We have already fixed the root cause: Fix Google Drive URL to avoid Virus scan warning by albertvillanova ¡ Pull Request #3787 ¡ huggingface/datasets ¡ GitHub
This fix will be available through PyPI after our next library release (in the coming days).
In the meantime, you can incorporate this "fix" by installing our library from the GitHub master branch:
pip install git+https://github.com/huggingface/datasets#egg=datasets
Then, if you had previously tried to load the data and got the checksum error, you should force the redownload of the data (before the fix, you just downloaded and cached the virus scan warning page, instead of the data file):
load_dataset("...", download_mode="force_redownload")
Hi, I am currently having the same issue with the yelp_review_full dataset. Is the fix already available? I have tried to update the transformers and datasets libraries, but that did not fix it.