Unable to load mozilla-foundation/common_voice_8_0

Hello there!
I need a little help. I’m trying to load:

mozila_voice = load_dataset("mozilla-foundation/common_voice_8_0", "pt", split="train+test+validation")

mozila_voice

It downloads the dataset (around 3.1 GB), but when I try to access it, this is what it shows:

Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
    num_rows: 0
})

Zero rows, I can’t figure out what is going on.
Thank you in advance
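
For anyone hitting a similar zero-row result, here is a minimal sketch of a per-split check that can show whether every split is empty or only some of them. The helper name `empty_splits` and the stand-in data are hypothetical and not part of `datasets`; with the real library, the mapping would be the `DatasetDict` returned when `load_dataset` is called without a `split=` argument.

```python
from typing import List, Mapping, Sized


def empty_splits(splits: Mapping[str, Sized]) -> List[str]:
    """Return the names of the splits that contain zero rows."""
    return [name for name, split in splits.items() if len(split) == 0]


# Stand-in data so the sketch runs without downloading anything.
# With the real library, `splits` would be the DatasetDict returned by
#   load_dataset("mozilla-foundation/common_voice_8_0", "pt", use_auth_token=True)
# which maps split names to Dataset objects that support len().
fake_splits = {"train": [1, 2, 3], "test": [], "validation": [4]}
print(empty_splits(fake_splits))  # prints ['test']
```

If every split comes back empty, the problem is more likely in the cached, prepared data than in the way the splits were combined.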

Hmm, strange. To me it sounds like there is a problem with the cache. Also, you normally need to pass use_auth_token=True:

mozila_voice = load_dataset("mozilla-foundation/common_voice_8_0", "pt", split="train+test+validation", use_auth_token=True)

mozila_voice

Since it’s a gated access dataset.

Also cc @lhoestq

Thank you @patrickvonplaten .

The first time I downloaded the dataset, I used use_auth_token=True.
Since then it has not been necessary, as the message says:

Using the latest cached version of the module from C:\..\..\.cache\huggingface\modules\datasets_modules\datasets\mozilla-foundation--common_voice_8_0\ (last modified on Thu Mar 17 19:11:45 2022) since it couldn't be found locally at mozilla-foundation/common_voice_8_0., or remotely on the Hugging Face Hub.
Reusing dataset common_voice (C:\.

I know the dataset is stored on my local machine; I just can’t figure out why I cannot access the entries.

Could be an issue on Windows. Which version of datasets are you using?

You can also try this to regenerate the dataset:

load_dataset(..., download_mode="force_redownload")
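
Before forcing a redownload, it can help to see what is actually in the cache. The sketch below resolves the datasets cache directory and lists its top-level entries. It assumes the usual environment variables (`XDG_CACHE_HOME`, `HF_HOME`, `HF_DATASETS_CACHE`) and the default `~/.cache/huggingface/datasets` layout; both helper names are hypothetical, and the exact resolution order in your installed version may differ.

```python
import os
from pathlib import Path
from typing import List


def default_datasets_cache() -> Path:
    """Resolve the datasets cache directory (assumed default layout)."""
    xdg_cache = os.getenv("XDG_CACHE_HOME", "~/.cache")
    hf_home = os.path.expanduser(
        os.getenv("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    )
    # An explicit HF_DATASETS_CACHE takes precedence over the derived default.
    return Path(os.getenv("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")))


def cached_entries(cache_dir: Path) -> List[str]:
    """List the top-level entries in the cache, or [] if it does not exist."""
    if not cache_dir.is_dir():
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())


cache_dir = default_datasets_cache()
print(cache_dir)
print(cached_entries(cache_dir))
```

A folder for the dataset that exists but holds very little data would be consistent with the zero-row result above.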

Edit 1: It seems to be something on my machine. I tried it on Colab and it runs OK. I have not figured out what the cause is, though.

Thanks @lhoestq

I’ve tried the common_voice dataset, and it worked. The dataset was downloaded and I can access its features.

datasets.__version__ '2.0.0'

After using force_redownload I got this error:

mozila_voice = load_dataset("mozilla-foundation/common_voice_8_0", "pt",use_auth_token=True, split="test+train+validation", download_mode="force_redownload")

---------------------------------------------------------------------------
ExpectedMoreDownloadedFiles               Traceback (most recent call last)
c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\hf-asr-comp\pt-huggingface\learn.ipynb Cell 2' in <cell line: 1>()
----> 1 mozila_voice = load_dataset("mozilla-foundation/common_voice_8_0", "pt",use_auth_token=True, split="test+train+validation", download_mode="force_redownload")

File c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\vevn-hug\lib\site-packages\datasets\load.py:1687, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1684 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1686 # Download and prepare data
-> 1687 builder_instance.download_and_prepare(
   1688     download_config=download_config,
   1689     download_mode=download_mode,
   1690     ignore_verifications=ignore_verifications,
   1691     try_from_hf_gcs=try_from_hf_gcs,
   1692     use_auth_token=use_auth_token,
   1693 )
   1695 # Build dataset for splits
   1696 keep_in_memory = (
   1697     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1698 )

File c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\vevn-hug\lib\site-packages\datasets\builder.py:605, in DatasetBuilder.download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    603     logger.warning("HF google storage unreachable. Downloading and preparing it from source")
    604 if not downloaded_from_gcs:
--> 605     self._download_and_prepare(
    606         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    607     )
    608 # Sync info
    609 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\vevn-hug\lib\site-packages\datasets\builder.py:1104, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verify_infos)
   1103 def _download_and_prepare(self, dl_manager, verify_infos):
-> 1104     super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)

File c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\vevn-hug\lib\site-packages\datasets\builder.py:676, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    674 # Checksums verification
    675 if verify_infos:
--> 676     verify_checksums(
    677         self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
    678     )
    680 # Build splits
    681 for split_generator in split_generators:

File c:\Users\brito\OneDrive\Documentos\1 - Data Science\5 - Projetos\13 - Hugging face\vevn-hug\lib\site-packages\datasets\utils\info_utils.py:33, in verify_checksums(expected_checksums, recorded_checksums, verification_name)
     31     return
     32 if len(set(expected_checksums) - set(recorded_checksums)) > 0:
---> 33     raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
     34 if len(set(recorded_checksums) - set(expected_checksums)) > 0:
     35     raise UnexpectedDownloadedFile(str(set(recorded_checksums) - set(expected_checksums)))

ExpectedMoreDownloadedFiles:
{‘https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-8.0-2022-01-19/cv-corpus-8.0-2022-01-19-pt.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQ3GQRTO3BNYTFWZ4%2F20220318%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220318T133529Z&X-Amz-Expires=43200&X-Amz-Security-Token=FwoGZXIvYXdzEFcaDHIWtev44a5C%2BGwOoyKSBFZ0l9VgNCI4JnHPYsigOkO4ZBUy6PDBOEieyTMUg%2BCoPquF7YH6TiM1CYezVId%2B78DkQ2bNWCyzEPYA1fsl43rhgV8QXwvFJW2rhn1jarmgSZMkKEUFl26HMnKvMDHLOkgpl%2BXIiSCKJi6gzPeGPjKcTg7KIXpJqiZfOYtn74nQ54VvUx6pbMGfc%2Bc5gdPO14tOW4ebf8SEu0ky%2FUxw60Sf14YvwFBf4uYaXluyD1Fy0P%2BAjc0VDoTvdhGERyonDLlrGOqVEq%2F674Xlp1ZRm39SsJyB3LchU7aar8uhebtQMCBfIyeMWBeOBs8npYrR4WuDiSVCMZohcQzaTPTQ%2BTs4%2BCM7eq4XsN6bYQoTae1HHybqELeNRYi3NY%2BVGkbMylnb6S0vGuQaPBIn%2F96QfZtLCjpJv0IEBtsble8H4MLYxohf0zqHrDFKOS01WD%2FiGuzKJ5k0BHlZALui76yDqm%2BgzT%2Fro38%2Fns2tKF7fMKn%2FKuGWR6A4Rw6VWBXqgQ9r%2FhMubTfpIoEhjaThFKCkTZ8TC9pkK3b32Wsv4fCkciLHs3tg6K9KbDYvkdJTDNIJBVwwEb0wLA80ao0DHzT8ff5zlj9JmhPJPRM26Hm97uvt6YyTmnhcu4aw%2FC7JfS6ydPTxI3X%2FABdgOkFqlkSPnq%2BNPeMoarV98imzCAb%2BTziXS8crWXauM0Wo%2BAnGHYTSsLCJKOWK0pEGMirOcsMiFrqID%2BCX2ri32TiWtsGy04%2FJjbRx1z8M5QbU7Fou5lQU8FV%2BOG0%3D&X-Amz-Signature=fdbc8613d440b6b0b2bb3254cb364c03ad4866f7331701258ca73bc41c77ee85&X-Amz-SignedHeaders=host’}
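
To make the error easier to read, here is a simplified re-implementation of the key comparison visible in lines 32 to 35 of the traceback. It is a sketch, not the library's code: the exception classes are local stand-ins, the function name `verify_checksum_keys` is hypothetical, and the URLs and checksum values are made up. One possible reading, not confirmed in this thread, is that the key in the error is a signed URL whose query string (`X-Amz-Date`, `X-Amz-Signature`, and so on) changes between requests, so an exact key comparison would report the expected file as missing.

```python
class ExpectedMoreDownloadedFiles(Exception):
    """Local stand-in for the exception of the same name in the traceback."""


class UnexpectedDownloadedFile(Exception):
    """Local stand-in for the exception of the same name in the traceback."""


def verify_checksum_keys(expected: dict, recorded: dict) -> None:
    """Compare only the keys (download URLs), as the traceback lines do."""
    missing = set(expected) - set(recorded)
    if missing:
        raise ExpectedMoreDownloadedFiles(str(missing))
    extra = set(recorded) - set(expected)
    if extra:
        raise UnexpectedDownloadedFile(str(extra))


# Hypothetical URLs: same file, different signed query strings.
expected = {
    "https://example.com/cv-pt.tar.gz?X-Amz-Date=20220318T133529Z": {"num_bytes": 1},
}
recorded = {
    "https://example.com/cv-pt.tar.gz?X-Amz-Date=20220319T090000Z": {"num_bytes": 1},
}

try:
    verify_checksum_keys(expected, recorded)
except ExpectedMoreDownloadedFiles as err:
    print("Expected file reported as missing:", err)
```

Because the comparison is on the full URL string, two URLs that point at the same archive still count as different keys when their query strings differ.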