Common Voice: Load validated split Hugging Face data?

KathyReid · May 31, 2023, 5:46am

@bozden is correct. I did some more digging into this issue.

HF does not pull down the raw Common Voice dataset; there’s a Python layer over the top that splits the data by language and by split. It does include invalidated.tsv as a default split, but does not include validated.tsv as a split. Therefore you cannot load this split with the load_dataset function, as it does not exist on the HF dataset.
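As a rough sketch of how you could confirm this yourself, the snippet below asks the Hub which splits a dataset exposes and checks whether a given name is among them. The dataset ID and language code are left as parameters because the exact Common Voice version is an assumption on my part, and the split names in the usage note other than invalidated are illustrative rather than quoted from the loading script. It also assumes a version of the datasets library that still supports script-based datasets, and that you are logged in if the dataset is gated.

```python
from typing import Iterable, List


def has_split(requested: str, available: Iterable[str]) -> bool:
    """Return True if `requested` is one of the exposed split names."""
    return requested in set(available)


def exposed_splits(dataset_id: str, lang: str) -> List[str]:
    """Ask the Hub which splits a dataset config exposes (needs network access)."""
    # Imported lazily so has_split() works even without `datasets` installed.
    from datasets import get_dataset_split_names

    return get_dataset_split_names(dataset_id, lang)
```

For example, has_split("validated", ["train", "validation", "test", "other", "invalidated"]) is False, while the same call with "invalidated" is True. With network access you would pass the result of exposed_splits(...) as the second argument instead of a hand-written list.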

The specific lines of code this is happening in are:

In this line of code, the splits are defined. They do not include validated.tsv.

I tried something slightly devious: I wondered whether, even if validated.tsv is not available via the interface, it is still available on disk.

The format that HF uses to access the sharded data files is on this line:
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"

So, I constructed a URL on the assumption that the validated split had been stored in the same layout, and tried to see if it would load. It did not, and returned "Entry not found".
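Here is a minimal sketch of that experiment, for anyone who wants to repeat it. The template is the one quoted above; the base URL is not shown in this post, so it is a parameter here and the value in the usage note is a placeholder, not the real one. The url_exists helper is my own hypothetical name and uses only the standard library.

```python
import urllib.error
import urllib.request

# Same pattern as _AUDIO_URL, with the base URL made an explicit parameter.
AUDIO_URL_TEMPLATE = "{base}audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"


def shard_url(base_url: str, lang: str, split: str, shard_idx: int = 0) -> str:
    """Build the URL for one audio shard of a given language and split."""
    return AUDIO_URL_TEMPLATE.format(
        base=base_url, lang=lang, split=split, shard_idx=shard_idx
    )


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server returned the file.

    Any HTTP error (404, but also 401/403 on gated repos) counts as False.
    Network failures are not caught and will raise.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except urllib.error.HTTPError:
        return False
```

Calling shard_url("https://example.invalid/", "en", "validated") gives https://example.invalid/audio/en/validated/en_validated_0.tar. Passing a URL built from the real base to url_exists is roughly what I did by hand; note that a False result on a gated dataset may only mean you are not authenticated.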

The problem here is that the validated.tsv hasn’t been downloaded as part of the Hugging Face implementation of Common Voice.
