Common Voice: Load validated split Hugging Face data?

RikRaes · May 30, 2023, 12:36pm

When loading the data from Hugging Face (mozilla-foundation/common_voice_13_0), it does not seem possible to load the validated split for the Dutch language, as shown in the image below. I use the following lines of code to load the data.

from datasets import load_dataset
load_dataset("mozilla-foundation/common_voice_13_0", "nl", streaming=False)

I would like to load all 86798 instances, which can be downloaded from the Common Voice project itself, using load_dataset(), but this does not seem possible. Furthermore, Hugging Face indicates that the ‘nl’ dataset should have this number of instances in the validated split, yet I cannot seem to load it. When I attempt this for other languages, no validated split is offered either.

stevhliu · May 30, 2023, 4:22pm

Hi! If you want to only load the validation split, you can specify that in load_dataset:

from datasets import load_dataset

load_dataset("mozilla-foundation/common_voice_13_0", "nl", split="validation", streaming=False)

bozden · May 30, 2023, 6:37pm

I think @RikRaes is asking for validated.tsv, which is provided in the Common Voice dataset and includes all validated recordings - not the validation (i.e. default dev) split.

HF datasets only provide default splits from CV…

KathyReid · May 31, 2023, 5:46am

@bozden is correct. I did some more digging into this issue.

HF does not pull down the raw Common Voice dataset - there’s a Python layer over the top that divides the data by language and by split. It does include the invalidated.tsv as a default split, but does not include the validated.tsv as a split. Therefore you cannot load this split with the load_dataset function, as it does not exist on the HF dataset.

The specific lines of code this is happening in are:

In this line of code, the splits are defined. They do not include validated.tsv.

I tried something slightly devious, and wondered whether, even if the validated.tsv is not available via the interface, it is still available on disk.

The format that HF uses to access the sharded data files is on this line:
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"

So, I constructed a URL that assumed that validated.tsv had been downloaded to disk, and tried to see if it would load - it did not, and returned Entry not found.

The problem here is that the validated.tsv hasn’t been downloaded as part of the Hugging Face implementation of Common Voice.

RikRaes · June 5, 2023, 2:25pm

Thanks @bozden. I have followed your tips (here and on the Mozilla forum) and now successfully have access to the validated split through a private repo, thanks a lot!
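
For anyone who lands here with the same problem, here is a rough sketch of reading validated.tsv from a Common Voice release that has been downloaded and extracted locally, using only the standard library. The column names (path, sentence) and the clips directory are assumptions about the release layout, and read_validated and clip_paths are hypothetical helper names; the private repo setup mentioned above is not shown.

```python
import csv
from pathlib import Path


def read_validated(tsv_path):
    """Return the rows of a tab-separated metadata file as a list of dicts."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside sentences untouched.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def clip_paths(rows, clips_dir):
    """Map each row to its audio file, assuming a 'path' column with file names."""
    return [Path(clips_dir) / row["path"] for row in rows]


if __name__ == "__main__":
    # Hypothetical location of an extracted Dutch release; adjust as needed.
    root = Path("cv-corpus/nl")
    tsv = root / "validated.tsv"
    if tsv.exists():
        rows = read_validated(tsv)
        print(len(rows), "validated rows")
        print(clip_paths(rows[:3], root / "clips"))
```

The row count printed this way can be compared against the number of validated instances reported for the language.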
