Common Voice: Load validated split Hugging Face data?

When loading the data from Hugging Face (mozilla-foundation/common_voice_13_0), it does not seem possible to load the validated split for the Dutch language, as shown in the image below. I use the following lines of code to load the data.

from datasets import load_dataset
load_dataset("mozilla-foundation/common_voice_13_0", "nl", streaming=False)

I would like to load all 86798 instances, which can be downloaded from the Common Voice project itself, using load_dataset(), but this does not seem possible. Furthermore, Hugging Face indicates that the ‘nl’ dataset should have this number of instances in the validated split, but I cannot seem to load it. For other languages, there is no option for a validated split either.

Hi! If you want to only load the validation split, you can specify that in load_dataset:

from datasets import load_dataset

load_dataset("mozilla-foundation/common_voice_13_0", "nl", split="validation", streaming=False)

I think @RikRaes is asking for validated.tsv, which is provided in the Common Voice dataset and includes all validated recordings, not the validation (i.e. default dev) split.

HF datasets only provide default splits from CV…

@bozden is correct. I did some more digging into this issue.

HF does not pull down the raw Common Voice dataset. There’s a Python layer on top that divides the data by language and by split. It does include invalidated.tsv as a default split, but does not include validated.tsv as a split. Therefore you cannot load this split with the load_dataset function, as it does not exist in the HF dataset.
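To make the point above concrete, here is a minimal sketch for checking which split names are exposed. The list of split names is my assumption based on the description in this thread (the default splits plus invalidated), and list_hf_splits is a hypothetical helper that wraps get_dataset_split_names from the datasets library. It needs network access and accepted dataset terms, and may behave differently depending on your datasets version.

```python
def has_split(available, wanted):
    """Return True if `wanted` is one of the split names in `available`."""
    return wanted in set(available)


def list_hf_splits(repo_id="mozilla-foundation/common_voice_13_0", config="nl"):
    """Ask the Hub which splits exist for one language config.

    Hypothetical helper: requires the `datasets` package, network access and
    an authenticated session that has accepted the dataset terms.
    """
    from datasets import get_dataset_split_names  # imported lazily on purpose

    return get_dataset_split_names(repo_id, config)


# Assumed split names (not fetched from the Hub): the default Common Voice
# splits plus "invalidated", as described in this thread.
assumed_splits = ["train", "validation", "test", "other", "invalidated"]

print(has_split(assumed_splits, "validation"))  # True
print(has_split(assumed_splits, "validated"))   # False
```

If you run list_hf_splits() yourself, you can pass its result to has_split instead of the hard-coded list.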

The specific lines of code this is happening in are:

In this line of code, the splits are defined. They do not include validated.tsv.

I tried something slightly devious, and wondered whether, even if validated.tsv is not available via the interface, it is still available on disk.

The format that HF uses to access the sharded data files is on this line:
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"

So, I constructed a URL that assumed that validated.tsv had been downloaded to disk, and tried to see if it would load. It did not, and returned "Entry not found".
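A rough sketch of that experiment, for anyone who wants to repeat it. The _AUDIO_URL pattern is the one quoted above; the value of _BASE_URL is my assumption about where the dataset files are served from, and shard_url and url_exists are hypothetical helper names, not part of the loading script.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: the loading script resolves files relative to the dataset repo.
_BASE_URL = "https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0/resolve/main/"
# Pattern quoted from the loading script.
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"


def shard_url(lang, split, shard_idx=0):
    """Build the URL of one audio shard for a language and split."""
    return _AUDIO_URL.format(lang=lang, split=split, shard_idx=shard_idx)


def url_exists(url, token=None):
    """Return True if the server answers 200 for `url`.

    Any HTTP error (404 for a missing file, 401 for a gated dataset without
    a valid token) is reported as False.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = Request(url, method="HEAD", headers=headers)
    try:
        with urlopen(request) as response:
            return response.status == 200
    except HTTPError:
        return False


print(shard_url("nl", "validated"))
```

Calling url_exists(shard_url("nl", "validated"), token=...) with your own access token should reproduce the missing-file result I saw, assuming the base URL above is right.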

The problem here is that the validated.tsv hasn’t been downloaded as part of the Hugging Face implementation of Common Voice.

Thanks @bozden. I have followed your tips (here and on the Mozilla forum) and now successfully have access to the validated split through a private repo, thanks a lot!
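For anyone landing here later, this is a minimal sketch of one way to get at validated.tsv yourself. It is not necessarily the exact route described in the tips mentioned above. It assumes you have downloaded and unpacked the Common Voice release from the Common Voice project, so that a language folder contains validated.tsv and a clips directory. The function names and the folder path in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def read_validated(cv_lang_dir):
    """Read validated.tsv from an unpacked Common Voice language folder.

    Returns a list of dicts, one per recording, with an extra "audio" key
    holding the path to the clip inside the (assumed) clips/ directory.
    """
    cv_lang_dir = Path(cv_lang_dir)
    clips_dir = cv_lang_dir / "clips"
    rows = []
    with open(cv_lang_dir / "validated.tsv", newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            row["audio"] = str(clips_dir / row["path"])
            rows.append(row)
    return rows


def to_hf_dataset(rows):
    """Turn the rows into a datasets.Dataset with a decodable audio column.

    Requires the `datasets` package; imported lazily so that reading the TSV
    works without it.
    """
    from datasets import Audio, Dataset

    return Dataset.from_list(rows).cast_column("audio", Audio())
```

Usage would look something like rows = read_validated("cv-corpus/nl") followed by to_hf_dataset(rows), where the folder name is a placeholder. The resulting dataset could then be uploaded to a private repo with push_to_hub(..., private=True) if you want to load it through load_dataset later.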