Common Voice: Load validated split Hugging Face data?

@bozden is correct. I did some more digging into this issue.

HF does not pull down the raw Common Voice dataset. There’s a Python layer over the top that splits the data by language and then by split. It does include invalidated.tsv as a default split, but does not include validated.tsv as a split. Therefore you cannot load this split with the load_dataset function, as it does not exist in the HF dataset.

The specific lines of code this is happening in are:

In this line of code, the splits are defined. They do not include validated.tsv.

I tried something slightly devious, and wondered whether, even if validated.tsv is not available via the interface, it is still available on disk.

The format that HF uses to access the sharded data files is on this line:
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"

So, I constructed a URL that assumed validated.tsv had been downloaded to disk, and tried to see if it would load. It did not, and returned "Entry not found".
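
For anyone who wants to repeat the experiment, here is a minimal sketch of how such a URL can be built from the template above. The _BASE_URL value is a placeholder (the real one is defined in the dataset's loading script), and shard_url, the language code "tr" and the shard index 0 are hypothetical choices for illustration only.

```python
# Placeholder: the real _BASE_URL is defined in the dataset's loading script.
_BASE_URL = "https://example.invalid/common_voice/"

# Template copied from the loading script line quoted above.
_AUDIO_URL = _BASE_URL + "audio/{lang}/{split}/{lang}_{split}_{shard_idx}.tar"


def shard_url(lang: str, split: str, shard_idx: int = 0) -> str:
    """Fill in the shard URL template for one language, split, and shard."""
    return _AUDIO_URL.format(lang=lang, split=split, shard_idx=shard_idx)


if __name__ == "__main__":
    # A split the loader defines, followed by the one it does not.
    print(shard_url("tr", "invalidated"))
    print(shard_url("tr", "validated"))
```

With the real base URL substituted in, requesting the second URL is the check that came back with "Entry not found" for me.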

The problem here is that the validated.tsv hasn’t been downloaded as part of the Hugging Face implementation of Common Voice.
