Takes a super long time to load Common Voice

How long does it take to load the Common Voice EN subset from the Hub?

My script has been running for 30h but still can't even finish loading the en subset.

Any ideas on how to speed up the process? I already pass num_proc.

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
# config_list = ["en"]

# ds = load_dataset(ds_id, config, trust_remote_code=True, download_mode="force_redownload")
for config in config_list:
    ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8)
    print(f"Config: {config}")
    print(ds)

It’s a dataset of about 500GB…
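
(If you want to verify the size yourself, here is a small sketch using huggingface_hub; files_metadata=True is needed so that per-file sizes are populated, and since the repo is gated you may need to be logged in:)

from huggingface_hub import HfApi

# Query repo metadata only; nothing is downloaded.
info = HfApi().dataset_info("mozilla-foundation/common_voice_17_0", files_metadata=True)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"~{total_bytes / 1e9:.0f} GB in the repo")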

In my environment, it takes about 50 seconds to fetch the first example in streaming mode. Iterating through all of them would still take a long time…

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
#config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
config_list = ["en"]

for config in config_list:
    #ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8)
    ds = load_dataset(ds_id, config, trust_remote_code=True, split="train", streaming=True)
    print(f"Config: {config}")
    #print(ds)
    print(next(iter(ds)))
    #Config: en
    #Reading metadata...: 1101170it [00:48, 22515.49it/s]
    #{'client_id': 'f15d2e0fd19c04421174108a8c02c3c2ef8e76365cdcc48090b927eca6a1d7f130bd87a104ba14cb4306adab846bb0103ba741261c64d346a94e797b9d6b659e', 'path': 'en_train_0/common_voice_en_17924809.mp3', 'audio': {'path': 'en_train_0/common_voice_en_17924809.mp3', 'array': array([ 0.00000000e+00, -2.50554802e-15, -4.35167835e-16, ...,
    #    5.89446863e-05, -4.46442282e-05, -6.63674437e-05]), 'sampling_rate': 48000}, 'sentence': 'Every evening, the dogs in our neighbourhood are howling.', 'up_votes': 2, 'down_votes': 0, 'age': '', 'gender': '', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}

Is there any way to make it faster if I want a dataset that supports random access?


When using a streaming IterableDataset, the only thing I can think of is to use .shuffle and .take to pick data pseudo-randomly…
If we could load it as a normal Dataset somehow, we could simply use .select.

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
#config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
config_list = ["en"]

for config in config_list:
    ds = load_dataset(ds_id, config, trust_remote_code=True, split="train", streaming=True)
    print(f"Config: {config}")
    print("Shuffling...")
    ds = ds.shuffle(seed=42, buffer_size=10_000) # https://huggingface.co/docs/datasets/v4.0.0/stream#shuffle
    ds = ds.take(2) # https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.IterableDataset.take
    print(list(ds))

#Config: en
#Shuffling...
#Reading metadata...: 1101170it [00:32, 33788.82it/s]
#[{'client_id': '403e12b207165717054b33f2499ef897add3c30d6f23db3d59657269eb7c76136c3dc483075b9230f26bac3cbdfb749524575f2ce4a96da5a1b9dfe93834801e', 'path': 'en_train_7/common_voice_en_20044076.mp3', 'audio': {'path': 'en_train_7/common_voice_en_20044076.mp3', 'array': array([ 0.00000000e+00, -2.55246364e-16, -3.18319667e-16, ...,
#       -8.15750082e-06, -1.59247902e-05, -3.02179451e-05]), 'sampling_rate': 48000}, 'sentence': 'Public opinion in Italy was outraged.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male_masculine', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}, {'client_id': '6c602be8a0dccb7a1a888009bc2211262882f091b85a79c86d0384f6df09f7b7dde560fbdff1f7b8e51dc0944afa85bcffd45c28a333b2dc2fcfce56972e3d9f', 'path': 'en_train_7/common_voice_en_25336162.mp3', 'audio': {'path': 'en_train_7/common_voice_en_25336162.mp3', 'array': array([ 2.01526852e-14,  3.12324653e-13,  4.46545295e-13, ...,
#       -2.86631730e-04, -2.47185675e-04, -1.32337733e-04]), 'sampling_rate': 48000}, 'sentence': 'He and his second wife live on "Mirene", a -long working tugboat.', 'up_votes': 2, 'down_votes': 1, 'age': '', 'gender': '', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}]
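
(If a regular Dataset really is needed, one workaround, a sketch rather than anything shown in this thread, is to materialize a small streamed slice in memory with Dataset.from_generator; the test split and the 1,000-example slice below are arbitrary assumptions:)

from datasets import Dataset, load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
stream = load_dataset(ds_id, "en", trust_remote_code=True, split="test", streaming=True)

def first_n():
    # yield only the first 1,000 streamed examples (arbitrary size)
    yield from stream.take(1_000)

small = Dataset.from_generator(first_n)
print(small.select([3, 141, 592]))  # random access works on the materialized subset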

Use streaming=True to avoid a full download:

ds = load_dataset(ds_id, config, streaming=True)


In my case, I also want to use FAISS on a specific split, so I think I would need to load it as a normal Dataset instead of an IterableDataset.
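
(For reference, once a regular Dataset is loaded, the FAISS side could look like the sketch below, using datasets' built-in add_faiss_index / get_nearest_examples; the embed function and the 128-dim vectors are placeholders, not anything from this thread:)

import numpy as np
from datasets import load_dataset

# Assumed: load only the much smaller test split as a regular Dataset
ds = load_dataset("mozilla-foundation/common_voice_17_0", "en",
                  trust_remote_code=True, split="test")

def embed(batch):
    # placeholder encoder; swap in a real speech/text embedding model
    return {"embeddings": [np.random.rand(128).astype(np.float32)
                           for _ in batch["sentence"]]}

ds = ds.map(embed, batched=True)
ds.add_faiss_index(column="embeddings")          # requires faiss-cpu or faiss-gpu
query = np.random.rand(128).astype(np.float32)   # placeholder query vector
scores, examples = ds.get_nearest_examples("embeddings", query, k=5)
print(examples["sentence"])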

I think we can manually select the files we want to download with load_dataset, right? I think I only need the test and validated splits from the en subset.

Would you kindly show me, or give me some hints on how to do that?


we can manually select the files we want to download with load_dataset, right?

It seems possible, but skipping the formatting done by the builder script for each dataset may not produce the desired results.

I think I only need the test and validated splits from the en subset.

Yeah. Just like this:

ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8, split=["test", "validation"])
#ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8, split="test+validation") # If you want to combine them
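
(If you really do want to hand-pick files rather than splits, one option, a sketch rather than anything from this thread, is huggingface_hub.snapshot_download with allow_patterns. The patterns below are guesses at the repo layout; verify them under "Files and versions" on the Hub first:)

from huggingface_hub import snapshot_download

# Fetch only selected files from the dataset repo; no builder script runs.
# The patterns are assumptions about the repo layout; check them first.
local_dir = snapshot_download(
    repo_id="mozilla-foundation/common_voice_17_0",
    repo_type="dataset",
    allow_patterns=["audio/en/test/*", "transcript/en/test*"],
)
print(local_dir)

Note that this only downloads raw files; as mentioned above, it skips the formatting done by the builder script, so you would have to parse the TSVs and audio archives yourself.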