Takes a super long time to load Common Voice

How long does it take to load the Common Voice EN subset from the Hub?

My script has been running for 30h but still can't even finish loading the en subset.

Any ideas on how to speed up the process? I already pass num_proc.

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
# config_list = ["en"]

# ds = load_dataset(ds_id, config, trust_remote_code=True, download_mode="force_redownload")
for config in config_list:
    ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8)
    print(f"Config: {config}")
    print(ds)

It’s a dataset of about 500GB…
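
(If you want to verify the size yourself, here is a small sketch using huggingface_hub; files_metadata=True is needed so that per-file sizes are populated, and since the repo is gated you may need to be logged in:)

from huggingface_hub import HfApi

# Query repo metadata only; nothing is downloaded.
info = HfApi().dataset_info("mozilla-foundation/common_voice_17_0", files_metadata=True)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"~{total_bytes / 1e9:.0f} GB in the repo")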

In my environment, it takes about 50 seconds to fetch the first example in streaming mode. Iterating through all of them would still take a long time…

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
#config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
config_list = ["en"]

for config in config_list:
    #ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8)
    ds = load_dataset(ds_id, config, trust_remote_code=True, split="train", streaming=True)
    print(f"Config: {config}")
    #print(ds)
    print(next(iter(ds)))
    #Config: en
    #Reading metadata...: 1101170it [00:48, 22515.49it/s]
    #{'client_id': 'f15d2e0fd19c04421174108a8c02c3c2ef8e76365cdcc48090b927eca6a1d7f130bd87a104ba14cb4306adab846bb0103ba741261c64d346a94e797b9d6b659e', 'path': 'en_train_0/common_voice_en_17924809.mp3', 'audio': {'path': 'en_train_0/common_voice_en_17924809.mp3', 'array': array([ 0.00000000e+00, -2.50554802e-15, -4.35167835e-16, ...,
    #    5.89446863e-05, -4.46442282e-05, -6.63674437e-05]), 'sampling_rate': 48000}, 'sentence': 'Every evening, the dogs in our neighbourhood are howling.', 'up_votes': 2, 'down_votes': 0, 'age': '', 'gender': '', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}

Is there any way to make it faster if I want a dataset that supports random access?


When using a streaming IterableDataset, the only thing I can think of is to use .shuffle and .take to pick data pseudo-randomly…
If we could load it as a normal Dataset somehow, we could simply use .select.

from datasets import load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
#config_list = ["en", "es", "fr", "de", "it", "pt", "zh"]
config_list = ["en"]

for config in config_list:
    ds = load_dataset(ds_id, config, trust_remote_code=True, split="train", streaming=True)
    print(f"Config: {config}")
    print("Shuffling...")
    ds = ds.shuffle(seed=42, buffer_size=10_000) # https://huggingface.co/docs/datasets/v4.0.0/stream#shuffle
    ds = ds.take(2) # https://huggingface.co/docs/datasets/v4.0.0/en/package_reference/main_classes#datasets.IterableDataset.take
    print(list(ds))

#Config: en
#Shuffling...
#Reading metadata...: 1101170it [00:32, 33788.82it/s]
#[{'client_id': '403e12b207165717054b33f2499ef897add3c30d6f23db3d59657269eb7c76136c3dc483075b9230f26bac3cbdfb749524575f2ce4a96da5a1b9dfe93834801e', 'path': 'en_train_7/common_voice_en_20044076.mp3', 'audio': {'path': 'en_train_7/common_voice_en_20044076.mp3', 'array': array([ 0.00000000e+00, -2.55246364e-16, -3.18319667e-16, ...,
#       -8.15750082e-06, -1.59247902e-05, -3.02179451e-05]), 'sampling_rate': 48000}, 'sentence': 'Public opinion in Italy was outraged.', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male_masculine', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}, {'client_id': '6c602be8a0dccb7a1a888009bc2211262882f091b85a79c86d0384f6df09f7b7dde560fbdff1f7b8e51dc0944afa85bcffd45c28a333b2dc2fcfce56972e3d9f', 'path': 'en_train_7/common_voice_en_25336162.mp3', 'audio': {'path': 'en_train_7/common_voice_en_25336162.mp3', 'array': array([ 2.01526852e-14,  3.12324653e-13,  4.46545295e-13, ...,
#       -2.86631730e-04, -2.47185675e-04, -1.32337733e-04]), 'sampling_rate': 48000}, 'sentence': 'He and his second wife live on "Mirene", a -long working tugboat.', 'up_votes': 2, 'down_votes': 1, 'age': '', 'gender': '', 'accent': '', 'locale': 'en', 'segment': '', 'variant': ''}]
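
(If a regular Dataset really is needed, one workaround, a sketch rather than anything shown in this thread, is to materialize a small streamed slice in memory with Dataset.from_generator; the test split and the 1,000-example slice below are arbitrary assumptions:)

from datasets import Dataset, load_dataset

ds_id = "mozilla-foundation/common_voice_17_0"
stream = load_dataset(ds_id, "en", trust_remote_code=True, split="test", streaming=True)

def first_n():
    # yield only the first 1,000 streamed examples (arbitrary size)
    yield from stream.take(1_000)

small = Dataset.from_generator(first_n)
print(small.select([3, 141, 592]))  # random access works on the materialized subset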

Use streaming=True to avoid a full download:

ds = load_dataset(ds_id, config, streaming=True)


In my case, I also want to use FAISS on a specific split, so I think I would need to load it as a normal Dataset instead of an IterableDataset.
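
(For reference, once a regular Dataset is loaded, the FAISS side could look like the sketch below, using datasets' built-in add_faiss_index / get_nearest_examples; the embed function and the 128-dim vectors are placeholders, not anything from this thread:)

import numpy as np
from datasets import load_dataset

# Assumed: load only the much smaller test split as a regular Dataset
ds = load_dataset("mozilla-foundation/common_voice_17_0", "en",
                  trust_remote_code=True, split="test")

def embed(batch):
    # placeholder encoder; swap in a real speech/text embedding model
    return {"embeddings": [np.random.rand(128).astype(np.float32)
                           for _ in batch["sentence"]]}

ds = ds.map(embed, batched=True)
ds.add_faiss_index(column="embeddings")          # requires faiss-cpu or faiss-gpu
query = np.random.rand(128).astype(np.float32)   # placeholder query vector
scores, examples = ds.get_nearest_examples("embeddings", query, k=5)
print(examples["sentence"])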

I think we can manually select the files we want to download with load_dataset, right? I think I only need the test and validated splits from the en subset.

Would you kindly show me, or give me some hints on how to do that?


we can manually select the files we want to download with load_dataset, right?

It seems possible, but skipping the formatting done by the builder script for each dataset may not produce the desired results.

I think I only need the test and validated splits from the en subset.

Yeah. Just like this:

ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8, split=["test", "validation"])
#ds = load_dataset(ds_id, config, trust_remote_code=True, num_proc=8, split="test+validation") # If you want to combine them
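
(If you really do want to hand-pick files rather than splits, one option, a sketch rather than anything from this thread, is huggingface_hub.snapshot_download with allow_patterns. The patterns below are guesses at the repo layout; verify them under "Files and versions" on the Hub first:)

from huggingface_hub import snapshot_download

# Fetch only selected files from the dataset repo; no builder script runs.
# The patterns are assumptions about the repo layout; check them first.
local_dir = snapshot_download(
    repo_id="mozilla-foundation/common_voice_17_0",
    repo_type="dataset",
    allow_patterns=["audio/en/test/*", "transcript/en/test*"],
)
print(local_dir)

Note that this only downloads raw files; as mentioned above, it skips the formatting done by the builder script, so you would have to parse the TSVs and audio archives yourself.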