How to process the first 20k samples of a dataset without downloading all of it?

I’m trying to work with Common Voice 13, but it’s too large for my hard drive and I keep running out of space. Before you suggest using streaming=True, please read the rest.

It’s too large because I need all 60 locales. But I’m a reasonable man, and I don’t expect to work with all the samples; 20k samples per locale is enough for my purpose. I just cannot find a way to select only the first 20k samples without downloading the whole dataset first. Before you suggest using split="train[0:20000]", please read the rest.

Before telling you about the problems I’m facing, here’s what I’m trying to achieve:

from pathlib import Path

from datasets import load_dataset, concatenate_datasets

locales = ["ab", "ar", "as", "br", "ca", "cnh",
           "cs", "cv", "cy", "de", "dv", "el",
           "en","eo", "es", "et", "eu", "fa",
           "fi", "fr", "fy-NL", "ga-IE", "hi",
           "hsb", "hu", "ia", "id", "it", "ja",
           "ka", "kab", "ky", "lg", "lt", "lv",
           "mn", "mt", "nl", "or", "pa-IN",
           "pl", "pt", "rm-sursilv",
           "rm-vallader", "ro", "ru", "rw",
           "sah", "sl", "sv-SE", "ta", "th",
           "tr", "tt", "uk", "vi", "vot",
           "zh-CN", "zh-HK", "zh-TW"]

splits = ["train", "validation", "test"]

for s in splits:
    Path(f"./cv_13/{s}").mkdir(parents=True, exist_ok=True)
    mapped_datasets = []
    for l in locales:
        dataset = load_dataset("mozilla-foundation/common_voice_13_0",
                               l, split=f"{s}")
        transformed_dataset = dataset.map(process)
        mapped_datasets.append(transformed_dataset)

    combined_datasets = concatenate_datasets(mapped_datasets)
    combined_datasets.save_to_disk(f"./cv_13/{s}")

What I’m trying to achieve here is to process the dataset and save it for later use. It’s just that, due to capacity limitations, I need to do this for only the first 20k samples per locale.

Here’s my dilemma:

  1. If I go with streaming=True, then I cannot specify the 20k limit.
  2. If I go with split="train[0:20000]", it errors out for the locales that have fewer than 20k samples.
  3. If I go with both options at the same time: ValueError: Bad split: train[20000]. Available splits: ['train', 'validation', 'test', 'other', 'invalidated']
    And I cannot know the total number of samples per locale without first downloading the whole dataset (see the sketch after this list).
  4. If I go with dataset.select(range(20000)), the whole dataset still gets downloaded, which eventually leads to running out of space.
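
The closest workaround I can think of for point 3 is reading the per-split example counts from the dataset’s metadata, which should not require downloading any audio. This is only a sketch, assuming Common Voice 13 actually publishes those counts (and that you’re authenticated, since the dataset is gated):

from datasets import load_dataset_builder

# Fetches only the loading script and metadata, not the audio archives.
builder = load_dataset_builder("mozilla-foundation/common_voice_13_0", "ab")

# Only useful if the repository publishes split sizes in its metadata;
# otherwise builder.info.splits may be None.
if builder.info.splits is not None:
    print(builder.info.splits["train"].num_examples)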

Does anyone know a middle ground? How can I process the first 20k samples of a dataset without downloading the whole thing?

split="train[0:20000]"

This slicing syntax is something we still need to implement for streaming mode. In the meantime, you can use dataset = dataset.take(20000) to fetch the first 20k samples of a streamed dataset.
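
For example, adapting your loop (and reusing your locales, splits, and process from above), a rough sketch could look like this; Dataset.from_generator is just one way to materialize the streamed samples so that save_to_disk keeps working, not something required by streaming itself:

from pathlib import Path

from datasets import Dataset, load_dataset, concatenate_datasets

def first_20k(locale, split):
    # Stream the split and stop after 20k samples; locales with fewer
    # than 20k samples simply yield everything they have.
    streamed = load_dataset("mozilla-foundation/common_voice_13_0",
                            locale, split=split,
                            streaming=True).take(20_000)
    # Materialize the streamed samples into a regular Dataset so that
    # .map(process) and save_to_disk work as in your original loop.
    return Dataset.from_generator(lambda: (example for example in streamed))

for s in splits:
    Path(f"./cv_13/{s}").mkdir(parents=True, exist_ok=True)
    mapped_datasets = [first_20k(l, s).map(process) for l in locales]
    concatenate_datasets(mapped_datasets).save_to_disk(f"./cv_13/{s}")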