I’m trying to work with Common Voice 13, but it’s too large for my hard drive and I keep running out of space. Before you suggest using `streaming=True`, please read the rest.
It’s too large because I need all 60 locales. But I’m a reasonable man, and I don’t expect to work with all the samples; 20k samples per locale is enough for my purpose. I just cannot find a way to select only the first 20k samples without downloading the whole dataset first. Before you suggest using `split="train[0:20000]"`, please read the rest.
Before telling you about the problems I’m facing, here’s what I’m trying to achieve:
locales = ["ab", "ar", "as", "br", "ca", "cnh",
"cs", "cv", "cy", "de", "dv", "el",
"en","eo", "es", "et", "eu", "fa",
"fi", "fr", "fy-NL", "ga-IE", "hi",
"hsb", "hu", "ia", "id", "it", "ja",
"ka", "kab", "ky", "lg", "lt", "lv",
"mn", "mt", "nl", "or", "pa-IN",
"pl", "pt", "rm-sursilv",
"rm-vallader", "ro", "ru", "rw",
"sah", "sl", "sv-SE", "ta", "th",
"tr", "tt", "uk", "vi", "vot",
"zh-CN", "zh-HK", "zh-TW"]
splits = ["train", "validation", "test"]
for s in splits:
Path(f"./cv_13/{s}").mkdir(parents=True, exist_ok=True)
mapped_datasets = []
for l in locales:
dataset = load_dataset("mozilla-foundation/common_voice_13_0",
l, split=f"{s}")
transformed_dataset = dataset.map(process)
mapped_datasets.append(transformed_dataset)
combined_datasets = concatenate_datasets(mapped_datasets)
combined_datasets.save_to_disk(f"./cv_13/{s}")
What I’m trying to achieve here is to process the dataset once and save the result for later use. It’s just that, due to capacity limitations, I need to do this for only 20k samples per locale.
Here’s my dilemma:
- If I go with `streaming=True`, then I cannot specify the 20k limit.
- If I go with `split="train[0:20000]"`, this option will error out for the locales with fewer than 20k samples (sketched after this list).
- If I go with both options at the same time:
  `ValueError: Bad split: train[20000]. Available splits: ['train', 'validation', 'test', 'other', 'invalidated']`
  And I cannot know the total number of samples per locale without first downloading the whole dataset.
- If I go with `dataset.select(range(20000))`, the whole dataset gets downloaded, which eventually leads to running out of space.
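To make the failure modes concrete, here is roughly what those two slicing attempts look like (using `ab`, the first locale in my list, just as an example):

```python
from datasets import load_dataset

# Plain slicing: fine for large locales, but errors out for any locale
# whose train split has fewer than 20k samples.
ds = load_dataset("mozilla-foundation/common_voice_13_0", "ab",
                  split="train[0:20000]")

# Slicing combined with streaming: streaming accepts only plain split
# names, so this raises the "Bad split" ValueError quoted above.
ds = load_dataset("mozilla-foundation/common_voice_13_0", "ab",
                  split="train[0:20000]", streaming=True)
```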
Does anyone know a middle ground? How can I process the first 20k samples of a dataset without downloading the whole thing?