Hello, I'm trying to use the Mozilla Common Voice datasets, but I don't have enough storage to download the ones I'm using in full. I only need 20K samples from each dataset (for reference, the English one has ~400K rows). I saw that there is a way to stream the data instead of downloading it, but when I tried that I ran into memory problems instead.
I ran the following code:
from datasets import load_dataset, DatasetDict, Dataset

language_code = "en"
num_samples_train = 20000
num_samples_test = 2000

# Placeholders to be filled with the subsets below
common_voice = DatasetDict({
    "train": [],
    "test": []
})

# Stream the splits instead of downloading them in full
ds_train = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="train", streaming=True)
ds_test = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="test", streaming=True)

# Keep only the first N examples and the two columns I need
train_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_train.take(num_samples_train)]
test_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_test.take(num_samples_test)]
print("finished loading data")

common_voice["train"] = Dataset.from_list(train_samples)
common_voice["test"] = Dataset.from_list(test_samples)
print(common_voice)
It raised this error on the line common_voice["train"] = Dataset.from_list(train_samples):

pyarrow.lib.ArrowMemoryError: realloc of size 2571632640 failed

I assume this is because Dataset.from_list builds the whole Arrow table in RAM, and each sample's audio field contains the full decoded waveform, so 20K samples don't fit in memory.
I would like to know how to do this more efficiently, or how to save the subset of the dataset to disk and then load it later.
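For example, would something like the sketch below be the right approach? It is only a sketch: my understanding is that Dataset.from_generator writes examples to an Arrow cache file on disk as they are produced, so the subset never has to sit in memory all at once, and save_to_disk / load_from_disk would let me reuse it afterwards. The output path "common_voice_subset" is just a placeholder.

from datasets import load_dataset, Dataset, DatasetDict

language_code = "en"
num_samples_train = 20000
num_samples_test = 2000

def stream_subset(split, n):
    # Stream the split and yield only the first n examples,
    # keeping just the two columns I need.
    ds = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split=split, streaming=True)
    for x in ds.take(n):
        yield {"sentence": x["sentence"], "audio": x["audio"]}

# from_generator writes each example to an on-disk Arrow file as it
# arrives, instead of building the whole table in RAM like from_list.
common_voice = DatasetDict({
    "train": Dataset.from_generator(stream_subset, gen_kwargs={"split": "train", "n": num_samples_train}),
    "test": Dataset.from_generator(stream_subset, gen_kwargs={"split": "test", "n": num_samples_test}),
})

# Save the subset so it can be reloaded later without streaming again,
# e.g. common_voice = datasets.load_from_disk("common_voice_subset")
common_voice.save_to_disk("common_voice_subset")

Does that sound right, or is there a better way?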
Thanks