How to save/use only the first 20k samples of a dataset

Hello, I am trying to use the Mozilla Common Voice datasets, and I ran into a problem: I don’t have enough storage to save the ones I am using.

I only need 20k samples from each dataset that I use (for reference, the English one has ~400k rows). I saw that there is a way to stream the data instead of downloading it, but when I tried doing so I ran into memory problems instead.

Running the following code:

from datasets import load_dataset, DatasetDict, Dataset

language_code = "en"
num_samples_train = 20000
num_samples_test = 2000

common_voice = DatasetDict({
    "train": [],
    "test": []
})

# Stream the splits instead of downloading them in full
ds_train = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="train", streaming=True)
ds_test = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="test", streaming=True)

# Materialize only the first N samples of each split
train_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_train.take(num_samples_train)]
test_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_test.take(num_samples_test)]

print("finished loading data")

common_voice["train"] = Dataset.from_list(train_samples)
common_voice["test"] = Dataset.from_list(test_samples)

print(common_voice)

This raised the error pyarrow.lib.ArrowMemoryError: realloc of size 2571632640 failed on the line common_voice["train"] = Dataset.from_list(train_samples).

I would like to know how I can do this more efficiently, or how to save just the subset of the dataset and then load it.

Thanks

Hello,
The problem comes from holding everything in memory at once: the "samples" lists keep all the decoded audio, and Dataset.from_list then builds an Arrow table from them on top, roughly doubling memory consumption. How about serializing your 20k samples to disk instead, like so:

import pyarrow as pa

train_file = "train_samples.arrow"
test_file = "test_samples.arrow"

# Build pyarrow Tables directly from the lists of dictionaries
train_table = pa.Table.from_pylist(train_samples)
test_table = pa.Table.from_pylist(test_samples)

# Save the tables to disk in the Arrow IPC stream format
with pa.OSFile(train_file, "wb") as sink, pa.ipc.new_stream(sink, train_table.schema) as writer:
    writer.write_table(train_table)
with pa.OSFile(test_file, "wb") as sink, pa.ipc.new_stream(sink, test_table.schema) as writer:
    writer.write_table(test_table)

at which point you can even free the in-memory copies:

del train_samples
del train_table
del test_samples
del test_table

and then load them lazily, something like this (replace the file paths with ones that apply to you):

from datasets import load_dataset

# Load the Arrow files lazily (memory-mapped) as a DatasetDict
common_voice = load_dataset(
    "arrow",
    data_files={
        "train": "train_samples.arrow",
        "test": "test_samples.arrow"
    }
)
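
If even building the 20k-sample lists is too much for your machine, a fully streaming variant is to flush a chunk of samples to the Arrow file as you iterate, so only a small chunk is ever in memory at once. A rough sketch for the train split (the 500-sample chunk size and the file name are just illustrative, and it assumes schema inference stays stable across chunks):

import pyarrow as pa
from datasets import load_dataset

ds_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True
)

chunk = []
writer = None
with pa.OSFile("train_samples.arrow", "wb") as sink:
    for x in ds_train.take(20000):
        chunk.append({"sentence": x["sentence"], "audio": x["audio"]})
        if len(chunk) == 500:  # flush every 500 samples to keep memory bounded
            table = pa.Table.from_pylist(chunk)
            if writer is None:
                # create the stream writer lazily, once the schema is known
                writer = pa.ipc.new_stream(sink, table.schema)
            writer.write_table(table)
            chunk = []
    if chunk:  # write any leftover samples
        table = pa.Table.from_pylist(chunk)
        if writer is None:
            writer = pa.ipc.new_stream(sink, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()

From there, the same load_dataset("arrow", ...) call as above should pick the file up.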

Hope this helps.
