How to save/use only the first 20k samples of a dataset

Hello, I am trying to use the Mozilla Common Voice datasets, and I ran into a problem: I don’t have enough storage to save the ones I am using.

I only need 20k samples from each dataset that I use (for reference, the English one has ~400k rows). I saw that there is a way to stream the data instead of downloading it, but when I tried doing so I ran into memory problems instead.

Running the following code:

from datasets import load_dataset, DatasetDict, Dataset

language_code = "en"
num_samples_train = 20000
num_samples_test = 2000

common_voice = DatasetDict({
    "train": [],
    "test": []
})

# Stream the splits instead of downloading them in full
ds_train = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="train", streaming=True)
ds_test = load_dataset("mozilla-foundation/common_voice_11_0", language_code, split="test", streaming=True)

# Materialize only the first N samples of each split
train_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_train.take(num_samples_train)]
test_samples = [{"sentence": x["sentence"], "audio": x["audio"]} for x in ds_test.take(num_samples_test)]

print("finished loading data")

common_voice["train"] = Dataset.from_list(train_samples)
common_voice["test"] = Dataset.from_list(test_samples)

print(common_voice)

This raised the error pyarrow.lib.ArrowMemoryError: realloc of size 2571632640 failed on the line common_voice["train"] = Dataset.from_list(train_samples).

I would like to know how I can do this more efficiently, or how to save just the subset of the dataset and then load it.

Thanks

Hello,
The problem comes from holding everything in memory at once: the "samples" lists keep all the decoded audio, and Dataset.from_list then builds an Arrow table from them on top, roughly doubling memory consumption. How about serializing your 20k samples to disk instead, like so:

import pyarrow as pa

train_file = "train_samples.arrow"
test_file = "test_samples.arrow"

# Build pyarrow Tables directly from the lists of dictionaries
train_table = pa.Table.from_pylist(train_samples)
test_table = pa.Table.from_pylist(test_samples)

# Save the tables to disk in the Arrow IPC stream format
with pa.OSFile(train_file, "wb") as sink, pa.ipc.new_stream(sink, train_table.schema) as writer:
    writer.write_table(train_table)
with pa.OSFile(test_file, "wb") as sink, pa.ipc.new_stream(sink, test_table.schema) as writer:
    writer.write_table(test_table)

at which point you can even free the in-memory copies:

del train_samples
del train_table
del test_samples
del test_table

and then load them lazily, something like this (replace the file paths with ones that apply to you):

from datasets import load_dataset

# Load the Arrow files lazily (memory-mapped) as a DatasetDict
common_voice = load_dataset(
    "arrow",
    data_files={
        "train": "train_samples.arrow",
        "test": "test_samples.arrow"
    }
)
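
If even building the 20k-sample lists is too much for your machine, a fully streaming variant is to flush a chunk of samples to the Arrow file as you iterate, so only a small chunk is ever in memory at once. A rough sketch for the train split (the 500-sample chunk size and the file name are just illustrative, and it assumes schema inference stays stable across chunks):

import pyarrow as pa
from datasets import load_dataset

ds_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True
)

chunk = []
writer = None
with pa.OSFile("train_samples.arrow", "wb") as sink:
    for x in ds_train.take(20000):
        chunk.append({"sentence": x["sentence"], "audio": x["audio"]})
        if len(chunk) == 500:  # flush every 500 samples to keep memory bounded
            table = pa.Table.from_pylist(chunk)
            if writer is None:
                # create the stream writer lazily, once the schema is known
                writer = pa.ipc.new_stream(sink, table.schema)
            writer.write_table(table)
            chunk = []
    if chunk:  # write any leftover samples
        table = pa.Table.from_pylist(chunk)
        if writer is None:
            writer = pa.ipc.new_stream(sink, table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()

From there, the same load_dataset("arrow", ...) call as above should pick the file up.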

Hope this helps.
