How to use several datasets that fit into the RAM?

Hello,

I have the following issue. I am trying to train a language model in TensorFlow using the nice new TF notebook: notebooks/language_modeling_from_scratch-tf.ipynb at master · huggingface/notebooks · GitHub

I know that the starting point of the training is to load the data using the datasets package:

from datasets import load_dataset
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

and then do something like:

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
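
Here tokenize_function is the one from the notebook; roughly, it looks something like this (the tokenizer checkpoint is just a placeholder, not necessarily the one the notebook uses):

from transformers import AutoTokenizer

# placeholder checkpoint; the notebook builds/loads its own tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    # tokenize the raw "text" column; datasets passes batches of rows when batched=True
    return tokenizer(examples["text"])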

My issue is that my text data is stored on my disk as several parquet files with the following constraints:

  1. each file easily fits into RAM (say, as a pandas DataFrame)
  2. there are too many files to load them all into RAM at once

What is the best way to proceed with datasets here? Concatenating the files will not work, as my RAM is not sufficient, and the training in the notebook seems to work only with a single datasets Dataset object.

What do you think?
Thanks!

Hi! You can load all your parquet files this way:

datasets = load_dataset("path/to/the/directory/containing/your/parquet/files")

It does two things:

  • it converts all the parquet files into Arrow files and stores them in the datasets cache
  • it opens the Arrow files using memory mapping, which means the part of your disk that contains the data is used as virtual memory to load the dataset. Therefore it loads all your data without filling up your RAM.
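
If you prefer to be explicit about the file format, this pattern should also work (the glob pattern below is a placeholder for your own paths):

from datasets import load_dataset

datasets = load_dataset(
    "parquet",
    data_files={"train": "path/to/the/directory/containing/your/parquet/files/*.parquet"},
)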

Then, once your dataset is loaded, you can tokenize it:

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
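
If you want to confirm that nothing was pulled into RAM, you can inspect the Arrow cache files backing each split. This is just an optional sanity check, assuming your data ended up in a "train" split (the default for local files):

# the tokenized dataset is also backed by Arrow files on disk (memory-mapped),
# so indexing rows does not require loading the whole dataset into memory
print(tokenized_datasets)
print(tokenized_datasets["train"].cache_files)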