How to use several datasets that fit into the RAM?

Hello,

I have the following issue. I am trying to train a language model in TensorFlow using the nice new TF notebook: notebooks/language_modeling_from_scratch-tf.ipynb at master · huggingface/notebooks · GitHub

I know that the starting point of the training is to load the data using the datasets package:

from datasets import load_dataset
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

and then do something like:

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
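
Here tokenize_function is the one from the notebook; roughly, it looks something like this (the tokenizer checkpoint is just a placeholder, not necessarily the one the notebook uses):

from transformers import AutoTokenizer

# placeholder checkpoint; the notebook builds/loads its own tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    # tokenize the raw "text" column; datasets passes batches of rows when batched=True
    return tokenizer(examples["text"])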

My issue is that my text data is stored on my disk as several parquet files with the following constraints:

  1. each file easily fits into RAM (say, as a pandas DataFrame)
  2. there are too many files to load them all into RAM at once

What is the best way to proceed with datasets here? Concatenating the files will not work, as my RAM is not sufficient, and the training in the notebook seems to work only with a single datasets Dataset object.

What do you think?
Thanks!

Hi! You can load all your parquet files this way:

datasets = load_dataset("path/to/the/directory/containing/your/parquet/files")

It does two things:

  • it converts all the parquet files into Arrow files and stores them in the datasets cache
  • it opens the Arrow files using memory mapping, which means the part of your disk that contains the data is used as virtual memory to load the dataset. Therefore it loads all your data without filling up your RAM.
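
If you prefer to be explicit about the file format, this pattern should also work (the glob pattern below is a placeholder for your own paths):

from datasets import load_dataset

datasets = load_dataset(
    "parquet",
    data_files={"train": "path/to/the/directory/containing/your/parquet/files/*.parquet"},
)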

Then, once your dataset is loaded, you can tokenize it:

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
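
If you want to confirm that nothing was pulled into RAM, you can inspect the Arrow cache files backing each split. This is just an optional sanity check, assuming your data ended up in a "train" split (the default for local files):

# the tokenized dataset is also backed by Arrow files on disk (memory-mapped),
# so indexing rows does not require loading the whole dataset into memory
print(tokenized_datasets)
print(tokenized_datasets["train"].cache_files)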