Using a local dataset without copying it to the cache

Hi friends 👋
I want to use RedPajama to train a language model. I have access to a folder where the entire dataset is already downloaded, and I now want to load those local files with 🤗 Datasets for easy use. Other people rely on the downloaded data, so I'm not allowed to change anything in that folder. How can I do that?

I’m currently using the following code to load the dataset:

import os
from datasets import load_dataset

os.environ["RED_PAJAMA_DATA_DIR"] = PATH_TO_DOWNLOADED_RED_PAJAMA
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "book")

But when I execute this code, it generates a train split and writes everything to the cache folder, creating files roughly as large as the original dataset. I would like to use the already-downloaded data directly and limit the additional storage the code requires.

Hi! load_dataset saves a generated dataset in the Arrow format to be able to memory-map it later, so this is the expected behavior. If you are not interested in random access, you can pass streaming=True to get an IterableDataset that doesn’t require cache files.
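For example, a minimal sketch of the streaming approach, assuming the same RED_PAJAMA_DATA_DIR setup as in your snippet (the "text" field name is an assumption about the RedPajama examples):

import os
from datasets import load_dataset

# Same environment variable as above, pointing at the local copy
os.environ["RED_PAJAMA_DATA_DIR"] = PATH_TO_DOWNLOADED_RED_PAJAMA

# streaming=True returns an IterableDataset, so no Arrow cache files are
# written; examples are read from the local files on the fly
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "book", streaming=True)

# Iterate lazily over the train split
for example in ds["train"]:
    # "text" is assumed to be the field holding the document contents
    print(example["text"][:100])
    break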


Perfect, thank you for this straightforward solution!