Hi friends👋
I want to use RedPajama to train a language model. I have access to a folder where the entire dataset is already downloaded, and I want to load those local files with the datasets library for simple usage. Other people rely on the downloaded data, so I’m not allowed to change anything in that folder. How can I do that?
I’m currently using the following code to load the dataset:
import os
from datasets import load_dataset
os.environ["RED_PAJAMA_DATA_DIR"] = PATH_TO_DOWNLOADED_RED_PAJAMA
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "book")
But when I execute that code, it generates a training split and saves everything to the cache folder, creating a file as large as the original dataset. I would like to read the downloaded data in place and keep the additional storage the code needs to a minimum.
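
To make the goal concrete, here is a minimal sketch of what "reading in place" could look like with plain Python. It is not a datasets feature and not taken from the RedPajama loading script. It assumes the downloaded folder contains JSON Lines files (one JSON object per line); the function name iter_jsonl_records and the "*.jsonl" pattern are hypothetical and would need adjusting to the real layout.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(data_dir: str, pattern: str = "*.jsonl") -> Iterator[dict]:
    """Yield one record at a time from every matching file under data_dir.

    Files are opened read-only, so nothing in the shared folder is modified
    and no second copy of the data is written to disk.
    Assumption: each non-empty line is a standalone JSON object.
    """
    for path in sorted(Path(data_dir).rglob(pattern)):
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

A generator like this only ever holds one record in memory and writes nothing to the cache. load_dataset also accepts streaming=True, which returns an iterable dataset instead of building a cached copy; whether that works with this particular loading script and RED_PAJAMA_DATA_DIR is something I have not verified.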