I would like to upload some heavy datasets (more than 1 TB, for instance RedPajama-V1) to the Jean-Zay supercomputer (France). For security reasons, the only way I found was to download and save the dataset piece by piece on my own professional computer, upload the pieces one after the other to Jean-Zay, then delete the Arrow tables to free disk space on my computer, and restart the program to download the next pieces. The saved pieces of the dataset are Arrow tables, but in practice I used the batch method as a proxy, like this:
```python
# each batch is saved as its own Arrow table under an indexed path
for i, batch in enumerate(tqdm(dataset_stream)):
    Dataset.from_dict(batch).save_to_disk(f"{saving_path}/piece_{i:05d}")
```
This is not easy. It crashes from time to time for various reasons, including connection cuts. The best would be a method to download the Arrow tables by requesting their index, because it seems that after the program crashes I can't restart the stream from the middle of an IterableDataset, only from the beginning, which is unsuitable. Of course, downloading the whole dataset of several terabytes at once is not possible on my personal computer.
Do you have any suggestions on how to efficiently deal with big data using HuggingFace datasets.Dataset objects?
You can resume an IterableDataset using .state_dict() and .load_state_dict().
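For example, here is a minimal sketch of the checkpoint-and-resume pattern. The repo/config names, piece size, and file paths are placeholders, and it assumes a `datasets` version recent enough to support IterableDataset checkpointing (>= 2.18.0) and one that can still run the RedPajama loading script:

```python
import json
import os

from datasets import Dataset, load_dataset
from tqdm import tqdm

STATE_FILE = "stream_state.json"  # hypothetical checkpoint file
PIECE_SIZE = 100_000              # examples per saved piece (arbitrary)

# Stream the data instead of downloading it all (repo/config names are examples).
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv",
    split="train", streaming=True, trust_remote_code=True,
)

# Resume from the last checkpoint if a previous run crashed.
state = {"stream": None, "piece_idx": 0}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        state = json.load(f)
    ds.load_state_dict(state["stream"])

buffer = []
for example in tqdm(ds):
    buffer.append(example)
    if len(buffer) == PIECE_SIZE:
        Dataset.from_list(buffer).save_to_disk(f"pieces/piece_{state['piece_idx']:05d}")
        # Checkpoint the stream position and the piece counter after each saved piece,
        # so the next run picks up where this one stopped.
        state = {"stream": ds.state_dict(), "piece_idx": state["piece_idx"] + 1}
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)  # pickle it instead if the state isn't JSON-serializable
        buffer = []

if buffer:  # flush the last, partial piece at the end of the stream
    Dataset.from_list(buffer).save_to_disk(f"pieces/piece_{state['piece_idx']:05d}")
```

Under the hood the state dict stores which shard the stream is on and the example offset inside it, so resuming skips the already-read shards instead of restarting from the beginning.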
RedPajama is a bit particular because it's based on a Python script to load and parse the data, and this script can be read by datasets but not by other data tools. It's a legacy way of sharing datasets and is discouraged; it would be cool to have this dataset in a standard data format instead.
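And if it helps, the Arrow pieces you save locally can be re-exported to Parquet, a standard format that other data tools can read. A quick sketch with hypothetical paths:

```python
import os

from datasets import load_from_disk

os.makedirs("pieces_parquet", exist_ok=True)
piece = load_from_disk("pieces/piece_00000")             # one saved Arrow piece
piece.to_parquet("pieces_parquet/piece_00000.parquet")   # readable by any Parquet-aware tool
```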