I would like to upload some heavy datasets (more than 1 TB, for instance RedPajama-V1) to the Jean-Zay supercomputer (France). For security reasons, the only way I found was to download and save the dataset piece by piece on my own professional computer, upload the pieces one after the other to Jean-Zay, then delete the Arrow tables to free disk space on my computer, and restart the program to download the next pieces. The saved pieces of the dataset are Arrow tables, but in practice I used the batch method as a proxy, like this:
```python
# each batch is saved as its own Arrow table under an indexed path
for i, batch in enumerate(tqdm(dataset_stream)):
    Dataset.from_dict(batch).save_to_disk(f"{saving_path}/piece_{i:05d}")
```
This is not easy. It crashes from time to time for various reasons, including connection cuts. The best would be a method to download the Arrow tables by requesting their index, because it seems that after the program crashes I can't restart the stream from the middle of an IterableDataset, only from the beginning, which is unsuitable. Of course, downloading the whole dataset of several terabytes at once is not possible on my personal computer.
Do you have any suggestions on how to efficiently deal with big data using HuggingFace datasets.Dataset objects?
You can resume an IterableDataset using .state_dict() and .load_state_dict().
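For example, here is a minimal sketch of the checkpoint-and-resume pattern. The repo/config names, piece size, and file paths are placeholders, and it assumes a `datasets` version recent enough to support IterableDataset checkpointing (>= 2.18.0) and one that can still run the RedPajama loading script:

```python
import json
import os

from datasets import Dataset, load_dataset
from tqdm import tqdm

STATE_FILE = "stream_state.json"  # hypothetical checkpoint file
PIECE_SIZE = 100_000              # examples per saved piece (arbitrary)

# Stream the data instead of downloading it all (repo/config names are examples).
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T", "arxiv",
    split="train", streaming=True, trust_remote_code=True,
)

# Resume from the last checkpoint if a previous run crashed.
state = {"stream": None, "piece_idx": 0}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        state = json.load(f)
    ds.load_state_dict(state["stream"])

buffer = []
for example in tqdm(ds):
    buffer.append(example)
    if len(buffer) == PIECE_SIZE:
        Dataset.from_list(buffer).save_to_disk(f"pieces/piece_{state['piece_idx']:05d}")
        # Checkpoint the stream position and the piece counter after each saved piece,
        # so the next run picks up where this one stopped.
        state = {"stream": ds.state_dict(), "piece_idx": state["piece_idx"] + 1}
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)  # pickle it instead if the state isn't JSON-serializable
        buffer = []

if buffer:  # flush the last, partial piece at the end of the stream
    Dataset.from_list(buffer).save_to_disk(f"pieces/piece_{state['piece_idx']:05d}")
```

Under the hood the state dict stores which shard the stream is on and the example offset inside it, so resuming skips the already-read shards instead of restarting from the beginning.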
RedPajama is a bit particular because it's based on a Python script to load and parse the data, and this script can be read by datasets but not by other data tools. It's a legacy way of sharing datasets and is discouraged; it would be cool to have this dataset in a standard data format instead.
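And if it helps, the Arrow pieces you save locally can be re-exported to Parquet, a standard format that other data tools can read. A quick sketch with hypothetical paths:

```python
import os

from datasets import load_from_disk

os.makedirs("pieces_parquet", exist_ok=True)
piece = load_from_disk("pieces/piece_00000")             # one saved Arrow piece
piece.to_parquet("pieces_parquet/piece_00000.parquet")   # readable by any Parquet-aware tool
```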