With 6 TB, it’s not impossible to download, but it’s certainly better to be able to handle it with streaming…
If you don’t want to change the contents of the dataset too much, you can write a loading script or builder class, upload it to the repo, and then load the dataset with trust_remote_code=True.
Also, if the data consists of media files, the WebDataset approach may be a good fit for uploading in shards. @lhoestq
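For reference, a minimal sketch of the streaming approach (the repo id is a placeholder, and trust_remote_code is only needed if the repo actually ships a loading script):

```python
from datasets import load_dataset

# Placeholder repo id; streaming avoids downloading the full 6 TB up front.
ds = load_dataset(
    "username/my-large-dataset",
    split="train",
    streaming=True,
    # trust_remote_code=True,  # only if the repo uses a loading script
)

# Examples are fetched lazily as you iterate.
for example in ds.take(3):
    print(example)
```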
Loading large dataset
Hi, I have a ~1 TB dataset stored on the HF Hub. I can download it to my disk and read it successfully. Nevertheless, it’s large enough that I can’t fit it in my RAM.
What is the best practice for training a model on such a dataset?
I tried loading the dataset with load_dataset(..., streaming=True) and then having two buffers: one that the training process loads onto the GPU, and one that a separate thread fills by streaming from the dataset. Then, when the i…
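A minimal sketch of one common alternative to hand-rolled double buffering (repo id, batch size, and worker counts are placeholders; it assumes a reasonably recent datasets version that distributes shards across DataLoader workers):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

stream = load_dataset("username/my-1tb-dataset", split="train", streaming=True)
stream = stream.shuffle(buffer_size=10_000, seed=42)  # approximate shuffle over a buffer
stream = stream.with_format("torch")

# The DataLoader workers prefetch batches in the background,
# playing the role of the second buffer.
loader = DataLoader(stream, batch_size=32, num_workers=4, prefetch_factor=4)

for batch in loader:
    ...  # move the batch to the GPU and run a training step
```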
Saving large dataset
I’m about to create a large dataset directly: about ~1B samples, each roughly [16 x 8000] in size plus some small metadata. Do you foresee any issues during generation, or when loading and using it after it’s finished generating? Any ideas are welcome, thank you.
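A minimal sketch of one way to generate such a dataset without holding it in RAM (shapes, column names, and the sample count are placeholders): Dataset.from_generator writes examples to Arrow on disk as they are yielded.

```python
import numpy as np
from datasets import Array2D, Dataset, Features, Value

features = Features({
    "signal": Array2D(shape=(16, 8000), dtype="float32"),
    "label": Value("int64"),
})

def gen():
    # Would be ~1B iterations in the real case; kept tiny here.
    for i in range(1_000):
        yield {"signal": np.zeros((16, 8000), dtype=np.float32), "label": i % 10}

ds = Dataset.from_generator(gen, features=features)
ds.save_to_disk("my_large_dataset")  # or push_to_hub(...), uploaded in shards
```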
Dear HuggingFace developers,
I would like to upload some heavy datasets (more than 1 TB, for instance RedPajama-V1) to the Jean-Zay supercomputer (France). For security reasons, the only way I found was to download and save the dataset piece by piece on my own professional computer, upload the pieces one after the other to Jean-Zay, delete the Arrow tables to free disk space on my computer, and restart the program to download the next pieces of the dataset. The saved pieces of dataset a…
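For comparison, a minimal sketch of one way to do this kind of piece-by-piece transfer with streaming, so only one shard at a time ever sits on the local disk (repo id, shard size, and paths are placeholders):

```python
from datasets import Dataset, load_dataset

stream = load_dataset("org/redpajama-v1", split="train", streaming=True)  # placeholder repo id

buffer, shard_id, shard_size = [], 0, 100_000
for example in stream:
    buffer.append(example)
    if len(buffer) == shard_size:
        # Save the shard, copy it to Jean-Zay, then delete it locally.
        Dataset.from_list(buffer).save_to_disk(f"shard_{shard_id:05d}")
        buffer, shard_id = [], shard_id + 1

if buffer:
    Dataset.from_list(buffer).save_to_disk(f"shard_{shard_id:05d}")
```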
Building dataset
Could you please enumerate the pros and cons of these two dataset builder classes? I couldn’t find anything in the documentation. When would I prefer one over the other? Is ArrowBasedBuilder more performant for large datasets?
Thank you!
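For context, a rough sketch of the main difference (file name and schema are placeholders; _info and _split_generators are omitted for brevity): GeneratorBasedBuilder yields one example dict at a time, which datasets then encodes row by row, while ArrowBasedBuilder yields whole pyarrow Tables and skips the per-example encoding, which generally makes it faster for large datasets.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import datasets

class WithGeneratorBasedBuilder(datasets.GeneratorBasedBuilder):
    def _generate_examples(self, path):
        for idx, text in enumerate(pq.read_table(path).column("text").to_pylist()):
            yield idx, {"text": text}  # (key, example dict), one row at a time

class WithArrowBasedBuilder(datasets.ArrowBasedBuilder):
    def _generate_tables(self, path):
        for idx, batch in enumerate(pq.ParquetFile(path).iter_batches(batch_size=10_000)):
            yield idx, pa.Table.from_batches([batch])  # (key, pyarrow Table), whole batches
```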
Hello everyone!
I want to share multiple datasets in the same repo <my_username>/<my_repo_name>, each in its own folder. The datasets in each folder are already in sharded Arrow format (for best performance) and contain different splits, as usual. To read any of these datasets with load_dataset, I would need a loading script to tell HF how to read from the folders, right? If so, should I use the ArrowBasedBuilder, and how? I only see tutorials for GeneratorBasedBuilder!
Thanks!
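A minimal sketch of what such a script could look like, assuming each folder contains Arrow IPC shards named like train-00000-of-00004.arrow and that the files are available locally (or have already been resolved via dl_manager); the folder names, configs, and glob pattern are all assumptions:

```python
import glob
import os
import pyarrow as pa
import datasets

class MyArrowDatasets(datasets.ArrowBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="dataset_a", data_dir="dataset_a"),
        datasets.BuilderConfig(name="dataset_b", data_dir="dataset_b"),
    ]

    def _info(self):
        return datasets.DatasetInfo()  # features are inferred from the yielded tables

    def _split_generators(self, dl_manager):
        splits = []
        for split in (datasets.Split.TRAIN, datasets.Split.VALIDATION, datasets.Split.TEST):
            files = sorted(glob.glob(os.path.join(self.config.data_dir, f"{split}-*.arrow")))
            if files:
                splits.append(datasets.SplitGenerator(name=split, gen_kwargs={"files": files}))
        return splits

    def _generate_tables(self, files):
        for file_idx, path in enumerate(files):
            # Assumes the shards use the Arrow IPC stream format, as written by datasets.
            with pa.OSFile(path, "rb") as source:
                yield file_idx, pa.ipc.open_stream(source).read_all()
```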
Hello. I want to train my VLM on a large-scale image dataset with the Hugging Face Trainer.
I initially planned to follow the method suggested by HuggingFaceM4, which involves embedding PIL images within Arrow files. However, I found this approach problematic when dealing with datasets like DocStruct4M. The challenges included handling large files such as infographics and processing millions of data points. Furthermore, uploading such large image datasets to the Hub proved difficult.
Therefo…
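As a point of comparison, a minimal sketch of one common alternative to embedding decoded PIL images (paths and column names are placeholders, and this is not necessarily the approach the poster settled on): keep a column of image file paths and cast it to the Image feature, so decoding happens lazily on access.

```python
from datasets import Dataset, Image

ds = Dataset.from_dict({
    "image": ["images/0001.png", "images/0002.png"],  # file paths, not pixel data
    "text": ["caption one", "caption two"],
})
ds = ds.cast_column("image", Image())  # decoded to PIL only when a row is accessed
print(ds[0]["image"])
```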
I’m part of the BirdSet team, and we’ve identified an issue with our current Builder script.
Some of the audio datasets we work with are quite large, and we aim to provide access to individual audio files. To achieve this, we first download the archive file, extract its contents, and then generate the dataset. The reason for accessing the audio files directly is that we don’t need to load the entire audio file but only specific parts, which is possible using the soundfile library. This approach…
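A minimal sketch of the partial-read pattern with soundfile (file name, offset, and window length are placeholders):

```python
import soundfile as sf

with sf.SoundFile("XC123456.ogg") as f:
    sr = f.samplerate
    f.seek(int(5.0 * sr))                   # jump to t = 5 s
    segment = f.read(frames=int(3.0 * sr))  # read only a 3 s window

print(segment.shape, sr)
```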
WebDataset
I’m trying to train a model on this dataset from the Hugging Face Hub: MLCommons/unsupervised_peoples_speech.
I’m using WebDataset to iterate over the tar files using brace expansion. This is basically a wrapper on top of torch’s IterableDataset. The problem is that if I set more than 1 worker in the loader, I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.…
Troubleshooting for dataset