Hello,
I’ve been struggling with large datasets on our HPC cluster, and I don’t think I fully understand how to use the datasets library.
I want to use fineweb-edu, so I downloaded the dataset with snapshot_download
as shown below (I skip the sample subsets of the repo). The dataset is stored on an NFS server.
...
_ = snapshot_download(
    "HuggingFaceFW/fineweb-edu",
    repo_type="dataset",
    allow_patterns=["data/*"],
    local_dir=os.path.join(target_dir, "HuggingFaceFW/fineweb-edu"),
)
...
Now I want to load this dataset. As I understand it, load_dataset
will build an optimized, cached (Arrow) version of the dataset, so I specify the location of the cache with the HF_HOME
variable and load the dataset.
...
_ = load_dataset(os.path.join(target_dir, "HuggingFaceFW/fineweb-edu", "data"), num_proc=12)
...
The cached version is built correctly, so far so good. Now I want to load this cached version without having access to the original parquet files. How can I do that? I’ve read the documentation several times, and the best option seems to be the arrow
builder, i.e. load_dataset("arrow", data_files=data_files),
but that builds yet another cached version of the dataset. I’ve tried to disable caching, without success.
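For reference, this is roughly what I tried (the glob pattern is only an illustration of how one might collect the Arrow files that load_dataset left in the cache):

import glob
from datasets import load_dataset

# Illustrative pattern: point data_files at the .arrow shards that
# load_dataset wrote into the cache under HF_HOME.
data_files = sorted(
    glob.glob("/path/to/hf_home/datasets/**/*.arrow", recursive=True)
)

ds = load_dataset("arrow", data_files=data_files)
# This loads, but it writes yet another cached copy of the dataset.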
I am sure I am missing something; could someone point me in the right direction?
Thanks,
Julien
Thanks for the pointers, I’ve read them already.
I think my question could be rephrased as: how can we manage large datasets in a large company? More precisely, I am looking for a way to avoid having multiple copies of the same dataset in different locations and to minimize the number of downloads (ideally a single one).
I’ve been experimenting since my initial question and I’ve come up with this workflow:
- Download the dataset with snapshot_download() or git.
- One person loads the dataset with load_dataset() and exports it with save_to_disk().
- Other people who want to use the dataset make a local copy of the export, do their work, and remove everything afterwards (sketched below).
Does that make sense?
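Concretely, I picture steps 2 and 3 like this (paths are placeholders):

from datasets import load_dataset, load_from_disk

# Step 2: one person builds the cache once and exports a self-contained copy.
ds = load_dataset("/nfs/datasets/HuggingFaceFW/fineweb-edu/data", num_proc=12)
ds.save_to_disk("/nfs/shared/fineweb-edu-arrow")

# Step 3: everyone else works from the exported copy (or a local copy of it),
# without needing the original parquet files or the builder cache.
ds = load_from_disk("/nfs/shared/fineweb-edu-arrow")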
One thing I don’t understand is that load_dataset() and save_to_disk() both save the dataset in Arrow format. However, as far as I understand, they do not perform the same optimizations, and therefore load_from_disk() cannot load the dataset directly from the cache. Is there any particular reason for that?
Best,
Julien
Regarding load_dataset() and save_to_disk(): even though both are saved in the same Arrow format, they differ in purpose. save_to_disk() is meant for long-term storage, while the load_dataset() cache is an internal optimization for speed. Some people do seem to try to reuse the cache files directly anyway.
Additionally, while I am unsure how robust it is in multi-process or multi-user environments, one possible way to avoid re-downloading is to simply set HF_HOME
to a reasonably fast shared folder on the network. For remote environments, services like S3 seem to be supported out of the box, although that does come at a cost…
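As a minimal sketch of the shared-HF_HOME idea (the path is just an illustration; as far as I know, HF_HOME has to be set before datasets is imported):

import os

# Point every user's downloads and cache at the same shared folder.
os.environ["HF_HOME"] = "/shared/fast-storage/hf_home"

from datasets import load_dataset

# Later load_dataset() calls then reuse the shared download/cache
# instead of re-downloading and re-preparing the dataset for every user.
ds = load_dataset("HuggingFaceFW/fineweb-edu", num_proc=12)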
By the way, for enterprise use cases there is also the option of consulting Hugging Face’s dedicated support. Whether that is suitable will depend on the scale of the project.