Let’s say I want to upload a private dataset to the Hugging Face Hub, under patrickvonplaten/dataset_new/. The dataset contains both files and a dataset_new.py loading script.
So I want to load the dataset with:
load_dataset("patrickvonplaten/dataset_new", use_auth_token=True)
The dataset repo will contain some metadata files that are needed to split the actual data of the dataset, which is downloaded from an external link. Let’s call one such metadata file patrickvonplaten/dataset_new/splits.txt
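To make the role of splits.txt concrete, here is a minimal sketch of how such a file could drive the split. The file format (one "<split> <relative_file>" pair per line) and the helper names read_splits and files_for_split are assumptions made up for illustration, not something the real dataset defines:

```python
from collections import defaultdict
from pathlib import Path


def read_splits(splits_path):
    """Parse a hypothetical splits.txt with one "<split> <relative_file>" pair per line."""
    splits = defaultdict(list)
    for line in Path(splits_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        split_name, relative_file = line.split(maxsplit=1)
        splits[split_name].append(relative_file)
    return dict(splits)


def files_for_split(archive_path, splits, split_name):
    """Resolve the files of one split inside the extracted archive directory."""
    return [Path(archive_path) / name for name in splits.get(split_name, [])]
```

In the loading script, archive_path would be the directory of the extracted external data, and splits_path would be the local path of splits.txt once it has been fetched.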
This means that in the dataset script dataset_new.py, I first load & extract the data from the external link:
archive_path = dl_manager.download_and_extract("<external_link>")
Now I also need the splits.txt file - how do I load it?
I can’t just do:
splits_path = dl_manager.download("https://huggingface.co/datasets/patrickvonplaten/dataset_new/raw/main/splits.txt")
since it’s a private dataset, and hard-coding the full URL is probably not the cleanest way of loading data anyway.
Any ideas on what should be used here? @lhoestq @mariosasko @albertvillanova