Hi,
I would like to create a script for a dataset that has (1) a small, publicly available version and (2) a large, privately available version. By default, the publicly available version would be downloaded.
If the user has access to the private dataset, the user would be able to provide a custom dataset path to reuse the script with that version.
I am unsure how to implement this correctly and am looking for guidance. So far, I have tried:
(1) Letting the user specify the custom data path via an environment variable. This does not work because the first processed version of the dataset gets cached, so the environment variable is not read on later loads.
(2) Letting the user specify the custom data path via the data_dir parameter of the datasets.load_dataset() function. This partly works, but it breaks when the script is used for both versions of the dataset, i.e. first loading the public dataset, then the private one. The script throws a datasets.utils.info_utils.ExpectedMoreDownloadedFiles exception. To avoid this, I have to use datasets.load_dataset(…, ignore_verifications=True), which seems very hacky and potentially dangerous.
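
To make approach (2) concrete, here is a minimal sketch of the path-selection logic only. It is not my actual script: resolve_data_source and PUBLIC_URL are hypothetical names, and the URL is a placeholder rather than the real download location. The idea is that the script's _split_generators would pass in self.config.data_dir and only call dl_manager.download_and_extract() in the "remote" case.

```python
import os
from typing import Optional, Tuple

# Hypothetical placeholder, NOT the real location of the public version.
PUBLIC_URL = "https://example.com/public_dataset.zip"


def resolve_data_source(
    data_dir: Optional[str], public_url: str = PUBLIC_URL
) -> Tuple[str, str]:
    """Decide where the data should come from.

    Returns ("remote", url) when no custom directory is given (public version),
    or ("local", absolute_path) when the user supplied a directory (private version).
    """
    if data_dir is None:
        # Default case: fall back to the publicly available version.
        return ("remote", public_url)

    path = os.path.abspath(os.path.expanduser(data_dir))
    if not os.path.isdir(path):
        # Fail early with a clear message instead of a confusing error later on.
        raise FileNotFoundError(
            f"data_dir does not exist or is not a directory: {path}"
        )
    return ("local", path)
```

Calling resolve_data_source(None) selects the public download, while resolve_data_source("/path/to/private") selects the local copy. This sketch does not address the caching or verification problem described above; it only shows the branching I am trying to get working.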
P.S.: the concrete dataset I’m trying to implement is https://huggingface.co/datasets/cjvt/cc_gigafida.