Providing a custom data path to reuse dataset script

Hi,

I would like to create a script for a dataset that has (1) a small publicly available version and (2) a large privately available version. By default, the publicly available dataset would be downloaded.
If the user has access to the private dataset, the user would be able to provide a custom dataset path to reuse the script with that version.

I am unsure how to implement this correctly and am looking for guidance. So far, I have tried:

(1) Enabling the user to specify the custom data path via an environment variable. That does not work because the initially processed version of the dataset gets cached, so the environment variable is not read again.

(2) Enabling the user to specify the custom data path via the data_dir parameter of the datasets.load_dataset() function. This partly works, but breaks when the script is used for both versions of the dataset, i.e. first loading the public dataset, then the private one. The script throws a datasets.utils.info_utils.ExpectedMoreDownloadedFiles exception. To avoid this, I have to use datasets.load_dataset(…, ignore_verifications=True), which seems very hacky and potentially dangerous.

P.S.: the concrete dataset I’m trying to implement is https://huggingface.co/datasets/cjvt/cc_gigafida.

Hi!

You can address this by creating two configs: one for the public version and another for the private version, and making the latter work with the “manual download” functionality:

  @property
  def manual_download_instructions(self):
      # Only the private config requires a manual download.
      if self.config.name == "<private dataset name>":
          return "<download instructions for the private version>"
      return None

  def _split_generators(self, dl_manager):
      if self.config.name == "<public dataset name>":
          # download with dl_manager
          ...
      else:  # private version (no download)
          data_dir = dl_manager.manual_dir  # equal to `data_dir` in `load_dataset`
      ...
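
To make the branching on the config name concrete, here is a minimal, library-independent sketch of the same decision. The helper `resolve_data_dir`, the config names "public" and "private", and the `download` callback are hypothetical names made up for illustration; in a real script this logic would sit inside `_split_generators`, with `manual_dir` coming from `dl_manager.manual_dir`.

```python
import os
from typing import Callable, Optional

PUBLIC_CONFIG = "public"    # hypothetical config name
PRIVATE_CONFIG = "private"  # hypothetical config name


def resolve_data_dir(config_name: str,
                     manual_dir: Optional[str],
                     download: Callable[[], str]) -> str:
    """Return the directory the split generators should read from."""
    if config_name == PUBLIC_CONFIG:
        # Public version: fetch the data (in a real script, via dl_manager).
        return download()
    if config_name == PRIVATE_CONFIG:
        # Private version: nothing is downloaded, so the caller must
        # supply an existing directory (the `data_dir` argument).
        if manual_dir is None or not os.path.isdir(manual_dir):
            raise FileNotFoundError(
                "The private config needs an existing directory passed as data_dir."
            )
        return manual_dir
    raise ValueError(f"Unknown config name: {config_name!r}")
```

With two configs set up this way, a user with access to the private data would presumably load it with something like datasets.load_dataset("cjvt/cc_gigafida", "<private dataset name>", data_dir="<path to the private data>"), while the public config would need no data_dir at all.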

Feel free to ping me directly on the Hub (via a discussion) if you need help with the implementation.