Providing a custom data path to reuse dataset script

matejklemen · December 21, 2022, 11:45am

Hi,

I would like to create a script for a dataset that has (1) a small publicly available version and (2) a large privately available version - by default, the publicly available dataset would be downloaded.
If the user has access to the private dataset, the user would be able to provide a custom dataset path to reuse the script with that version.

I am unsure how to implement this correctly and am looking for guidance. So far, I have tried:

(1) Letting the user specify the custom data path via an environment variable. That does not work because the initially processed version of the dataset gets cached, so the environment variable is not read on later loads.

(2) Letting the user specify the custom data path via the data_dir parameter of datasets.load_dataset(). This partly works, but fails when the script is used for both versions of the dataset, i.e. first loading the public dataset, then the private one. The script throws a datasets.utils.info_utils.ExpectedMoreDownloadedFiles exception. To avoid this, I have to use datasets.load_dataset(…, ignore_verifications=True), which seems very hacky and potentially dangerous.

P.S.: the concrete dataset I’m trying to implement is https://huggingface.co/datasets/cjvt/cc_gigafida.

mariosasko · December 21, 2022, 4:02pm

Hi!

You can address this by creating two configs: one for the public version and another for the private version, and making the latter work with the “manual download” functionality:

  @property
  def manual_download_instructions(self):
      # only the private config needs a manual download
      if self.config.name == "<private dataset name>":
          return "<download instructions for the private version>"
      return None

  def _split_generators(self, dl_manager):
      if self.config.name == "<public dataset name>":
          ...  # download with dl_manager
      else:  # private version (no download)
          data_dir = dl_manager.manual_dir  # equal to `data_dir` in `load_dataset`
      ...
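
To make the branching concrete, here is a minimal sketch of the same idea as a standalone helper. It is only an illustration under assumptions: the config names "public" and "private", the example.com URL, and the helper name resolve_data_dir are hypothetical, and a small stand-in object replaces the real download manager so the snippet runs without the datasets library installed.

```python
import os
from types import SimpleNamespace

# Hypothetical config names and URL; the real ones depend on the dataset.
PUBLIC_CONFIG = "public"
PRIVATE_CONFIG = "private"
PUBLIC_URL = "https://example.com/public_data.zip"


def resolve_data_dir(config_name, dl_manager):
    """Return the directory holding the raw data for the given config."""
    if config_name == PUBLIC_CONFIG:
        # Public version: fetched automatically by the download manager.
        return dl_manager.download_and_extract(PUBLIC_URL)
    # Private version: nothing is downloaded, the user supplies `data_dir`,
    # which the download manager exposes as `manual_dir`.
    manual_dir = dl_manager.manual_dir
    if manual_dir is None or not os.path.isdir(manual_dir):
        raise FileNotFoundError(
            f"Config '{config_name}' needs a local copy of the data; "
            "pass its folder via `data_dir`."
        )
    return manual_dir


# Stand-in for the real download manager, so the sketch runs on its own.
fake_manager = SimpleNamespace(
    manual_dir=os.getcwd(),
    download_and_extract=lambda url: "/tmp/extracted/" + os.path.basename(url),
)

print(resolve_data_dir(PRIVATE_CONFIG, fake_manager))
print(resolve_data_dir(PUBLIC_CONFIG, fake_manager))
```

In an actual loading script, _split_generators would call such a helper with self.config.name and the dl_manager it receives, and the user would select the private config by name while passing data_dir to load_dataset.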

Feel free to ping me directly on the Hub (via a discussion) if you need help with the implementation.
