Using config_kwargs with load_dataset

I'm trying to dynamically load training datasets from S3. The datasets are JSON files stored in folders within an S3 bucket.

In my main training script, I have this:

train_ds, dev_ds, test_ds = load_dataset(path='.../datasets/hf_datasets/custom.py', split=["train", "validation", "test"])

Within my custom.py script, I need to replace the {SubfolderDatasetName} placeholder with the name of my dataset. How can I pass that dataset name when I call load_dataset in my main training script? Looking at the documentation, **config_kwargs seems like it might be the answer, but I haven’t been able to find any examples of how to use it.

import datasets


class Custom(datasets.GeneratorBasedBuilder):
    # {SubfolderDatasetName} is the placeholder I want to fill in at load time
    _URL = "https://test1234.s3.amazonaws.com/{SubfolderDatasetName}/"
    _URLS = {
        "train": _URL + "train.json",
        "dev": _URL + "dev.json",
        "test": _URL + "test.json",
    }

    def _split_generators(self, dl_manager):

        urls_to_download = self._URLS
        print(self._info)
        # Returns a dict with the same keys as _URLS, mapping to local file paths
        downloaded_files = dl_manager.download_and_extract(urls_to_download)
        print(downloaded_files)

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": downloaded_files["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={"filepath": downloaded_files["dev"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": downloaded_files["test"]},
            ),
        ]

I found a sub-optimal way to pass a parameter into the hf_datasets/custom.py script.

config_kwargs = {"description": data_args.dataset_name}
train_ds, dev_ds, test_ds = load_dataset(path='/Users/john.leite/miniforge3/envs/moon/lib/python3.10/site-packages/paddlenlp/datasets/hf_datasets/custom.py', split=["train", "validation", "test"], **config_kwargs)

For some reason, I can only pass name, version, or description in config_kwargs without receiving an error:

TypeError: BuilderConfig.__init__() got an unexpected keyword argument
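
The error seems to come from the config class rather than from load_dataset itself: the extra keyword arguments are forwarded to the config's __init__, and the default BuilderConfig only declares a handful of fields. Here is a minimal sketch that reproduces the behaviour with a stand-in dataclass, so it runs without the library installed. StandInConfig is hypothetical and reduced to three fields; it is not the real datasets.BuilderConfig.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInConfig:
    """Hypothetical stand-in for datasets.BuilderConfig (assumption: 3 fields only)."""
    name: str = "default"
    version: Optional[str] = None
    description: Optional[str] = None


# A declared field is accepted by the generated __init__.
ok = StandInConfig(description="docvqa_En")

# An undeclared keyword is rejected with a TypeError.
try:
    StandInConfig(dataset_name="docvqa_En")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

This is why smuggling the value through description works, while a new key such as dataset_name does not.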

Within my custom.py, this comes through as part of self.config_id, so I can split the string to get the dataset name, which I use to select the subfolder.

    def _split_generators(self, dl_manager):

        print(self.config_id)
        DSName = self.config_id.split("=")[1]

Not pretty and hopefully I can figure out how to do this better in the future.
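
If you stay with this workaround, a slightly more defensive parse than split("=")[1] might look like the sketch below. It assumes the config_id has the shape "name-key=value,key=value" (which is what the split above implies), that the config name itself contains no hyphen, and that values may be URL-quoted. kwargs_from_config_id is a hypothetical helper, not part of the datasets library, and the format is an assumption that may differ between versions.

```python
from urllib.parse import unquote_plus


def kwargs_from_config_id(config_id: str) -> dict:
    """Parse 'name-key1=value1,key2=value2' into {'key1': 'value1', ...}.

    Assumes the config name before the first '-' contains no hyphen.
    """
    _, _, suffix = config_id.partition("-")
    pairs = {}
    for item in suffix.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments without '=' (for example a hashed suffix)
            pairs[key] = unquote_plus(value)
    return pairs


parsed = kwargs_from_config_id("default-description=docvqa_En")
print(parsed["description"])
```

It is probably simpler to read self.config.description directly inside the builder, since the value was passed in as the config's description field.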

For anyone who comes across this in the future, here’s the GitHub issue that discusses this: "default config name doesn't work when config kwargs are specified" (Issue #6130, huggingface/datasets).

Passing your kwargs to load_dataset (the script path is still required as the first argument):

ds = datasets.load_dataset('.../datasets/hf_datasets/custom.py', custom_keyword1=0, custom_keyword2=1)

Accessing the kwargs passed from load_dataset. They end up on the config object, so inside the builder you read them through self.config:

class CustomConfig(datasets.BuilderConfig):
    def __init__(self, **kwargs):
        # Pop the custom keys first so BuilderConfig.__init__() never sees them
        self.custom_keyword1 = kwargs.pop("custom_keyword1", <your-default-value>)
        self.custom_keyword2 = kwargs.pop("custom_keyword2", <your-default-value>)
        super().__init__(**kwargs)


class CustomDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        CustomConfig(name="custom_config", version="1.0.0", description="your description"),
        # ... more configs
    ]    # Configs initialization
    BUILDER_CONFIG_CLASS = CustomConfig    # Must specify this to use custom config

    def _info(self):
        # The kwargs are available on self.config in any instance method
        print(self.config.custom_keyword1, self.config.custom_keyword2)
        ...

    def _split_generators(self, dl_manager):
        print(self.config.custom_keyword1, self.config.custom_keyword2)
        ...

    def _generate_examples(self, filepaths):
        print(self.config.custom_keyword1, self.config.custom_keyword2)
        ...
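
To tie this back to the original question, here is a rough sketch of how the same pop-then-super pattern could fill in the {SubfolderDatasetName} placeholder. It uses a stand-in base class so it runs without the library: StandInConfig is hypothetical, and with the real library you would subclass datasets.BuilderConfig instead. The subfolder_dataset_name keyword, its default value, and the build_urls helper are names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInConfig:
    """Hypothetical stand-in for datasets.BuilderConfig."""
    name: str = "default"
    version: Optional[str] = None
    description: Optional[str] = None


class CustomConfig(StandInConfig):
    def __init__(self, **kwargs):
        # Remove the custom key before the base __init__ sees it.
        # "docvqa_En" is only an illustrative default.
        self.subfolder_dataset_name = kwargs.pop("subfolder_dataset_name", "docvqa_En")
        super().__init__(**kwargs)


_URL = "https://test1234.s3.amazonaws.com/{SubfolderDatasetName}/"


def build_urls(config: CustomConfig) -> dict:
    """Fill the placeholder from the config and build one URL per split file."""
    base = _URL.format(SubfolderDatasetName=config.subfolder_dataset_name)
    return {split: base + f"{split}.json" for split in ("train", "dev")}


cfg = CustomConfig(name="custom_config", subfolder_dataset_name="my_dataset")
urls = build_urls(cfg)
print(urls["train"])
```

In a real builder, the body of build_urls would live in _split_generators and read self.config.subfolder_dataset_name, and the call site would be load_dataset(path, subfolder_dataset_name=...).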