Dataset Generator local files path

Hey,

I’ve got a dataset loading script inheriting from the GeneratorBasedBuilder.
I want to load local preprocessed data from different folders. Since the dataset loading script is cached, I have to hard-code the full absolute path in the script, which is obviously not a good solution.
When I try to get the current path from inside the script, I get the path of the cached copy of the script instead, so loading the data fails.

I also tried using data_dir and other options, but none worked. How can I get the current directory in the script?

import os
from typing import List

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    ...
    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        # currentpath = os.path.abspath(os.getcwd()) #TODO resolve path auto
        # also tried os.path.abspath(__file__)
        currentpath = "/my/absolute/path/"

        generator = []
        file_train = os.path.join(currentpath, self.config.name, "train.csv")
        file_test = os.path.join(currentpath, self.config.name, "test.csv")
        file_eval = os.path.join(currentpath, self.config.name, "valid.csv")

        if os.path.isfile(file_train):
            train = datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": file_train,
                    "split": "train",
                },
            )
            generator.append(train)

Thanks!

You can pass a relative path to the dl_manager, so you don't need an absolute path in the script.

e.g.

import os
from typing import List

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    ...
    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        generator = []
        # The download manager resolves these relative paths itself,
        # so no hard-coded absolute path is needed.
        file_train = dl_manager.download(os.path.join(self.config.name, "train.csv"))
        file_test = dl_manager.download(os.path.join(self.config.name, "test.csv"))
        file_eval = dl_manager.download(os.path.join(self.config.name, "valid.csv"))

        train = datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "filepath": file_train,
                "split": "train",
            },
        )
        generator.append(train)
        # Add the test and validation splits the same way, then return the list.
        return generator
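
If it helps to see the idea in isolation, here is a minimal standalone sketch in plain Python. It is not the library's own code: `resolve_split_files`, `SPLIT_FILES` and `base_dir` are hypothetical names I made up for illustration. It joins the relative per-config file names onto a base directory and keeps only the splits whose files actually exist, like the `os.path.isfile` check in the question.

```python
import os
import tempfile
from typing import Dict

# Hypothetical mapping of split names to file names (taken from the question's layout).
SPLIT_FILES = {"train": "train.csv", "test": "test.csv", "validation": "valid.csv"}


def resolve_split_files(base_dir: str, config_name: str) -> Dict[str, str]:
    """Build base_dir/config_name/<file> for each split and keep only existing files."""
    found = {}
    for split, filename in SPLIT_FILES.items():
        path = os.path.join(base_dir, config_name, filename)
        if os.path.isfile(path):
            found[split] = path
    return found


def demo() -> Dict[str, str]:
    """Create a throwaway folder with two of the three split files and resolve them."""
    with tempfile.TemporaryDirectory() as base_dir:
        config_dir = os.path.join(base_dir, "my_config")
        os.makedirs(config_dir)
        for filename in ("train.csv", "test.csv"):
            with open(os.path.join(config_dir, filename), "w") as f:
                f.write("text,label\n")
        found = resolve_split_files(base_dir, "my_config")
        # Return only the file names, since the temporary directory is removed on exit.
        return {split: os.path.basename(path) for split, path in found.items()}


if __name__ == "__main__":
    print(demo())
```

The point is that the script only ever deals with paths relative to some base directory, and whoever calls it decides what that base directory is.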