My dataset loading script is not working

osbm · September 11, 2022, 4:50pm

Hello, I am trying to get my first dataset loading script working. Here is the link. I can’t get it to load, so any help would be appreciated.

My csv file head:

Class,BBox,file_name
acute_diverticulitis,122-137-176-182,1188233_Seri5_30423865057.png
healthy,,1187990_Seri4_30211303061.png
healthy,,1188237_Seri2_30425664485.png

I think this part is the problem in the script:

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        data_dir = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "df_path": dl_manager.iter_files([data_dir["train_df"]]),
                    "split": "train",
                    "images_path": dl_manager.iter_archive([data_dir["train"]]),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "df_path": dl_manager.iter_files([data_dir["test_df"]]),
                    "split": "test",
                    "images_path": dl_manager.iter_archive([data_dir["test"]]),
                },
            ),
        ]

    def _generate_examples(self, df_path, split, images_path):
        dataframe = pd.read_csv(df_path, delimiter=",")

        for id_, row in dataframe.iterrows():
            yield id_, {
                "image": os.path.join(images_path, row["image"]),
                "bounding-box": row["BBox"],
                "class": row["Class"],
            }
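One thing worth checking, as a minimal sketch based only on the CSV head shown above: the header names the image column file_name, whereas _generate_examples looks up row["image"]. The snippet below parses the sample rows with Python's csv module to make the mismatch visible. The variable names are illustrative and not part of the original script.

```python
import csv
import io

# The CSV head from the post, copied verbatim.
CSV_HEAD = """Class,BBox,file_name
acute_diverticulitis,122-137-176-182,1188233_Seri5_30423865057.png
healthy,,1187990_Seri4_30211303061.png
healthy,,1188237_Seri2_30425664485.png
"""

rows = list(csv.DictReader(io.StringIO(CSV_HEAD)))
columns = list(rows[0].keys())

# "image" is not among the columns, so row["image"] would raise a KeyError;
# the file names live under "file_name" instead.
has_image_column = "image" in columns
file_names = [row["file_name"] for row in rows]

print(columns)
print(has_image_column)
```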

NimaBoscarino · September 12, 2022, 9:17am

Hello! I think the issue is with data_dir = dl_manager.download_and_extract(_URLS). The error on your dataset page (osbm/abdominal_mri_images · Datasets at Hugging Face) says:

NotImplementedError: Extraction protocol for TAR archives like 'https://huggingface.co/datasets/osbm/abdominal_mri_images/blob/main/train.tar.gz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.

I haven’t worked with this particular kind of setup before, but this GitHub issue might be helpful for you: "Trouble with streaming frgfm/imagenette vision dataset with TAR archive" (Issue #4697, huggingface/datasets).
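The pattern the error message points to can be sketched roughly as follows. In a loading script, dl_manager.iter_archive yields (path_in_archive, file_object) pairs without extracting the archive, so examples are built by matching each member against the CSV metadata. The sketch below imitates that iteration with the standard tarfile module so that it runs on its own. iter_tar_members and generate_examples are hypothetical helper names, the field names simply mirror the script in the question, and the {"path", "bytes"} dictionary is assumed to be the form the image feature accepts.

```python
import csv
import io
import os
import tarfile


def iter_tar_members(fileobj):
    """Yield (path_in_archive, file_object) pairs by streaming through a TAR
    archive. Hypothetical stand-in with the same shape as the pairs that
    dl_manager.iter_archive yields inside a loading script."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(csv_text, archive_iter):
    """Match archive members to CSV rows by file name and yield examples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    metadata = {row["file_name"]: row for row in reader}
    for id_, (path, file_obj) in enumerate(archive_iter):
        row = metadata.get(os.path.basename(path))
        if row is None:
            continue  # archive member without a metadata row
        yield id_, {
            # Read the bytes now: a streamed member is only readable
            # until the iterator advances to the next one.
            "image": {"path": path, "bytes": file_obj.read()},
            "bounding-box": row["BBox"],
            "class": row["Class"],
        }
```

In an actual script, the archive iterator would come from the download manager rather than from iter_tar_members, and the CSV would be read from the downloaded metadata file instead of a string.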

mariosasko · September 15, 2022, 12:42pm

Hi! I’ve just opened a PR to make your dataset streamable (needed for the preview): osbm/abdominal_mri_images · Make dataset streamable. Let us know if you have any additional questions or issues.

osbm · September 15, 2022, 12:53pm

Thank you so much. This is awesome.
