My dataset loading script is not working

Hello, I am trying to get my first dataset working. Here is the link. I can’t get the loading script to work, though. Any help would be awesome.

My csv file head:

Class,BBox,file_name
acute_diverticulitis,122-137-176-182,1188233_Seri5_30423865057.png
healthy,,1187990_Seri4_30211303061.png
healthy,,1188237_Seri2_30425664485.png

I think this part is the problem in the script:

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        data_dir = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "df_path": dl_manager.iter_files([data_dir["train_df"]]),
                    "split": "train",
                    "images_path": dl_manager.iter_archive([data_dir["train"]]),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "df_path": dl_manager.iter_files([data_dir["test_df"]]),
                    "split": "test",
                    "images_path": dl_manager.iter_archive([data_dir["test"]]),
                },
            ),
        ]

    def _generate_examples(self, df_path, split, images_path):
        dataframe = pd.read_csv(df_path, delimiter=",")

        for id_, row in dataframe.iterrows():
            yield id_, {
                "image": os.path.join(images_path, row["image"]),
                "bounding-box": row["BBox"],
                "class": row["Class"],
            }

Hello! I think the issue is with data_dir = dl_manager.download_and_extract(_URLS). The error on your dataset page (osbm/abdominal_mri_images · Datasets at Hugging Face) says:

NotImplementedError: Extraction protocol for TAR archives like 'https://huggingface.co/datasets/osbm/abdominal_mri_images/blob/main/train.tar.gz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.

I haven’t worked with TAR archives in streaming mode before, but this issue might be helpful for you: Trouble with streaming frgfm/imagenette vision dataset with TAR archive · Issue #4697 · huggingface/datasets · GitHub
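
To make the error message's suggestion concrete, here is a minimal sketch of the pattern it points to, not the actual fix from any PR. In the script, this would likely mean calling `dl_manager.download(_URLS)` instead of `download_and_extract`, and passing a single archive path (not a list) to `dl_manager.iter_archive`, which yields `(path_in_archive, file_object)` pairs. The CSV is then matched to archive members by its `file_name` column (the script above reads `row["image"]`, which is not a column in the CSV head shown). The names `iter_tar` and `generate_examples` are hypothetical; `iter_tar` is only a standard-library stand-in for `iter_archive` so the sketch runs on its own.

```python
import csv
import os
import tarfile


def iter_tar(archive_path):
    """Hypothetical stand-in for dl_manager.iter_archive.

    Yields (path_in_archive, file_object) pairs without extracting to disk.
    """
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(df_path, images):
    """Join CSV metadata with archive members by file name.

    df_path: local path to a CSV with Class,BBox,file_name columns.
    images:  iterable of (path, file_object) pairs, e.g. from iter_tar.
    """
    # Index the metadata rows by file name for constant-time lookup.
    with open(df_path, newline="", encoding="utf-8") as f:
        metadata = {row["file_name"]: row for row in csv.DictReader(f)}

    for id_, (path, file_obj) in enumerate(images):
        row = metadata.get(os.path.basename(path))
        if row is None:
            continue  # archive member without a CSV row (assumption: skip it)
        yield id_, {
            "image": {"path": path, "bytes": file_obj.read()},
            "bounding-box": row["BBox"],
            "class": row["Class"],
        }
```

In a real loading script, `images` would be `dl_manager.iter_archive(...)` applied to the downloaded archive path, and the `{"path": ..., "bytes": ...}` dict is the form a `datasets.Image()` feature accepts. Also note that the URL in the error contains `/blob/main/`, which may serve the web page rather than the raw file; `/resolve/main/` might be what is needed there.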

Hi! I’ve just opened a PR to make your dataset streamable (needed for the preview): osbm/abdominal_mri_images · Make dataset streamable. Let us know if you have any additional questions or issues.

Thank you so much. This is awesome.