Load dataset from local files with existing builder (pubmed dataset)

I have the 2019 PubMed corpus already downloaded locally (the .xml.gz files, i.e. the same kind of files that the Hugging Face pubmed script downloads), and I would like to use the existing PubMed builder so that it extracts and formats the data correctly (I presume the 2023 builder is compatible with the 2019 data). What would be the correct way to do this?

Could you extend the builder class and override _split_generators to skip the download from URLs and just use the folder containing the files?

def _split_generators(self, dl_manager):
    # Use the local data instead of downloading from the URLs.
    dl_dir = "replace with your downloaded folder"
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filenames": dl_dir},
        ),
    ]
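If the builder's _generate_examples loops over a list of file paths rather than taking a folder name (I have not checked that here, so treat it as an assumption), you could turn the folder into a sorted list first. A minimal sketch; list_local_archives and the "*.xml.gz" pattern are made-up names for illustration, not part of the pubmed script:

```python
from pathlib import Path


def list_local_archives(folder, pattern="*.xml.gz"):
    """Return the sorted paths of all regular files in `folder` matching `pattern`.

    Hypothetical helper: the name and the default pattern are assumptions.
    """
    return sorted(str(p) for p in Path(folder).glob(pattern) if p.is_file())
```

The result could then be passed along in place of the folder string, for example as the value of "filenames".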

Yes, I actually tried the following and it worked (the downloaded files were still gzipped, so I still needed to use the dl_manager extract function).

I replaced line 42:

_URLs = [f"https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed23n{i:04d}.xml.gz" for i in range(1, 1167)]

And line 318:

     dl_dir = dl_manager.download_and_extract(_URLs)

with

_LOCAL_FILES = [f"/local/path/to/pubmed/pubmed19n{i:04d}.xml.gz" for i in range(1, 972)]

...

     dl_dir = dl_manager.extract(_LOCAL_FILES)
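For anyone doing the same: since the file list is generated from a range, it may be worth checking that every path exists before handing the list to extract. A rough sketch using only the standard library; check_local_files is a hypothetical helper of mine, not something in the pubmed script:

```python
from pathlib import Path


def check_local_files(paths):
    """Return the paths as a list, or raise if any of them is not an existing file.

    Hypothetical helper, intended to run before dl_manager.extract(...).
    """
    missing = [p for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} file(s) missing, first: {missing[0]}"
        )
    return list(paths)
```

It would be called as _LOCAL_FILES = check_local_files(_LOCAL_FILES) right after the list is built.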

Thanks!
