Pubmed dataset size issue

Neuroinformatica · March 13, 2023, 8:48am

Hello everyone

I’m trying to use the Pubmed dataset from the HF repository for some information retrieval work.
However, the dataset appears to be huge: hundreds of GBs. Moreover, the extraction step often crashes with the error “EOFError: Compressed file ended before the end-of-stream marker was reached”.

The dataset card says “There are no splits in this dataset. It is given as is.”, but I’d still like to ask whether there is a way to download only part of it, or whether a smaller version is available for testing.

Thanks everyone.

mariosasko · March 15, 2023, 5:01pm

Hi! You can create a smaller version by downloading the dataset's loading script, narrowing the URL range in it, and then loading the dataset from your local copy with load_dataset("path/to/script").
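To make the idea concrete, here is a rough sketch of what "narrowing the URL range" could look like. The URL template, the helper names build_urls and load_subset, and the range 1 to 5 are all made up for illustration; the real script defines its own URL list, so copy the actual template and variable name from the file you downloaded rather than from here.

```python
from typing import List

# Hypothetical template: replace it with the one found in the downloaded script.
URL_TEMPLATE = "https://example.org/pubmed/baseline/file{index:04d}.xml.gz"


def build_urls(first: int, last: int, template: str = URL_TEMPLATE) -> List[str]:
    """Return the archive URLs numbered first..last (inclusive)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return [template.format(index=i) for i in range(first, last + 1)]


def load_subset(script_path: str):
    """Load the dataset from the locally edited script (hypothetical helper)."""
    # Imported lazily so the URL helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(script_path)


# Keep only the first five archives instead of the full range.
small_urls = build_urls(1, 5)
print(len(small_urls), small_urls[0])

# After pasting the reduced list into your copy of the script:
# dataset = load_subset("path/to/script")
```

Fewer archives also means fewer large downloads that can be cut off midway, which is one common cause of the EOFError above. How loading scripts are handled has changed across releases of the datasets library, so the load_dataset call may need extra arguments or an older version.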
