Pubmed dataset size issue

Hello everyone

I'm trying to use the Pubmed dataset from the HF repository to do some Information Retrieval.
However, this dataset appears to be huge: hundreds of GBs. Moreover, the extraction process often crashes with the error "EOFError: Compressed file ended before the end-of-stream marker was reached".

The dataset card says "There are no splits in this dataset. It is given as is.", but I'll ask anyway: is there a way to download only part of it, or is a smaller version available for testing?

Thanks everyone.


Hi! You can create a smaller version by downloading the dataset's loading script, narrowing the URL range in it, and then loading the dataset with load_dataset("path/to/script").
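As a rough illustration of the "modify the URL range" step, here is a minimal sketch. It assumes the loading script builds its list of archive URLs with a `range(start, stop)` call; the sample line, the `example.org` URL, and the helper name `limit_url_range` are all hypothetical, so check the real script for what it actually contains before relying on this.

```python
import re

# Hypothetical stand-in for the line in the loading script that builds the
# list of archive URLs. The real script's variable name, URL, and range
# bounds may differ.
SAMPLE_SCRIPT = (
    '_URLs = [f"https://example.org/baseline/file{i:04d}.xml.gz" '
    "for i in range(1, 1000)]\n"
)


def limit_url_range(script_text: str, max_files: int) -> str:
    """Shrink the first range(start, stop) call to at most max_files values."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")

    def shrink(match: re.Match) -> str:
        start, stop = int(match.group(1)), int(match.group(2))
        # Never widen the range, only narrow it.
        new_stop = min(stop, start + max_files)
        return f"range({start}, {new_stop})"

    new_text, count = re.subn(
        r"range\(\s*(\d+)\s*,\s*(\d+)\s*\)", shrink, script_text, count=1
    )
    if count == 0:
        raise ValueError("no range(start, stop) call found in the script")
    return new_text


# Keep only the first 5 archives instead of the full list.
small_script = limit_url_range(SAMPLE_SCRIPT, max_files=5)
print(small_script)
```

You could of course make the same change by hand in a text editor. Either way, save the modified script locally and pass its path to load_dataset as described above. Depending on your `datasets` version, loading a local script may need extra options or may not be supported, so check the documentation for the version you have installed.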
