Pubmed dataset size issue

Hello everyone

I’m trying to use the Pubmed dataset from the HF repository to perform some information retrieval.
However, the dataset appears to be huge: hundreds of GBs. Moreover, the extraction process often crashes with the error “EOFError: Compressed file ended before the end-of-stream marker was reached”.

The dataset card says “There are no splits in this dataset. It is given as is.”, but I’d still like to ask: is there a way to download only part of it, or is a smaller version available for testing?

Thanks everyone.

Hi! You can create a smaller version by downloading the dataset’s loading script, narrowing the URL range in it so that it covers only a few files, and then loading the dataset with load_dataset("path/to/script").
