Pubmed dataset size issue

Hello everyone

I’m trying to use the PubMed dataset from the Hugging Face Hub for some information retrieval experiments.
However, the dataset appears to be huge: hundreds of GB. On top of that, the extraction step often crashes with the error “EOFError: Compressed file ended before the end-of-stream marker was reached”.

The dataset card says “There are no splits in this dataset. It is given as is.”, but I’d still like to ask: is there a way to download only part of it, or is a smaller version available for testing?

Thanks everyone.


Hi! You can create a smaller version by downloading the dataset’s loading script, narrowing the range of URLs it defines, and then loading the dataset from your local copy with load_dataset("path/to/script").
