Why does load_dataset('wiki_snippets', name='wiki40b_en_100_0') take 3 hours when it only generates 12 GB of data?

Hello everyone,
I am new here. I am trying to build a long-form question answering system and decided to follow Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb to get an idea of where to begin. I had already loaded wiki40b_en, and decided to load the wiki_snippets version to play around with before choosing what to finally use.
wiki40b_en took 4 min to generate 9.42 GB of data, but the wiki_snippets version is estimated to take 3 hours for some reason.
I would like to know the cause of this. Could it be that the snippets are actually computed by load_dataset from wiki40b_en, and that the processing from wiki40b_en into wiki_snippets is what takes the majority of the 3 hours?

Hi ! You can check the code that loads this dataset here: wiki_snippets.py · wiki_snippets at main

There seems to be some data manipulation/processing involved that can explain why it’s slower than other datasets that are already well-formatted.
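To illustrate why this is slow: the dataset name wiki40b_en_100_0 suggests each article is re-split into fixed-size word passages, which means tokenizing and rewriting the entire corpus rather than just copying pre-formatted rows. Below is a minimal sketch of that kind of per-article passage chunking; the function name and parameters are hypothetical, and the real wiki_snippets.py may window the text differently.

```python
def split_into_snippets(text, snippet_len=100, overlap=0):
    """Split an article's text into snippets of `snippet_len` words.

    Hypothetical illustration of passage chunking; not the actual
    wiki_snippets.py implementation.
    """
    words = text.split()
    step = snippet_len - overlap
    snippets = []
    for start in range(0, len(words), step):
        chunk = words[start:start + snippet_len]
        if chunk:
            snippets.append(" ".join(chunk))
    return snippets

# A 250-word article yields three snippets: 100 + 100 + 50 words.
article = " ".join(f"word{i}" for i in range(250))
snippets = split_into_snippets(article, snippet_len=100)
print(len(snippets))  # 3
```

Running a split like this over every article in a multi-gigabyte dump is far more work than simply reading already-formatted examples, which would explain the gap between 4 minutes and 3 hours.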
