I am new here and I am trying to build a long-form question answering system. I decided to follow Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb to get an idea of where to begin. I had already loaded wiki40b_en, and decided to load the wiki_snippets version to play around with before choosing what to finally use.

wiki40b_en took 4 minutes to generate 9.42 GB of data, but the wiki_snippets version is estimated to take 3 hours for some reason.

I would like to know what causes this. Might it be that the wiki_snippets are actually computed by load_dataset from wiki40b_en, and that the processing done by wiki_snippets is what takes up the majority of the 3 hours?
Hi! You can check the code that loads this dataset here: wiki_snippets.py · wiki_snippets at main

There is some data manipulation/processing involved, which can explain why it's slower than datasets that are already well formatted.
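Iterating over every article and re-chunking it into short passages at load time is the kind of work that dominates here. As an illustration only (this is not the actual wiki_snippets code; the function name, window size, and overlap parameter are assumptions), a snippet builder conceptually does something like:

```python
def make_snippets(text, words_per_snippet=100, overlap=0):
    """Split `text` into consecutive snippets of `words_per_snippet` words.

    Illustrative sketch of load-time chunking; not the real
    wiki_snippets implementation.
    """
    words = text.split()
    step = words_per_snippet - overlap  # how far the window advances
    snippets = []
    for start in range(0, len(words), step):
        window = words[start:start + words_per_snippet]
        if window:
            snippets.append(" ".join(window))
    return snippets


# A 250-word "article" yields windows of 100, 100, and 50 words.
article = " ".join(f"word{i}" for i in range(250))
snips = make_snippets(article, words_per_snippet=100)
print(len(snips))  # → 3
```

Running a pass like this over millions of articles, on top of downloading and parsing the underlying wiki40b_en data, is why generating the snippets dataset takes so much longer than loading a dataset that is already stored in its final shape.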