Hello everyone,
I am new here and I am trying to build a long-form question answering system, and I decided to follow Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb to get an idea of where to begin. I had loaded wiki40b_en prior to this and decided to load the wiki_snippets version to play around with before deciding what to finally use.
wiki40b_en took 4 minutes to generate 9.42 GB of data, but the wiki_snippets version is estimated to take 3 hours for some reason.
I would like to know what the cause of this is. Could it be that the wiki_snippets are actually computed at load time from wiki40b_en, and that the processing from wiki40b_en to wiki_snippets is what takes up the majority of the 3 hours?
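For reference, this is roughly what I am running (the wiki_snippets config name here, wiki40b_en_100_0, is the one from the notebook, so my exact call may differ slightly):

```python
from datasets import load_dataset

# This finished in ~4 min and produced 9.42 GB of data.
wiki40b = load_dataset("wiki40b", name="en", split="train")

# This one shows an estimated generation time of ~3 h.
snippets = load_dataset("wiki_snippets", name="wiki40b_en_100_0", split="train")
```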
Hi! You can check the code that loads this dataset here: wiki_snippets.py · wiki_snippets at main
There seems to be some data manipulation/processing involved that can explain why it’s slower than other datasets that are already well-formatted.
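Essentially, generating the snippets means walking over every article and slicing it into fixed-size passages (if I read the config name right, the 100_0 in wiki40b_en_100_0 is the passage length in words and the overlap). A very simplified sketch of that kind of windowing, just to illustrate where the time goes (not the actual loader code):

```python
def make_snippets(words, passage_len=100, overlap=0):
    # Slide a fixed-size window over the article's words;
    # each window becomes one snippet.
    step = passage_len - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        yield " ".join(words[start:start + passage_len])

# Running something like this in Python over every article is much
# more work than just converting data that is already in its final shape.
passages = list(make_snippets("some long article text".split()))
```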